Currently the exception handling code of feed_writer() assumes
consume_end_of_stream() doesn't throw. This is false and an exception
from said method can currently lead to an unclean destroy of the writer
and reader. Fix by also handling exceptions from
consume_end_of_stream() too.
Closes#10147
(cherry picked from commit 1963d1cc25)
If the Docker startup script is passed both "--alternator-port" and
"--alternator-https-port", a combination which is supposed to be
allowed, it passes to Scylla the "--alternator-address" option twice.
This isn't necessary, and worse - not allowed.
So this patch fixes the scyllasetup.py script to only pass this
parameter once.
Fixes#10016.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220202165814.1700047-1-nyh@scylladb.com>
(cherry picked from commit cb6630040d)
We found that monitor mode of mdadm does not work on RAID0, and it is
not a bug, expected behavior according to RHEL developer.
Therefore, we should stop enabling mdmonitor when RAID0 is specified.
Fixes#9540
----
This reverts 0d8f932 and introduce correct fix.
Closes#9970
* github.com:scylladb/scylla:
scylla_raid_setup: use mdmonitor only when RAID level > 0
Revert "scylla_raid_setup: workaround for mdmonitor.service issue on CentOS8"
(cherry picked from commit df22396a34)
The "Authorization" HTTP header is used in DynamoDB API to sign
requests. Our parser for this header, in server::verify_signature(),
required the different components of this header to be separated by
a comma followed by a whitespace - but it turns out that in DynamoDB
both spaces and commas are optional - one of them is enough.
At least one DynamoDB client library - the old "boto" (which predated
boto3) - builds this header without spaces.
In this patch we add a test that shows that an Authorization header
with spaces removed works fine in DynamoDB but didn't work in
Alternator, and after this patch modifies the parsing code for this
header, the test begins to pass (and the other tests show that the
previously-working cases didn't break).
Fixes#9568
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211101214114.35693-1-nyh@scylladb.com>
(cherry picked from commit 56eb994d8f)
Although the DynamoDB API responses are JSON, additional conventions apply
to these responses - such as how error codes are encoded in JSON. For this
reason, DynamoDB uses the content type `application/x-amz-json-1.0` instead
of the standard `application/json` in its responses.
Until this patch, Scylla used `application/json` in its responses. This
unexpected content-type didn't bother any of the AWS libraries which we
tested, but it does bother the aiodynamo library (see HENNGE/aiodynamo#27).
Moreover, we should return the x-amz-json-1.0 content type for future
proofing: It turns out that AWS already defined x-amz-json-1.1 - see:
https://awslabs.github.io/smithy/1.0/spec/aws/aws-json-1_1-protocol.html
The 1.1 content type differs (only) in how it encodes error replies.
If one day DynamoDB starts to use this new reply format (it doesn't yet)
and if DynamoDB libraries will need to differenciate between the two
reply formats, Alternator better return the right one.
This patch also includes a new test that the Content-Type header is
returned with the expected value. The test passes on DynamoDB, and
after this patch it starts to pass on Alternator as well.
Fixes#9554.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211031094621.1193387-1-nyh@scylladb.com>
(cherry picked from commit 6ae0ea0c48)
On CentOS8, mdmonitor.service does not works correctly when using
mdadm-4.1-15.el8.x86_64 and later versions.
Until we find a solution, let's pinning the package version to older one
which does not cause the issue (4.1-14.el8.x86_64).
Fixes#9540Closes#9782
(cherry picked from commit 0d8f932f0b)
This reverts commit 146f7b5421. It
causes a regression, and needs an additional fix. The bug is not
important enough to merit this complication.
Ref #9311.
The shard reader can outlive its parent reader (the multishard reader).
This creates a problem for lifecycle management: readers take the range
and slice parameters by reference and users keep these alive until the
reader is alive. The shard reader outliving the top-level reader means
that any background read-ahead that it has to wait on will potentially
have stale references to the range and the slice. This was seen in the
wild recently when the evictable reader wrapped by the shard reader hit
a use-after-free while wrapping up a background read-ahead.
This problem was solved by fa43d76 but any previous versions are
susceptible to it.
This patch solves this problem by having the shard reader copy and keep
the range and slice parameters in stable storage, before passing them
further down.
Fixes: #9719
Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20211202113910.484591-1-bdenes@scylladb.com>
(cherry picked from commit 417e853b9b)
Unfortunately, defining metrics in Scylla requires some code
duplication, with the metrics declared in one place but exported in a
different place in the code. When we duplicated this code in Alternator,
we accidentally dropped the first metric - for BatchGetItem. The metric
was accounted in the code, but not exported to Prometheus.
In addition to fixing the missing metric, this patch also adds a test
that confirms that the BatchGetItem metric increases when the
BatchGetItem operation is used. This test failed before this patch, and
passes with it. The test only currently tests this for BatchGetItem
(and BatchWriteItem) but it can be later expanded to cover all the other
operations as well.
Fixes#9406
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210929121611.373074-1-nyh@scylladb.com>
(cherry picked from commit 5cbe9178fd)
In order to avoid race condition introduced in 9dce1e4 the
index_reader should be closed prior to it's destruction.
This only exposes 4.4 and earlier releases to this specific race.
However, it is always a good idea to first close the index reader
and only then destroy it since it is most likely to be assumed by
all developers that will change the reader index in the future.
Ref #9704 (because on 4.4 and earlier releases are vulnerable).
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Fixes#9704
(cherry picked from commit ddd7248b3b)
Closes#9717
We were silently ignoring INSERTs with NULL values for primary-key
columns, which Cassandra rejects. Fix it by rejecting any
modification_statement that would operate on empty partition or
clustering range.
This is the most direct fix, because range and slice are calculated in
one place for all modification statements. It covers not only NULL
cases, but also impossible restrictions like c>0 AND c<0.
Unfortunately, Cassandra doesn't treat all modification statements
consistently, so this fix cannot fully match its behavior. We err on
the side of tolerance, accepting some DELETE statements that Cassandra
rejects. We add a TODO for rejecting such DELETEs later.
Fixes#7852.
Tests: unit (dev), cql-pytest against Cassandra 4.0
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes#9286
(cherry picked from commit 1fdaeca7d0)
As reproduced in cql-pytest/test_json.py and reported in issue #7911,
failing fromJson() calls should return a FUNCTION_FAILURE error, but
currently produce a generic SERVER_ERROR, which can lead the client
to think the server experienced some unknown internal error and the
query can be retried on another server.
This patch adds a new cassandra_exception subclass that we were missing -
function_execution_exception - properly formats this error message (as
described in the CQL protocol documentation), and uses this exception
in two cases:
1. Parse errors in fromJson()'s parameters are converted into a
function_execution_exception.
2. Any exceptions during the execute() of a native_scalar_function_for
function is converted into a function_execution_exception.
In particular, fromJson() uses a native_scalar_function_for.
Note, however, that functions which already took care to produce
a specific Cassandra error, this error is passed through and not
converted to a function_execution_exception. An example is
the blobAsText() which can return an invalid_request error, so
it is left as such and not converted. This also happens in Cassandra.
All relevant tests in cql-pytest/test_json.py now pass, and are
no longer marked xfail. This patch also includes a few more improvements
to test_json.py.
Fixes#7911
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210118140114.4149997-1-nyh@scylladb.com>
(cherry picked from commit 702b1b97bf)
Before this patch when writing an index block, the sstables writer was
storing range tombstones that span the boundary of the block in order
of end bounds. This led to a range tombstone being ignored by a reader
if there was a row tombstone inside it.
This patch sorts the range tombstones based on start bound before
writing them to the index file.
The assumption is that writing an index block is rare so we can afford
sortting the tombstones at that point. Additionally this is a writer of
an old format and writing to it will be dropped in the next major
release so it should be rarely used already.
Kudos to Kamil Braun <kbraun@scylladb.com> for finding the reproducer.
Test: unit(dev)
Fixes#9690
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
(cherry picked from commit scylladb/scylla-enterprise@eb093afd6f)
(cherry picked from commit ab425a11a8)
Indexed queries are using paging over the materialized view
table. Results of the view read are then used to issue reads of the
base table. If base table reads are short reads, the page is returned
to the user and paging state is adjusted accordingly so that when
paging is resumed it will query the view starting from the row
corresponding to the next row in the base which was not yet
returned. However, paging state's "remaining" count was not reset, so
if the view read was exhausted the reading will stop even though the
base table read was short.
Fix by restoring the "remaining" count when adjusting the paging state
on short read.
Tests:
- index_with_paging_test
- secondary_index_test
Fixes#9198
Message-Id: <20210818131840.1160267-1-tgrabiec@scylladb.com>
(cherry picked from commit 1e4da2dcce)
We need stopwaitsecs just like we do TimeoutStpSec=900 on
scylla-server.service, to avoid timeout on scylla-server shutdown.
Fixes#9485Closes#9545
(cherry picked from commit c9499230c3)
In commit 11a8912093 (gossiper:
get_gossip_status: return string_view and make noexcept)
get_gossip_status returns a pointer to an endpoint_state in
endpoint_state_map.
After commit 425e3b1182 (gossip: Introduce
direct failure detector), gossiper::mark_dead and gossiper::real_mark_alive
can yield in the middle of the function. It is possible that
endpoint_state can be removed, causing use-after-free to access it.
To fix, make a copy before we yield.
Fixes#8859Closes#8862
(cherry picked from commit 7a32cab524)
GKE metadata server does not provide same metadata as GCE, we should not
return True on is_gce().
So try to fetch machine-type from metadata server, return False if it
404 not found.
Fixes#9471
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Closes#9582
(cherry picked from commit 9b4cf8c532)
There are two APIs for checking the repair status and they behave
differently in case the id is not found.
```
{"host": "192.168.100.11:10001", "method": "GET", "uri":
"/storage_service/repair_async/system_auth?id=999", "duration": "1ms",
"status": 400, "bytes": 49, "dump": "HTTP/1.1 400 Bad
Request\r\nContent-Length: 49\r\nContent-Type: application/json\r\nDate:
Wed, 03 Nov 2021 10:49:33 GMT\r\nServer: Seastar
httpd\r\n\r\n{\"message\": \"unknown repair id 999\", \"code\": 400}"}
{"host": "192.168.100.11:10001", "method": "GET", "uri":
"/storage_service/repair_status?id=999&timeout=1", "duration": "0ms",
"status": 500, "bytes": 49, "dump": "HTTP/1.1 500 Internal Server
Error\r\nContent-Length: 49\r\nContent-Type: application/json\r\nDate:
Wed, 03 Nov 2021 10:49:33 GMT\r\nServer: Seastar
httpd\r\n\r\n{\"message\": \"unknown repair id 999\", \"code\": 500}"}
```
The correct status code is 400 as this is a parameter error and should
not be retried.
Returning status code 500 makes smarter http clients retry the request
in hopes of server recovering.
After this patch:
curl -X PGET
'http://127.0.0.1:10000/storage_service/repair_async/system_auth?id=9999'
{"message": "unknown repair id 9999", "code": 400}
curl -X GET
'http://127.0.0.1:10000/storage_service/repair_status?id=9999'
{"message": "unknown repair id 9999", "code": 400}
Fixes#9576Closes#9578
(cherry picked from commit f5f5714aa6)
Fixes#9103
compare overload was declared as "bool" even though it is a tri-cmp.
causes us to never use the speed-up shortcut (lessen search set),
in turn meaning more overhead for collections.
Closes#9104
(cherry picked from commit 59555fa363)
The copying and comparing utilities for FragmentedView are not prepared to
deal with empty fragments in non-empty views, and will fall into an infinite
loop in such case.
But data coming in result_row_view can contain such fragments, so we need to
fix that.
Fixes#8398.
Closes#8397
(cherry picked from commit f23a47e365)
node_exporter is packaged with some random uid/gid in the tarball.
When extracting it as an ordinary user this isn't a problem, since
the uid/gid are reset to the current user, but that doesn't happen
under dbuild since `tar` thinks the current user is root. This causes
a problem if one wants to delete the build directory later, since it
becomes owned by some random user (see /etc/subuid)
Reset the uid/gid infomation so this doesn't happen.
Closes#9579Fixes#9610.
(cherry picked from commit e1817b536f)
This patch fixes a bug in UpdateItem's ReturnValues=ALL_NEW, which in
some cases returned the OLD (pre-modification) value of some of the
attributes, instead of its NEW value.
The bug was caused by a confusion in our JSON utility function,
rjson::set(), which sounds like it can set any member of a map, but in
fact may only be used to add a *new* member - if a member with the same
name (key) already existed, the result is undefined (two values for the
same key). In ReturnValues=ALL_NEW we did exactly this: we started with
a copy of the original item, and then used set() to override some of the
members. This is not allowed.
So in this patch, we introduce a new function, rjson::replace(), which
does what we previously thought that rjson::set() does - i.e., replace a
member if it exists, or if not, add it. We call this function in
the ReturnValues=ALL_NEW code.
This patch also adds a test case that reproduces the incorrect ALL_NEW
results - and gets fixed by this patch.
In an upcoming patch, we should rename the confusingly-named set()
functions and audit all their uses. But we don't do this in this patch
yet. We just add some comments to clarify what set() does - but don't
change it, and just add one new function for replace().
Fixes#9542
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211104134937.40797-1-nyh@scylladb.com>
(cherry picked from commit b95e431228)
When a tuple value is serialized, we go through every element type and
use it to serialize element values. But an element type can be
reversed, which is artificially different from the type of the value
being read. This results in a server error due to the type mismatch.
Fix it by unreversing the element type prior to comparing it to the
value type.
Fixes#7902
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes#8316
(cherry picked from commit 318f773d81)
Consider the following procedure:
- n1, n2, n3
- n3 is down
- n1 runs nodetool removenode uuid_of_n3 to removenode from n3 the
cluster
- n1 is down in the middle of removenode operation
Node n1 will set n3 to removing gossip status during removenode
operation. Whenever existing nodes learn a node is in removing gossip
status, they will call restore_replica_count to stream data from other
nodes for the ranges n3 loses if n3 was removed from the cluster. If
the streaming fails, the streaming will sleep and retry. The current
max number of retry attempts is 5. The sleep interval starts at 60
seconds and increases 1.5 times per sleep.
This can leave the cluster in a bad state. For example, nodes can go
out of disk space if the streaming continues. We need a way to abort
such streaming attempts.
To abort the removenode operation and forcely remove the node, users
can run `nodetool removenode force` on any existing nodes to move the
node from removing gossip status to removed gossip status. However,
the restore_replica_count will not be aborted.
In this patch, a status checker is added in restore_replica_count, so
that once a node is in removed gossip status, restore_replica_count
will be aborted.
This patch is for older releases without the new NODE_OPS_CMD
infrastructure where such abort will happen automatically in case of
error.
Fixes#8651Closes#8655
(cherry picked from commit 0858619cba)
Although the sstable name is part of the system.large_* records,
it is not printed in the log.
In particular, this is essential for the "too many rows" warning
that currently does not record a row in any large_* table
so we can't correlate it with a sstable.
Fixes#9524
Test: unit(dev)
DTest: wide_rows_test.py
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211027074104.1753093-1-bhalevy@scylladb.com>
(cherry picked from commit a21b1fbb2f)
The everywhere_topology returns the number of nodes in the cluster as
RF. This makes only streaming from the node losing the range impossible
since no node is losing the range after bootstrap.
Shortcut to stream from all nodes in local dc in case the keyspace is
everywhere_topology.
Fixes#8503
(cherry picked from commit 3c36517598)
The sstable_list is destroyed right after the temporary
lw_shared_ptr<sstable_list> returned from `cf.get_sstables()`
is dereferenced.
Fixes#9138
Test: unit(dev)
DTest: resharding_test.py:ReshardingTombstones_with_DateTieredCompactionStrategy.disable_tombstone_removal_during_reshard_test (debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210804075813.42526-1-bhalevy@scylladb.com>
(cherry picked from commit 3ad0067272)
There were cases where a query on an indexed table
needed filtering but need_filtering returned false.
This is fixed by using new conditions in cases where
we are using an index.
Fixes#8991.
Fixes#7708.
For now this is an overly conservative implementation
that returns true in some cases where filtering
is not needed.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
(cherry picked from commit 54149242b4)
Currently, if the data_size is greater than
max_chunk_size - sizeof(chunk), we end up
allocating up to max_chunk_size + sizeof(chunk) bytes,
exceeding buf.max_chunk_size().
This may lead to allocation failures, as seen in
https://github.com/scylladb/scylla/issues/7950,
where we couldn't allocate 131088 (= 128K + 16) bytes.
This change adjusted the expose max_chunk_size()
to be max_alloc_size (128KB) - sizeof(chunk)
so that the allocated chunks would normally be allocated
in 128KB chunks in the write() path.
Added a unit test - test_large_placeholder that
stresses the chunk allocation path from the
write_place_holder(size) entry point to make
sure it handles large chunk allocations correctly.
Refs #7950
Refs #8081
Test: unit(release), bytes_ostream_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210303143413.902968-1-bhalevy@scylladb.com>
(cherry picked from commit ff5b42a0fa)
`_range_override` is used to store the modified range the reader reads
after it has to be recreated (when recreating a reader it's read range
is reduced to account for partitions it already read). When engaged,
this field overrides the `_pr` field as the definitive range the reader
is supposed to be currently reading. Fast forwarding conceptually
overrides the range the reader is currently reading, however currently
it doesn't reset the `_range_override` field. This resulted in
`_range_override` (containing the modified pre-fast-forward range)
incorrectly overriding the fast-forwarded-to range in `_pr` when
validating the first partition produced by the just recreated reader,
resulting in a false-positive validation failure.
Fixes: #8059
Tests: unit(release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210217164744.420100-1-bdenes@scylladb.com>
[avi: add #include]
(cherry picked from commit c3b4c3f451)
As a function returning a future, simplify
its interface by handling any exceptions and
returning an exceptional future instead of
propagating the exception.
In this specific case, throwing from advance_and_await()
will propagate through table::await_pending_* calls
short-circuiting a .finally clause in table::stop().
Also, mark as noexcept methods of class table calling
advance_and_await and table::await_pending_ops that depends on them.
Fixes#8636
A followup patch will convert advance_and_await to a coroutine.
This is done separately to facilitate backporting of this patch.
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210511161407.218402-1-bhalevy@scylladb.com>
(cherry picked from commit c0dafa75d9)
This series adds a wrapper for the default rjson allocator which throws on allocation/reallocation failures. It's done to work around several rapidjson (the underlying JSON parsing library) bugs - in a few cases, malloc/realloc return value is not checked, which results in dereferencing a null pointer (or an arbitrary pointer computed as 0 + `size`, with the `size` parameter being provided by the user). The new allocator will throw an `rjson:error` if it fails to allocate or reallocate memory.
This series comes with unit tests which checks the new allocator behavior and also validates that an internal rapidjson structure which we indirectly rely upon (Stack) is not left in invalid state after throwing. The last part is verified by the fact that its destructor ran without errors.
Fixes#8521
Refs #8515
Tests:
* unit(release)
* YCSB: inserting data similar to the one mentioned in #8515 - 1.5MB objects clustered in partitions 30k objects in size - nothing crashed during various YCSB workloads, but nothing also crashed for me locally before this patch, so it's not 100% robust
relevant YCSB workload config for using 1.5MB objects:
```yaml
fieldcount=150
fieldlength=10000
```
Closes#8529
* github.com:scylladb/scylla:
test: add a test for rjson allocation
test: rename alternator_base64_test to alternator_unit_test
rjson: add a throwing allocator
(cherry picked from commit c36549b22e)
Fixes#8749
if a table::clear() was issued while we were flushing a memtable,
the memtable is already gone from list. We need to check this before
erase. Otherwise we get random memory corruption via
std::vector::erase
v2:
* Make interface more set-like (tolerate non-existance in erase).
Closes#8904
(cherry picked from commit 373fa3fa07)
To building Ubuntu AMI with CPU scaling configuration, we need force
running mode for scylla_cpuscaling_setup, which run setup without
checking scaling_governor support.
See scylladb/scylla-machine-image#204
Closes#9326
(cherry picked from commit f928dced0c)
On Ubuntu, scaling_governor becomes powersave after rebooted, even we configured cpufrequtils.
This is because ondemand.service, it unconditionally change scaling_governor to ondemand or powersave.
cpufrequtils will start before ondemand.service, scaling_governor overwrite by ondemand.service.
To configure scaling_governor correctly, we have to disable this service.
Fixes#9324Closes#9325
(cherry picked from commit cd7fe9a998)
There will be unbounded growth of pending tasks if they are submitted
faster than retiring them. That can potentially happen if memtables
are frequently flushed too early. It was observed that this unbounded
growth caused task queue violations as the queue will be filled
with tons of tasks being reevaluated. By avoiding duplication in
pending task list for a given table T, growth is no longer unbounded
and consequently reevaluation is no longer aggressive.
Refs #9331.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210930125718.41243-1-raphaelsc@scylladb.com>
(cherry picked from commit 52302c3238)
The Red Hat packages were missing two things, first the metapackage
wasn't dependant at all in the python3 package and second, the
scylla-server package dependencies didn't contain a version as part
of the dependency which can cause to some problems during upgrade.
Doing both of the things listed here is a bit of an overkill as either
one of them separately would solve the problem described in #XXXX
but both should be applied in order to express the correct concept.
Fixes#8829Closes#8832
(cherry picked from commit 9bfb2754eb)
Fixes#8212
Some snapshotting operations call in on a single table at a time.
When checking for existing snapshots in this case, we should not
bother with snapshots in other tables. Add an optional "filter"
to check routine, which if non-empty includes tables to check.
Use case is "scrub" which calls with a limited set of tables
to snapshot.
Closes#8240
(cherry picked from commit f44420f2c9)
"
This mini-series fixes two loosely related bugs around reader recreation
in the evictable reader (related by both being around reader
recreation). A unit test is also added which reproduces both of them and
checks that the fixes indeed work. More details in the patches
themselves.
This series replaces the two independent patches sent before:
* [PATCH v1] evictable_reader: always reset static row drop flag
* [PATCH v1] evictable_reader: relax partition key check on reader
recreation
As they depend on each other, it is easier to add a test if they are in
a series.
Fixes: #8923Fixes: #8893
Tests: unit(dev, mutation_reader_test:debug)
"
* 'evictable-reader-recreation-more-bugs/v1' of https://github.com/denesb/scylla:
test: mutation_reader_test: add more test for reader recreation
evictable_reader: relax partition key check on reader recreation
evictable_reader: always reset static row drop flag
(cherry picked from commit 4209dfd753)
On some environment /sys/devices/system/cpu/cpufreq/policy0/scaling_governor
does not exist even it supported CPU scaling.
Instead, /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor is
avaliable on both environment, so we should switch to it.
Fixes#9191Closes#9193
(cherry picked from commit e5bb88b69a)
Compaction manager can start tons of compaction of fully expired sstable in
parallel, which may consume a significant amount of resources.
This problem is caused by weight being released too early in compaction, after
data is all compacted but before table is called to update its state, like
replacing sstables and so on.
Fully expired sstables aren't actually compacted, so the following can happen:
- compaction 1 starts for expired sst A with weight W, but there's nothing to
be compacted, so weight W is released, then calls table to update state.
- compaction 2 starts for expired sst B with weight W, but there's nothing to
be compacted, so weight W is released, then calls table to update state.
- compaction 3 starts for expired sst C with weight W, but there's nothing to
be compacted, so weight W is released, then calls table to update state.
- compaction 1 is done updating table state, so it finally completes and
releases all the resources.
- compaction 2 is done updating table state, so it finally completes and
releases all the resources.
- compaction 3 is done updating table state, so it finally completes and
releases all the resources.
This happens because, with expired sstable, compaction will release weight
faster than it will update table state, as there's nothing to be compacted.
With my reproducer, it's very easy to reach 50 parallel compactions on a single
shard, but that number can be easily worse depending on the amount of sstables
with fully expired data, across all tables. This high parallelism can happen
only with a couple of tables, if there are many time windows with expired data,
as they can be compacted in parallel.
Prior to 55a8b6e3c9, weight was released earlier in compaction, before
last sstable was sealed, but right now, there's no need to release weight
earlier. Weight can be released in a much simpler way, after the compaction is
actually done. So such compactions will be serialized from now on.
Fixes#8710.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210527165443.165198-1-raphaelsc@scylladb.com>
[avi: drop now unneeded storage_service_for_tests]
(cherry picked from commit a7cdd846da)
The recent commit 0ef0a4c78d added helpful
error messages in case an index cannot be created because the intended
name of its materialized view is already taken - but accidentally broke
the "CREATE INDEX IF NOT EXISTS" feature.
The checking code was correct, but in the wrong place: we need to first
check maybe the index already exists and "IF NOT EXISTS" was chosen -
and only do this new error checking if this is not the case.
This patch also includes a cql-pytest test for reproducing this bug.
The bug is also reproduced by the translated Cassandra unit tests
cassandra_tests/validation/entities/secondary_index_test.py::
testCreateAndDropIndex
and this is how I found this bug. After these patch, all these tests
pass.
Fixes#8717.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210526143635.624398-1-nyh@scylladb.com>
(cherry picked from commit 97e827e3e1)
When an index is created without an explicit name, a default name
is chosen. However, there was no check if a table with conflicting
name already exists. The check is now in place and if any conflicts
are found, a new index name is chosen instead.
When an index is created *with* an explicit name and a conflicting
regular table is found, index creation should simply fail.
This series comes with a test.
Fixes#8620
Tests: unit(release)
Closes#8632
* github.com:scylladb/scylla:
cql-pytest: add regression tests for index creation
cql3: fail to create an index if there is a name conflict
database: check for conflicting table names for indexes
(cherry picked from commit cee4c075d2)
Partition count is of a type size_t but we use std::plus<int>
to reduce values of partition count in various column families.
This patch changes the argument of std::plus to the right type.
Using std::plus<int> for size_t compiles but does not work as expected.
For example plus<int>(2147483648LL, 1LL) = -2147483647 while the code
would probably want 2147483649.
Fixes#9090
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Closes#9074
(cherry picked from commit 90a607e844)
The repair parallelism is calculated by the number of memory allocated to
repair and memory usage per repair instance. Currently, it does not
consider memory bloat issues (e.g., issue #8640) which cause repair to
use more memory and cause std::bad_alloc.
Be more conservative when calculating the parallelism to avoid repair
using too much memory.
Fixes#8641Closes#8652
(cherry picked from commit b8749f51cb)
To avoid restart scylla-server.service unexpectedly, drop BindsTo=
from scylla-fstrim.timer.
Fixes#8921Closes#8973
(cherry picked from commit def81807aa)
systemd unit name of scylla-node-exporter is
scylla-node-exporter.service, not node-exporter.service.
Fixes#8966Closes#8967
(cherry picked from commit f19ebe5709)
Listing /etc/systemd/system/*.mount as ghost file seems incorrect,
since user may want to keep using RAID volume / coredump directory after
uninstalling Scylla, or user may want to upgrade enterprise version.
Also, we mixed two types of files as ghost file, it should handle differently:
1. automatically generated by postinst scriptlet
2. generated by user invoked scylla_setup
The package should remove only 1, since 2 is generated by user decision.
However, just dropping .mount from %files section causes another
problem, rpm will remove these files during upgrade, instead of
uninstall (#8924).
To fix both problem, specify .mount files as "%ghost %config".
It will keep files both package upgrade and package remove.
See scylladb/scylla-enterprise#1780Closes#8810Closes#8924Closes#8959
(cherry picked from commit f71f9786c7)
Commit 5adb8e555c marked the ::feed_hash() and a visitor lambda of
digester::feed_hash() as noexcept. This was quite recklesl as the
appending_hash<>::operator()s called by ::feed_hash() are not all
marked noexcept. In particular, the appending_hash<row>() is not
such and seem to throw.
The original intent of the mentioned commit was to facilitate the
partition_hasher in repair/ code. The hasher itself had been removed
by the 0af7a22c21, so it no longer needs the feed_hash-s to be
noexcepts.
The fix is to inherit noexcept from the called hashers, but for the
digester::feed_hash part the noexcept is just removed until clang
compilation bug #50994 is fixed.
fixes: #8983
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210706153608.4299-1-xemul@scylladb.com>
(cherry picked from commit 63a2fed585)
Wrong comparison operator is used when checking for overlapping. It
would miss overlapping when last key of a sstable is equal to the first
key of another sstable that comes next in the set, which is sorted by
first key.
Fixes#8531.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 39ecddbd34)
CQL test relied on quietly acceptiong non-existing DCs, so it had
to be removed. Also, one boost-test referred to nonexisting
`datacenter2` and had to be removed.
(cherry picked from commit 97bb15b2f2)
Backport of 6726fe79b6.
The code was susceptible to use-after-move if both local
and remote updates were going to be sent.
The whole routine for sending view updates is now rewritten
to avoid use-after-move.
Fixes#8830
Tests: unit(release),
dtest(secondary_indexes_test.py:TestSecondaryIndexes.test_remove_node_during_index_build)
Closes#8834
* backport-6726fe7-4.4:
view: fix use-after-move when handling view update failures
db,view: explicitly move the mutation to its helper function
db,view: pass base token by value to mutate_MV
The code was susceptible to use-after-move if both local
and remote updates were going to be sent.
The whole routine for sending view updates is now rewritten
to avoid use-after-move.
Refs #8830
Tests: unit(release),
dtest(secondary_indexes_test.py:TestSecondaryIndexes.test_remove_node_during_index_build)
(cherry picked from commit 8a049c9116)
The `apply_to_remote_endpoints` helper function used to take
its `mut` parameter by reference, but then moved the value from it,
which is confusing and prone to errors. Since the value is moved-from,
let's pass it to the helper function as rvalue ref explicitly.
(cherry picked from commit 7cdbb7951a)
The base token is passed cross-continuations, so the current way
of passing it by const reference probably only works because the token
copying is cheap enough to optimize the reference out.
Fix by explicitly taking the token by value.
(cherry picked from commit 88d4a66e90)
LCS reshape is basically 'major compacting' level 0 until it contains less than
N sstables.
That produces terrible write amplification, because any given byte will be
compacted (initial # of sstables / max_threshold (32)) times. So if L0 initially
contained 256 ssts, there would be a WA of about 8.
This terrible write amplification can be reduced by performing STCS instead on
L0, which will leave L0 in a good shape without hurting WA as it happens
now.
Fixes#8345.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210322150655.27011-1-raphaelsc@scylladb.com>
(cherry picked from commit bcbb39999b)
Due to small value optimization used in `bytes`, views to `bytes` stored
in `vector` can be invalidated when the vector resizes, resulting in
use-after-free and data corruption. Fix that.
Fixes#8117
(cherry picked from commit 8cc4f39472)
This test checks that `mutation_partition::difference()` works correctly.
One of the checks it does is: m1 + m2 == m1 + (m2 - m1).
If the two mutations are identical but have compactable data, e.g. a
shadowable tombstone shadowed by a row marker, the apply will collapse
these, causing the above equality check to fail (as m2 - m1 is null).
To prevent this, compact the two input mutations.
Fixes: #8221
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210310141118.212538-1-bdenes@scylladb.com>
(cherry picked from commit cf28552357)
On the environment hard limit of coredump is set to zero, coredump test
script will fail since the system does not generate coredump.
To avoid such issue, set ulimit -c 0 before generating SEGV on the script.
Note that scylla-server.service can generate coredump even ulimit -c 0
because we set LimitCORE=infinity on its systemd unit file.
Fixes#8238Closes#8245
(cherry picked from commit af8eae317b)
commitlog was changed to use fragmented_temporary_buffer::ostream (db::commitlog::output).
So if there are discontiguous small memory blocks, they can be used to satisfy
an allocation even if no contiguous memory blocks are available.
To prevent that, as Avi suggested, this change allocates in 128K blocks
and frees the last one to succeed (so that we won't fail on allocating continuations).
Fixes#8028
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210203100333.862036-1-bhalevy@scylladb.com>
(cherry picked from commit ca6f5cb0bc)
When querying an index table, we assemble clustering-column
restrictions for that query by going over the base table token,
partition columns, and clustering columns. But if one of those
columns is the indexed column, there is a problem; the indexed column
is the index table's partition key, not clustering key. We end up
with invalid clustering slice, which can cause problems downstream.
Fix this by skipping the indexed column when assembling the clustering
restrictions.
Tests: unit (dev)
Fixes#7888
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes#8320
(cherry picked from commit 0bd201d3ca)
This is a follow up change to #8512.
Let's add aio conf file during scylla installation process and make sure
we also remove this file when uninstall Scylla
As per Avi Kivity's suggestion, let's set aio value as static
configuration, and make it large enough to work with 500 cpus.
Closes#8650
Refs: #8713
(cherry picked from commit dd453ffe6a)
On severl instance types in AWS and Azure, we get the following failure
during scylla_io_setup process:
```
ERROR 2021-04-14 07:50:35,666 [shard 5] seastar - Could not setup Async
I/O: Resource temporarily unavailable. The most common cause is not
enough request capacity in /proc/sys/fs/aio-max-nr. Try increasing that
number or reducing the amount of logical CPUs available for your
application
```
We have scylla_prepare:configure_io_slots() running before the
scylla-server.service start, but the scylla_io_setup is taking place
before
1) Let's move configure_io_slots() to scylla_util.py since both
scylla_io_setup and scylla_prepare are import functions from it
2) cleanup scylla_prepare since we don't need the same function twice
3) Let's use configure_io_slots() during scylla_io_setup to avoid such
failure
Fixes: #8587Closes#8512
Refs: #8713
(cherry picked from commit 588a065304)
* tools/java 6ca351c221...aab793d9f5 (2):
> nodetool: alternate way to specify table name which includes a dot
> nodetool: do no treat table name with dot as a secondary index
Fixes#6521
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In issue #5021 we noticed that the equality check in Alternator's condition
expressions needs to handle sets differently - we need to compare the set's
elements ignoring their order. But the implementation we added to fix that
issue was only correct when the entire attribute was a set... In the
general case, an attribute can be a nested document, with only some
inner set. The equality-checking function needs to tranverse this nested
document, and compare the sets inside it as appropriate. This is what
we do in this patch.
This patch also adds a new test comparing equality of a nested document with
some inner sets. This test passes on DynamoDB, failed on Alternator before
this patch, and passes with this patch.
Refs #5021Fixes#8514
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210419184840.471858-1-nyh@scylladb.com>
(cherry picked from commit dae7528fe5)
In issue #5021 we noted that Alternator's equality operator needs to be
fixed for the case of comparing two sets, because the equality check needs
to take into account the possibility of different element order.
Unfortunately, we fixed only the equality check operator, but forgot there
is also an inequality operator!
So in this patch we fix the inequality operator, and also add a test for
it that was previously missing.
The implementation of the inequality operator is trivial - it's just the
negation of the equality test. Our pre-existing tests verify that this is
the correct implementation (e.g., if attribute x doesn't exist, then "x = 3"
is false but "x <> 3" is true).
Refs #5021Fixes#8513
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210419141450.464968-1-nyh@scylladb.com>
(cherry picked from commit 50f3201ee2)
When a condition expression (ConditionExpression, FilterExpression, etc.)
checks for equality of two item attributes, i.e., "x = y", and when one of
these attributes was missing we correctly returned false.
However, we also need to return false when *both* attributes are missing in
the item, because this is what DynamoDB does in this case. In other words
an unset attribute is never equal to anything - not even to another unset
attribute. This was not happening before this patch:
When x and y were both missing attributes, Alternator incorrectly returned
true for "x = y", and this patch fixes this case. It also fixes "x <> y"
which should to be true when both x and y are unset (but was false
before this patch).
The other comparison operators - <, <=, >, >=, BETWEEN, were all
implemented correctly even before this patch.
This patch also includes tests for all the two-unset-attribute cases of
all the operators listed above. As usual, we check that these tests pass
on both DynamoDB and Alternator to confirm our new behavior is the correct
one - before this patch, two of the new tests failed on Alternator and
passed on DynamoDB.
Fixes#8511
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210419123911.462579-1-nyh@scylladb.com>
(cherry picked from commit 46448b0983)
Currently, var-lib-scylla.mount may fails because it can start before
MDRAID volume initialized.
We may able to add "After=dev-disk-by\x2duuid-<uuid>.device" to wait for
device become available, but systemd manual says it automatically
configure dependency for mount unit when we specify filesystem path by
"absolute path of a device node".
So we need to replace What=UUID=<uuid> to What=/dev/disk/by-uuid/<uuid>.
Fixes#8279Closes#8681
(cherry picked from commit 3d307919c3)
mp_row_consumer will not stop consuming large run of partition
tombstones, until a live row is found which will allow the consumer
to stop proceeding. So partition tombstones, from a large run, are
all accumulated in memory, leading to OOM and stalls.
The fix is about stopping the consumer if buffer is full, to allow
the produced fragments to be consumed by sstable writer.
Fixes#8071.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210514202640.346594-1-raphaelsc@scylladb.com>
Upstream fix: db4b9215dd
Currently, unified installer does not apply correct file security context
while copying files, it causes permission error on scylla-server.service.
We should apply default file security context while copying files, using
'-Z' option on /usr/bin/install.
Also, because install -Z requires normalized path to apply correct security
context, use 'realpath -m <PATH>' on path variables on the script.
Fixes#8589Closes#8602
(cherry picked from commit 60c0b37a4c)
Since we have added scylla-node-exporter, we needed to do 'install -d'
for systemd directory and sysconfig directory before copying files.
Fixes#8663Closes#8664
(cherry picked from commit 6faa8b97ec)
When recreating the paging state from an indexed query,
a bunch of panic checks were introduced to make sure that
the code is correct. However, one of the checks is too eager -
namely, it throws an error if the base column type is not equal
to the view column type. It usually works correctly, unless the
base column type is a clustering key with DESC clustering order,
in which case the type is actually "reversed". From the point of view
of the paging state generation it's not important, because both
types deserialize in the same way, so the check should be less
strict and allow the base type to be reversed.
Tests: unit(release), along with the additional test case
introduced in this series; the test also passes
on Cassandra
Fixes#8666Closes#8667
* github.com:scylladb/scylla:
test: add a test case for paging with desc clustering order
cql3: relax a type check for index paging
(cherry picked from commit 593ad4de1e)
Fedora version of systemd macros does not work correctly on CentOS7,
since CentOS7 does not support "file trigger" feature.
To fix the issue we need to stop using systemd macros, call systemctl
directly.
See scylladb/scylla-jmx#94
Closes#8005
(cherry picked from commit 7b310c591e)
run_custom_job() was swallowing all exceptions, which is definitely
wrong because failure in a resharding or reshape would be incorrectly
interpreted as success, which means upper layer will continue as if
everything is ok. For example, ignoring a failure in resharding could
result in a shared sstable being left unresharded, so when that sstable
reaches a table, scylla would abort as shared ssts are no longer
accepted in the main sstable set.
Let's allow the exception to be propagated, so failure will be
communicated, and resharding and reshape will be all or nothing, as
originally intended.
Fixes#8657.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210515015721.384667-1-raphaelsc@scylladb.com>
(cherry picked from commit 10ae77966c)
Compaction manager allows compaction of different weights to proceed in
parallel. For example, a small-sized compaction job can happen in parallel to a
large-sized one, but similar-sized jobs are serialized.
The problem is the current definition of weight, which is the log (base 4) of
total size (size of all sstables) of a job.
This is what we get with the current weight definition:
weight=5 for sizes=[1K, 3K]
weight=6 for sizes=[4K, 15K]
weight=7 for sizes=[16K, 63K]
weight=8 for sizes=[64K, 255K]
weight=9 for sizes=[258K, 1019K]
weight=10 for sizes=[1M, 3M]
weight=11 for sizes=[4M, 15M]
weight=12 for sizes=[16M, 63M]
weight=13 for sizes=[64M, 254M]
weight=14 for sizes=[256M, 1022M]
weight=15 for sizes=[1033M, 4078M]
weight=16 for sizes=[4119M, 10188M]
total weights: 12
Note that for jobs smaller than 1MB, we have 5 different weights, meaning 5
jobs smaller than 1MB could proceed in parallel. High number of parallel
compactions can be observed after repair, which potentially produces tons of
small sstables of varying sizes. That causes compaction to use a significant
amount of resources.
To fix this problem, let's add a fixed tax to the size before taking the log,
so that jobs smaller than 1M will all have the same weight.
Look at what we get with the new weight definition:
weight=10 for sizes=[1K, 2M]
weight=11 for sizes=[3M, 14M]
weight=12 for sizes=[15M, 62M]
weight=13 for sizes=[63M, 254M]
weight=14 for sizes=[256M, 1022M]
weight=15 for sizes=[1033M, 4078M]
weight=16 for sizes=[4119M, 10188M]
total weights: 7
Fixes#8124.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210217123022.241724-1-raphaelsc@scylladb.com>
(cherry picked from commit 81d773e5d8)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210512224405.68925-1-raphaelsc@scylladb.com>
The timestamp_type is an int64_t. So, it has to be explicitly
initialized before using it.
This missing inicialization prevented the major compactation
from happening when a time window finishes, as described in #8569.
Fixes#8569
Signed-off-by: Lauro Ramos Venancio <lauro.venancio@incognia.com>
Closes#8590
(cherry picked from commit 15f72f7c9e)
This is a backport of 8aaa3a7 to branch-4.4. The main conflicts were around Benny's reader close series (fa43d76), but it also turned out that an additional patch (2f1d65c) also has to backported to make sure admission on signaling resources doesn't deadlock.
Refs: #8493Closes#8571
* github.com:scylladb/scylla:
test: mutation_reader_test: add test_reader_concurrency_semaphore_forward_progress
test: mutation_reader_test: add test_reader_concurrency_semaphore_readmission_preserves_units
reader_concurrency_semaphore: add dump_diagnostics()
reader_permit: always forward resources
test: multishard_mutation_query_test: fuzzy-test: don't consume resource up-front
reader_concurrency_semaphore: make admission conditions consistent
This unit test checks that the semaphore doesn't get into a deadlock
when contended, in the presence of many memory-only reads (that don't
wait for admission). This is tested by simulating the 3 kind of reads we
currently have in the system:
* memory-only: reads that don't pass admission and only own memory.
* admitted: reads that pass admission.
* evictable: admitted reads that are furthermore evictable.
The test creates and runs a large number of these reads in parallel,
read kinds being selected randomly, then creates a watchdog which
kills the test if no progress is being made.
(cherry picked from commit 45d580f056)
This unit test passes a read through admission again-and-again, just
like an evictable reader would be during its lifetime. When readmitted
the read sometimes has to wait and sometimes not. This is to check that
the readmitting a previously admitted reader doesn't leak any units.
(cherry picked from commit cadc26de38)
Allow semaphore related tests to include a diagnostics printout in error
messages to help determine why the test failed.
(cherry picked from commit d246e2df0a)
This commit conceptually reverts 4c8ab10. Said commit was meant to
prevent the scenario where memory-only permits -- those that don't pass
admission but still consume memory -- completely prevent the admission
of reads, possibly even causing a deadlock because a permit might even
blocks its own admission. The protection introduced by said commit
however proved to be very problematic. It made the status of resources
on the permit very hard to reason about and created loopholes via which
permits could accumulate without tracking or they could even leak
resources. Instead of continuing to patch this broken system, this
commit does away with this "protection" based on the observation that
deadlocks are now prevented anyway by the admission criteria introduced
by 0fe75571d9, which admits a read anyway when all the initial count
resources are available (meaning no admitted reader is alive),
regardless of availability of memory.
The benefits of this revert is that the semaphore now knows about all
the resources and is able to do its job better as it is not "lied to"
about resource by the permits. Furthermore the status of a permit's
resources is much simpler to reason about, there are no more loopholes
in unexpected state transitions to swallow/leak resources.
To prove that this revert is indeed safe, in the next commit we add
robust tests that stress test admission on a highly contested semaphore.
This patch also does away with the registered/admitted differentiation
of permits, as this doesn't make much sense anymore, instead these two
are unified into a single "active" state. One can always tell whether a
permit was admitted or not from whether it owns count resources anyway.
(cherry picked from commit caaa8ef59a)
The fuzzy test consumes a large chunk of resource from the semaphore
up-front to simulate a contested semaphore. This isn't an accurate
simulation, because no permit will have more than 1 units in reality.
Furthermore this can even cause a deadlock since 8aaa3a7 as now we rely
on all count units being available to make forward progress when memory
is scarce.
This patch just cuts out this part of the test, we now have a dedicated
unit test for checking a heavily contested semaphore, that does it
properly, so no need to try to fix this clumsy attempt that is just
making trouble at this point.
Refs: #8493
Tests: release(multishard_mutation_query_test:fuzzy_test)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210429084458.40406-1-bdenes@scylladb.com>
(cherry picked from commit 26ae9555d1)
Currently there are two places where we check admission conditions:
`do_wait_admission()` and `signal()`. Both use `has_available_units()`
to check resource availability, but the former has some additional
resource related conditions on top (in `may_proceed()`), which lead to
the two paths working with slightly different conditions. To fix, push
down all resource availability related checks to `has_available_units()`
to ensure admission conditions are consistent across all paths.
(cherry picked from commit d90cd6402c)
Migration manager has a function to get a schema (for read or write),
this function queries a peer node and retrieves the schema from it. One
scenario where it can happen is if an old node, queries an old not fixed
index.
This makes a hole through which views that are only adjusted for reading
can slip through.
Here we plug the hole by fixing such views before they are registered.
Closes#8509
(cherry picked from commit 480a12d7b3)
Fixes#8554.
If any inactive read is left in the semaphore, it can block
`database::stop()` from shutting down, as sstables pinned by these reads
will prevent `sstables::sstables_manager::close()` from finishing. This
causes a deadlock.
It is not clear how inactive reads can be left in the semaphore, as all
users are supposed to clean up after themselves. Post 4.4 releases don't
have this problem anymore as the inactive read handle was made a RAII
object, removing the associated inactive read when destroyed. In 4.4 and
earlier release this wasn't so, so errors could be made. Normally this
is not a big issue, as these orphaned inactive reads are just evicted
when the resources they own are needed, but it does become a serious
issue during shutdown. To prevent a deadlock, clear the inactive reads
earlier, in `database::stop()` (currently they are cleared in the
destructor). This is a simple and foolproof way of ensuring any
leftover inactive reads don't cause problems.
Fixes: #8561
Tests: unit(dev)
Closes#8562
Current fs.aio-max-nr value cpu_count() * 11026 is exact size of scylla
uses, if other apps on the environment also try to use aio, aio slot
will be run out.
So increase value +65536 for other apps.
Related #8133Closes#8228
(cherry picked from commit 53c7600da8)
Current aio-max-nr is set up statically to 1048576 in
/etc/sysctl.d/99-scylla-aio.conf.
This is sufficient for most use cases, but falls short on larger machines
such as i3en.24xlarge on AWS that has 96 vCPUs.
We need to tune the parameter based on the number of cpus, instead of
static setting.
Fixes#8133
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Closes#8188
(cherry picked from commit d0297c599a)
This check is always true because a dummy entry is added at the end of
each cache entry. If that wasn't true, the check in else-if would be
an UB.
Refs #8435.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
(cherry picked from commit cb3dbb1a4b)
Make sure that when a partition does not exist in underlying,
do_fill_buffer does not try to fast forward withing this nonexistent
partition.
Test: unit(dev)
Fixes#8435Fixes#8411
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
(cherry picked from commit 1f644df09d)
This new state stores the information whether current partition
represented by _key is present in underlying.
Refs #8435.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
(cherry picked from commit ceab5f026d)
All callers pass false for its value so no need to keep it around.
Refs #8435.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
(cherry picked from commit b3b68dc662)
This was previously done in create_underlying but ensure_underlying is
a better place because we will add more related logic to this
consumption in the following patches.
Refs #8435.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
(cherry picked from commit 088a02aafd)
When a particular partition exists in at least one sstable, the cache
expects any single-partition query to this partition to return a `partition_start`
fragment, even if the result is empty.
In `time_series_sstable_set::create_single_key_sstable_reader` it could
happen that all sstables containing data for the given query get
filtered out and only sstables without the relevant partition are left,
resulting in a reader which immediately returns end-of-stream (while it
should return a `partition_start` and if not in forwarding mode, a
`partition_end`). This commit fixes that.
We do it by extending the reader queue (used by the clustering reader
merger) with a `dummy_reader` which will be returned by the queue as
the very first reader. This reader only emits a `partition_start` and,
if not in forwarding mode, a `partition_end` fragment.
Fixes#8447.
Closes#8448.
(cherry picked from commit 5c7ed7a83f)
The merger could return end-of-stream if some (but not all) of the
underlying readers were empty (i.e. not even returning a
`partition_start`). This could happen in places where it was used
(`time_series_sstable_set::create_single_key_sstable_reader`) if we
opened an sstable which did not have the queried partition but passed
all the filters (specifically, the bloom filter returned a false
positive for this sstable).
The commit also extends the random tests for the merger to include empty
readers and adds an explicit test case that catches this bug (in a
limited scope: when we merge a single empty reader).
It also modifies `test_twcs_single_key_reader_filtering` (regression
test for #8432) because the time where the clustering key filter is
invoked changes (some invocations move from the constructor of the
merger to operator()). I checked manually that it still catches the bug
when I reintroduce it.
Fixes#8445.
Closes#8446.
(cherry picked from commit 7ffb0d826b)
The filter passed to `min_position_reader_queue`, which was used by
`clustering_order_reader_merger`, would incorrectly include sstables as
soon as they passed through the PK (bloom) filter, and would include
sstables which didn't pass the PK filter (if they passed the CK
filter). Fortunately this wouldn't cause incorrect data to be returned,
but it would cause sstables to be opened unnecessarily (these sstables
would immediately return eof), resulting in a performance drop. This commit
fixes the filter and adds a regression test which uses statistics to
check how many times the CK filter was invoked.
Fixes#8432.
Closes#8433.
(cherry picked from commit 3687757115)
* seastar 2c884a7449...a75171fc89 (2):
> fair_queue: Preempted requests got re-queued too far
> fair_queue: Improve requests preemption while in pending state
Fixes#8296.
If a table that is not replicated to a certain DC (rf=0) is accessed
with LOCAL_QUORUM on that DC the current code will crash since the
'targets' array will be empty and read executor does not handle it.
Fix it by replying with empty result.
Fixes#8354
Message-Id: <YGro+l2En3fF80CO@scylladb.com>
(cherry picked from commit cd24dfc7e5)
[avi: re-added virtual keyword when backporting, since
4.4 and below don't have 020da49c89]
The `result_memory_accounter` terminates a query if it reaches either
the global or shard-local limit. This used to be so only for paged
queries, unpaged ones could grow indefinitely (until the node OOM'd).
This was changed in fea5067 which enforces the local limit on unpaged
queries as well, by aborting them. However a loophole remained in the
code: `result_memory_accounter::check_and_update()` has another stop
condition, besides `check_local_limit()`, it also checks the global
limit. This stop condition was not updated to enforce itself on unpaged
queries by aborting them, instead it silently terminated them, causing
them to return less data then requested. This was masked by most queries
reaching the local limit first.
This patch fixes this by aborting unpaged mutation queries when they hit
the global limit.
Fixes: #8162
Tests: unit(release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210226102202.51275-1-bdenes@scylladb.com>
(cherry picked from commit dd5a601aaa)
When shedding requests (e.g. due to their size or number exceeding the
limits), errors were returned right after parsing their headers, which
resulted in their bodies lingering in the socket. The server always
expects a correct request header when reading from the socket after the
processing of a single request is finished, so shedding the requests
should also take care of draining their bodies from the socket.
Fixes#8193Closes#8194
* github.com:scylladb/scylla:
cql-pytest: add a shedding test
transport: return error on correct stream during size shedding
transport: return error on correct stream during shedding
transport: skip the whole request if it is too large
transport: skip the whole request during shedding
(cherry picked from commit 0fea089b37)
This is a reworked submission of #7686 which has been reverted. This series
fixes some race conditions in MV/SI schema creation and load, we spotted some
places where a schema without a base table reference can sneak into the
registry. This can cause to an unrecoverable error since write commands with
those schemas can't be issued from other nodes. Most of those cases can occur on
2 main and uncommon cases, in a mixed cluster (during an upgrade) and in a small
window after a view or base table altering.
Fixes#7709Closes#8091
* github.com:scylladb/scylla:
database: Fix view schemas in place when loading
global_schema_ptr: add support for view's base table
materialized views: create view schemas with proper base table reference.
materialized views: Extract fix legacy schema into its own logic
(cherry picked from commit d473bc9b06)
Row marker has a cell name which sorts after the row tombstone's start
bound. The old code was writing the marker first, then the row
tombstone, which is incorrect.
This was harmeless to our sstable reader, which recognized both as
belonging to the current clustering row fragment, and collects both
fine.
However, if both atoms trigger creation of promoted index blocks, the
writer will create a promoted index with entries wich violate the cell
name ordering. It's very unlikely to run into in practice, since to
trigger promoted index entries for both atoms, the clustering key
would be so large so that the size of the marker cell exceeds the
desired promoted index block size, which is 64KB by default (but
user-controlled via column_index_size_in_kb option). 64KB is also the
limit on clustering key size accepted by the system.
This was caught by one of our unit tests:
sstable_conforms_to_mutation_source_test
...which runs a battery of mutation reader tests with various
desired promoted index block sizes, including the target size of 1
byte, which triggers an entry for every atom.
The test started to fail for some random seeds after commit ecb6abe
inside the
test_streamed_mutation_forwarding_is_consistent_with_slicing test
case, reporting a mutation mismatch in the following line:
assert_that(*sliced_m).is_equal_to(*fwd_m, slice_with_ranges.row_ranges(*m.schema(), m.key()));
It compares mutations read from the same sstable using different
methods, slicing using clustering key restricitons, and fast
forwarding. The reported mismatch was that fwd_m contained the row
marker, but sliced_m did not. The sstable does contain the marker, so
both reads should return it.
After reverting the commit which introduced dynamic adjustments, the
test passes, but both mutations are missing the marker, both are
wrong!
They are wrong because the promoted index contians entries whose
starting positions violate the ordering, so binary search gets confused
and selects the row tombstone's position, which is emitted after the
marker, thus skipping over the row marker.
The explanation for why the test started to fail after dynamic
adjustements is the following. The promoted index cursor works by
incrementally parsing buffers fed by the file input stream. It first
parses the whole block and then does a binary search within the parsed
array. The entries which cursor touches during binary search depend on
the size of the block read from the file. The commit which enabled
dynamic adjustements causes the block size to be different for
subsequent reads, which allows one of the reads to walk over the
corrupted entries and read the correct data by selecting the entry
corresponding to the row marker.
Fixes#8324
Message-Id: <20210322235812.1042137-1-tgrabiec@scylladb.com>
(cherry picked from commit 9272e74e8c)
"
_consumer_fut is expected to return an exception
on the abort path. Wait for it and drop any exception
so it won't be abandoned as seen in #7904.
A future<> close() method was added to return
_consumer_fut. It is called both after abort()
in the error path, and after consume_end_of_stream,
on the success path.
With that, consume_end_of_stream was made void
as it doesn't return a future<> anymore.
Fixes#7904
Test: unit(release)
"
* tag 'close-bucket-writer-v5' of github.com:bhalevy/scylla:
mutation_writer: bucket_writer: add close
mutation_writer/feed_writers: refactor bucket/shard writers
mutation_writer: update bucket/shard writers consume_end_of_stream
(cherry picked from commit f11a0700a8)
Currently, whole topology description for CDC is stored in a single row.
This means that for a large cluster of strong machines (say 100 nodes 64
cpus each), the size of the topology description can reach 32MB.
This causes multiple problems. First of all, there's a hard limit on
mutation size that can be written to Scylla. It's related to commit log
block size which is 16MB by default. Mutations bigger than that can't be
saved. Moreover, such big partitions/rows cause reactor stalls and
negatively influence latency of other requests.
This patch limits the size of topology description to about 4MB. This is
done by reducing the number of CDC streams per vnode and can lead to CDC
data not being fully colocated with Base Table data on shards. It can
impact performance and consistency of data.
This is just a quick fix to make it easily backportable. A full solution
to the problem is under development.
For more details see #7961, #7993 and #7985.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Closes#8048
* github.com:scylladb/scylla:
cdc: Limit size of topology description
cdc: Extract create_stream_ids from topology_description_generator
(cherry picked from commit c63e26e26f)
Currently we call firstNvmeSize before checking that we have enough
(at least 1) ephemeral disks. When none are found, we hit the following
error (see #7971):
```
File "/opt/scylladb/scripts/libexec/scylla_io_setup", line 239, in
if idata.is_recommended_instance():
File "/opt/scylladb/scripts/scylla_util.py", line 311, in is_recommended_instance
diskSize = self.firstNvmeSize
File "/opt/scylladb/scripts/scylla_util.py", line 291, in firstNvmeSize
firstDisk = ephemeral_disks[0]
IndexError: list index out of range
```
This change reverses the order and first checks that we found
enough disks before getting the fist disk size.
Fixes#7971
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#8027
(cherry picked from commit 55e3df8a72)
When failing to rebuild a node, we would print the error with the useless
explanation "<no exception>". The problem was a typo in the logging command
which used std::current_exception() - which wasn't relevant in that point -
instead of "ep".
Refs #8089
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210314113118.1690132-1-nyh@scylladb.com>
(cherry picked from commit d73934372d)
By default the boto3 library waits up to 60 second for a response,
and if got no response, it sends the same request again, multiple
times. We already noticed in the past that it retries too many times
thus slowing down failures, so in our test configuration lowered the
number of retries to 3, but the setting of 60-second-timeout plus
3 retries still causes two problems:
1. When the test machine and the build are extremely slow, and the
operation is long (usually, CreateTable or DeleteTable involving
multiple views), the 60 second timeout might not be enough.
2. If the timeout is reached, boto3 silently retries the same operation.
This retry may fail because the previous one really succeeded at
least partially! The symptom is tests which report an error when
creating a table which already exists, or deleting a table which
dooesn't exist.
The solution in this patch is first of all to never do retries - if
a query fails on internal server error, or times out, just report this
failure immediately. We don't expect to see transient errors during
local tests, so this is exactly the right behavior.
The second thing we do is to increase the default timeout. If 1 minute
was not enough, let's raise it to 5 minutes. 5 minutes should be enough
for every operation (famous last words...).
Even if 5 minutes is not enough for something, at least we'll now see
the timeout errors instead of some wierd errors caused by retrying an
operation which was already almost done.
Fixes#8135
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210222125630.1325011-1-nyh@scylladb.com>
(cherry picked from commit 0b2cf21932)
Relaxed mode, used during initialization, of reshape only tolerates min_threshold
(default: 4) L0 sstables. However, relaxed mode should tolerate more sstables in
level 0, otherwise boot will have to reshape level 0 every time it crosses the
min threshold. So let's make LCS reshape tolerate a max of max_threshold and 32.
This change is beneficial because once table is populated, LCS regular compaction
can decide to merge those sstables in level 0 into level 1 instead, therefore
reducing WA.
Refs #8297.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210318131442.17935-1-raphaelsc@scylladb.com>
(cherry picked from commit e53cedabb1)
Prior to 463d0ab, only one table could be cleaned up at a time on a given shard.
Since then, all tables belonging to a given keyspace are cleaned up in parallel.
Cleanup serialization on each shard was enforced with a semaphore, which was
incorrectly removed by the patch aforementioned.
So space requirement for cleanup to succeed can be up to the size of keyspace,
increasing the chances of node running out of space.
Node could also run out of memory if there are tons of tables in the keyspace.
Memory requirement is at least #_of_tables * 128k (not taking into account write
behind, etc). With 5k tables, it's ~0.64G per shard.
Also all tables being cleaned up in parallel will compete for the same
disk and cpu bandwidth, so making them all much slower, and consequently
the operation time is significantly higher.
This problem was detected with cleanup, but scrub and upgrade go through the
same rewrite procedure, so they're affected by exact the same problem.
Fixes#8247.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210312162223.149993-1-raphaelsc@scylladb.com>
(cherry picked from commit 7171244844)
Previously, we crashed when the IN marker is bound to null. Throw
invalid_request_exception instead.
This is a 4.4 backport of the #8265 fix.
Tests: unit (dev)
(cherry picked from commit 8db24fc03b)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes#8307Fixes#8265.
The test populates the cache, then invalidates it, then tries to push
huge (10x times the segment size) chunks into seastar memory hoping that
the invalid entries will be evicted. The exit condition on the last
stage is -- total memory of the region (sum of both -- used and free)
becomes less than the size of one chunk.
However, the condition is wrong, because cache usually contains a dummy
entry that's not necessarily on lru and on some test iteration it may
happen that
evictable size < chunk size < evictable size + dummy size
In this case test fails with bad_alloc being unable to evict the memory
from under the dummy.
fixes: #7959
tests: unit(row_cache_test), unit(the failing case with the triggering
seed from the issue + 200 times more with random seeds)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210309134138.28099-1-xemul@scylladb.com>
(cherry picked from commit 096e452db9)
in all expressions' from Nadav Har'El.
This series fixes#5024 - which is about adding support for nested attribute
paths (e.g., a.b.c[2]) to Alternator. The series adds complete support for this
feature in ProjectionExpression, ConditionExpression, FilterExpression and
UpdateExpression - and also its combination with ReturnValues. Many relevant
tests - and also some new tests added in this series - now pass.
The first patch in the series fixes#8043 a bug in some error cases in
conditions, which was discovered while working in this series, and is
conceptually separate from the rest of the series.
Closes#8066
* github.com:scylladb/scylla:
alternator: correct implemention of UpdateItem with nested attributes and ReturnValues
alternator: fix bug in ReturnValues=UPDATED_NEW
alternator: implemented nested attribute paths in UpdateExpression
alternator: limit the depth of nested paths
alternator: prepare for UpdateItem nested attribute paths
alternator: overhaul ProjectionExpression hierarchy implementation
alternator: make parsed::path object printable
alternator-test: a few more ProjectionExpression conflict test cases
alternator-test: improve tests for nested attributes in UpdateExpression
alternator: support attribute paths in ConditionExpression, FilterExpression
alternator-test: improve tests for nested attributes in ConditionExpression
alternator: support attribute paths in ProjectionExpression
alternator: overhaul attrs_to_get handling
alternator-test: additional tests for attribute paths in ProjectionExpression
alternator-test: harden attribute-path tests for ProjectionExpression
alternator: fix ValidationException in FilterExpression - and more
(cherry picked from commit cbbb7f08a0)
So it can be modified while walked to dispatch
subscribed event notifications.
In #8143, there is a race between scylla shutdown and
notify_down(), causing use-after-free of cql_server.
Using an atomic vector itstead and futurizing
unregister_subscriber allows deleting from _lifecycle_subscribers
while walked using atomic_vector::for_each.
Fixes#8143
Test: unit(release)
DTest:
update_cluster_layout_tests:TestUpdateClusterLayout.add_node_with_large_partition4_test(release)
materialized_views_test.py:TestMaterializedViews.double_node_failure_during_mv_insert_4_nodes_test(release)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210224164647.561493-2-bhalevy@scylladb.com>
(cherry picked from commit baf5d05631)
Move unregister_subscriber from the destructor to stop
as preparation for moving storage_service lifescyle_subscribers
to atomic_vector and futurizing unregister_subscriber.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210224164647.561493-1-bhalevy@scylladb.com>
(cherry picked from commit 1ed04affab)
Ref #8143.
Due to regression introduced by 463d0ab, regular can compact in parallel a sstable
being compacted by cleanup, scrub or upgrade.
This redundancy causes resources to be wasted, write amplification is increased
and so does the operation time, etc.
That's a potential source of data resurrection because the now-owned data from
a sstable being compacted by both cleanup and regular will still exist in the
node afterwards, so resurrection can happen if node regains ownership.
Fixes#8155.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210225172641.787022-1-raphaelsc@scylladb.com>
(cherry picked from commit 2cf0c4bbf1)
Includes fixup patch:
compaction_manager: Fix use-after-free in rewrite_sstables()
Use-after-free introduced by 2cf0c4bbf1.
That's because compacting is moved into then_wrapped() lambda, so it's
potentially freed on the next iteration of repeat().
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210309232940.433490-1-raphaelsc@scylladb.com>
(cherry picked from commit f7cc431477)
Currently, the rpc timeout error for the GOSSIP_GET_ENDPOINT_STATES verb
is not handled in gossiper::do_shadow_round. If the
GOSSIP_GET_ENDPOINT_STATES rpc call to any of the remote nodes goes
timeout, gossiper::do_shadow_round will throw an exception and fail the
whole boot up process.
It is fine that some of the remote nodes timeout in shadow round. It is
not a must to talk to all nodes.
This patch fixes an issue we saw recently in our sct tests:
```
INFO | scylla[1579]: [shard 0] init - Shutting down gossiping
INFO | scylla[1579]: [shard 0] gossip - gossip is already stopped
INFO | scylla[1579]: [shard 0] init - Shutting down gossiping was successful
...
ERR | scylla[1579]: [shard 0] init - Startup failed: seastar::rpc::timeout_error (rpc call timed out)
```
Fixes#8187Closes#8213
(cherry picked from commit dc40184faa)
Refs: #8012Fixes: #8210
With the update to CDC generation management, the way we retrieve and process these changed.
One very bad bug slipped through though; the code for getting versioned streams did not take into
account the late-in-pr change to make clustering of CDC gen timestamps reversed. So our alternator
shard info became quite rump-stumped, leading to more or less no data depending on when generations
changed w.r. data.
Also, the way we track the above timestamps changed, so we should utilize this for our end-of-iterator check.
Closes#8209
* github.com:scylladb/scylla:
alternator::streams: Use better method for generation timestamp
system_distributed_keyspace: Add better routine to get latest cdc gen. timestamp
system_distributed_keyspace: Fix cdc_get_versioned_streams timestamp range
(cherry picked from commit e12e57c915)
Until now, the lists of streams in the `cdc_streams_descriptions` table
for a given generation were stored in a single collection. This solution
has multiple problems when dealing with large clusters (which produce
large lists of streams):
1. large allocations
2. reactor stalls
3. mutations too large to even fit in commitlog segments
This commit changes the schema of the table as described in issue #7993.
The streams are grouped according to token ranges, each token range
being represented by a separate clustering row. Rows are inserted in
reasonably large batches for efficiency.
The table is renamed to enable easy upgrade. On upgrade, the latest CDC
generation's list of streams will be (re-)inserted into the new table.
Yet another table is added: one that contains only the generation
timestamps clustered in a single partition. This makes it easy for CDC
clients to learn about new generations. It also enables an elegant
two-phase insertion procedure of the generation description: first we
insert the streams; only after ensuring that a quorum of replicas
contains them, we insert the timestamp. Thus, if any client observes a
timestamp in the timestamps table (even using a ONE query),
it means that a quorum of replicas must contain the list of streams.
---
Nodes automatically ensure that the latest CDC generation's list of
streams is present in the streams description table. When a new
generation appears, we only need to update the table for this
generation; old generations are already inserted.
However, we've changed the description table (from
`cdc_streams_descriptions` to `cdc_streams_descriptions_v2`). The
existing mechanism only ensures that the latest generation appears in
the new description table. We add an additional procedure that
rewrites the older generations as well, if we find that it is necessary
to do so (i.e. when some CDC log tables may contain data in these
generations).
Closes#8116
* github.com:scylladb/scylla:
tests: add a simple CDC cql pytest
cdc: add config option to disable streams rewriting
cdc: rewrite streams to the new description table
cql3: query_processor: improve internal paged query API
cdc: introduce no_generation_data_exception exception type
docs: cdc: mention system.cdc_local table
cdc: coroutinize do_update_streams_description
sys_dist_ks: split CDC streams table partitions into clustered rows
cdc: use chunked_vector for streams in streams_version
cdc: remove `streams_version::expired` field
system_distributed_keyspace: use mutation API to insert CDC streams
storage_service: don't use `sys_dist_ks` before it is started
(cherry picked from commit f0950e023d)
On Ubuntu 20.04 AMI, scylla_raid_setup --raiddev /dev/md0 causes
'/dev/md0 is already using' (issue #7627).
So we merged the patch to find free mdX (587b909).
However, look into /proc/mdstat of the AMI, it actually says no active md device available:
ubuntu@ip-10-0-0-43:~$ cat /proc/mdstat
Personalities :
unused devices: <none>
We currently decide mdX is used when os.path.exists('/sys/block/mdX/md/array_state') == True,
but according to kernel doc, the file may available even array is STOPPED:
clear
No devices, no size, no level
Writing is equivalent to STOP_ARRAY ioctl
https://www.kernel.org/doc/html/v4.15/admin-guide/md.html
So we should also check array_state != 'clear', not just array_state
existance.
Fixes#8219Closes#8220
(cherry picked from commit 2d9feaacea)
expired sstables are skipped in the compaction setup phase, because they don't
need to be actually compacted, but rather only deleted at the end.
that is causing such sstables to not be removed from the backlog tracker,
meaning that backlog caused by expired sstables will not be removed even after
their deletion, which means shares will be higher than needed, making compaction
potentially more aggressive than it have to.
to fix this bug, let's manually register these sstables into the monitor,
such that they'll be removed from the tracker once compaction completes.
Fixes#6054.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210216203700.189362-1-raphaelsc@scylladb.com>
(cherry picked from commit 5206a97915)
TWCS reshape was silently ignoring windows which contain at least
min_threshold sstables (can happen with data segregation).
When resizing candidates, size of multi_window was incorrectly used and
it was always empty in this path, which means candidates was always
cleared.
Fixes#8147.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210224125322.637128-1-raphaelsc@scylladb.com>
(cherry picked from commit 21608bd677)
dh_installinit --name <service> is for forcing install debian/*.service
and debian/*.default that does not matches with package name.
And if we have subpackages, packager has responsibility to rename
debian/*.service to debian/<subpackage>.*service.
However, we currently mistakenly running
dh_installinit --name scylla-node-exporter for
debian/scylla-node-exporeter.service,
the packaging system tries to find destination package for the .service,
and does not find subpackage name on it, so it will pick first
subpackage ordered by name, scylla-conf.
To solve the issue, we just need to run dh_installinit without --name
when $product == 'scylla'.
Fixes#8163Closes#8164
(cherry picked from commit aabc67e386)
We currently deny running scylla_setup when umask != 0022.
To remove this limitation, run os.chmod(0o644) on every file creation
to allow reading from scylla user.
Note that perftune.yaml is not really needed to set 0644 since perftune.py is
running in root user, but setting it to align permission with other files.
Fixes#8049Closes#8119
(cherry picked from commit f3a82f4685)
* seastar 572536ef...74ae29bc (3):
> perftune.py: fix assignment after extend and add asserts
> scripts/perftune.py: convert nic option in old perftune.yaml to list for compatibility
> scripts/perftune.py: remove repeated items after merging options from file
Fixes#7968.
The schema used to create the sstable writer has to be the same as the
schema used by the reader, as the former is used to intrpret mutation
fragments produced by the reader.
Commit 9124a70 intorduced a deferring point between reader creation
and writer creation which can result in schema mismatch if there was a
concurrent alter.
This could lead to the sstable write to crash, or generate a corrupted
sstable.
Fixes#7994
Message-Id: <20210222153149.289308-1-tgrabiec@scylladb.com>
When psutil.disk_paritions() reports / is /dev/root, aws_instance mistakenly
reports root partition is part of ephemeral disks, and RAID construction will
fail.
This prevents the error and reports correct free disks.
Fixes#8055Closes#8040
(cherry picked from commit 32d4ec6b8a)
The first condition expressions we implemented in Alternator were the old
"Expected" syntax of conditional updates. That implementation had some
specific assumptions on how it handles errors: For example, in the "LT"
operator in "Expected", the second operand is always part of the query, so
an error in it (e.g., an unsupported type) resulted it a ValidationException
error.
When we implemented ConditionExpression and FilterExpression, we wrongly
used the same functions check_compare(), check_BETWEEN(), etc., to implement
them. This results in some inaccurate error handling. The worst example is
what happens when you use a FilterExpression with an expression such as
"x < y" - this filter is supposed to silently skip items whose "x" and "y"
attributes have unsupported or different types, but in our implementation
a bad type (e.g., a list) for y resulted in a ValidationException which
aborted the entire scan! Interestingly, in once case (that of BEGINS_WITH)
we actually noticed the slightly different behavior needed and implemented
the same operator twice - with ugly code duplication. But in other operators
we missed this problem completely.
This patch first adds extensive tests of how the different expressions
(Expected, QueryFilter, FilterExpression, ConditionExpression) and the
different operators handle various input errors - unsupported types,
missing items, incompatible types, etc. Importantly, the tests demonstrate
that there is often different behavior depending on whether the bad
input comes from the query, or from the item. Some of the new tests
fail before this patch, but others pass and were useful to verify that
the patch doesn't break anything that already worked correctly previously.
As usual, all the tests pass on Cassandra.
Finally, this patch *fixes* all these problems. The comparison functions
like check_compare() and check_BETWEEN() now not only take the operands,
they also take booleans saying if each of the operands came from the
query or from an item. The old-syntax caller (Expected or QueryFilter)
always say that the first operand is from the item and the second is
from the query - but in the new-syntax caller (ConditionExpression or
FilterExpression) any or all of the operands can come from the query
and need verification.
The old duplicated code for check_BEGINS_WITH() - which a TODO to remove
it - is finally removed. Instead we use the same idea of passing booleans
saying if each of its operands came from an item or from the query.
Fixes#8043
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 653610f4bc)
One of the USING TIMEOUT tests relied on a specific TTL value,
but that's fragile if the test runs on the boundary of 2 seconds.
Instead, the test case simply checks if the TTL value is present
and is greater than 0, which makes the test robust unless its execution
lasts for more than 1 million seconds, which is highly unlikely.
Fixes#8062Closes#8063
(cherry picked from commit 2aa4631148)
scylla_io_setup condition for nr_disks was using the bitwise operator
(&) instead of logical and operator (and) causing the io_properties
files to have incorrect values
Fixes#7341
Reviewed-by: Lubos Kosco <lubos@scylladb.com>
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
Closes#8019
(cherry picked from commit 718976e794)
node-exporter systemd unit name is "scylla-node-exporter.service", not
"node-exporter.service".
Fixes#8054Closes#8053
(cherry picked from commit 856fe12e13)
The test timeuuid_test.py::testTimeuuid sporadically failed, and it turns out
the reason was a bug in the test - which this patch fixes.
The buggy test created a timeuuid and then compared the time stored in it
to the result of the dateOf() CQL function. The problem is that dateOf()
returns a CQL "timestamp", which has millisecond resolution, while the
timeuuid *may* have finer than millisecond resolution. The reason why this
test rarely failed is that in our implementation, the timeuuid almost
always gets a millisecond-resolution timestamp. Only if now() gets called
more than once in one millisecond, does it pick a higher time incremented
by less than a millisecond.
What this patch does is to truncate the time read from the timeuuid to
millisecond resolution, and only then compare it to the result of dateOf().
We cannot hope for more.
Fixes#8060
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210211165046.878371-1-nyh@scylladb.com>
(cherry picked from commit a03a8a89a9)
Since fea5067df we enforce a limit on the memory consumption of
otherwise non-limited queries like reverse and non-paged queries. This
limit is sent down to the replicas by the coordinator, ensuring that
each replica is working with the same limit. This however doesn't work
in a mixed cluster, when upgrading from a version which doesn't have
this series. This has been worked around by falling back to the old
max_result_size constant of 1MB in mixed clusters. This however resulted
in a regression when upgrading from a pre fea5067df to a post fea5067df
one. Pre fea5067df already had a limit for reverse queries, which was
generalized to also cover non-paged ones too by fea5067df.
The regression manifested in previously working reverse queries being
aborted. This happened because even though the user has set a generous
limit for them before the upgrade, in the mix cluster replicas fall back
to the much stricter 1MB limit temporarily ignoring the configured limit
if the coordinator is an old node. This patch solves this problem by
using the locally configured limit instead of the max_result_size
constant. This means that the user has to take extra care to configure
the same limit on all replicas, but at least they will have working
reverse queries during the upgrade.
Fixes: #8022
Tests: unit(release), manual test by user who reported the issue
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210209075947.1004164-1-bdenes@scylladb.com>
(cherry picked from commit 3d001b5587)
Few method in column_familiy API were doing the aggregation wrong,
specifically, bloom filter disk size.
The issue is not always visible, it happens when there are multiple
filter files per shard.
Fixes#4513
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Closes#8007
(cherry picked from commit 4498bb0a48)
Since makeself script changes current umask, scylla_setup causes
"scylla does not work with current umask setting (0077)" error.
To fix that we need use latest version of makeself, and specfiy --keep-umask
option.
Fixes#6243Closes#6244
* github.com:scylladb/scylla:
dist/offline_redhat: fix umask error
dist/offline_installer/redhat: support cross build
(cherry picked from commit bb202db1ff)
This is a revival of #7490.
Quoting #7490:
The managed_bytes class now uses implicit linearization: outside LSA, data is never fragmented, and within LSA, data is linearized on-demand, as long as the code is running within with_linearized_managed_bytes() scope.
We would like to stop linearizing managed_bytes and keep it fragmented at all times, since linearization can require large contiguous chunks. Large contiguous allocations are hard to satisfy and cause latency spikes.
As a first step towards that, we remove all implicitly linearizing accessors and replace them with an explicit linearization accessor, with_linearized().
Some of the linearization happens long before use, by creating a bytes_view of the managed_bytes object and passing it onwards, perhaps storing it for later use. This does not work with with_linearized(), which creates a temporary linearized view, and does not work towards the longer term goal of never linearizing. As a substitute a managed_bytes_view class is introduced that acts as a view for managed_bytes (for interoperability it can also be a view for bytes and is compatible with bytes_view).
By the end of the series, all linearizations are temporary, within the scope of a with_linearized() call and can be converted to fragmented consumption of the data at leisure.
This has limited practical value directly, as current uses of managed_bytes are limited to keys (which are limited to 64k). However, it enables converting the atomic_cell layer back to managed_bytes (so we can remove IMR) and the CQL layer to managed_bytes/managed_bytes_view, removing contiguous allocations from the coordinator.
Closes#7820
* github.com:scylladb/scylla:
test: add hashers_test
memtable: fix accounting of managed_bytes in partition_snapshot_accounter
test: add managed_bytes_test
utils: fragment_range: add a fragment iterator for FragmentedView
keys: update comments after changes and remove an unused method
mutation_test: use the correct preferred_max_contiguous_allocation in measuring_allocator
row_cache: more indentation fixes
utils: remove unused linearization facilities in `managed_bytes` class
misc: fix indentation
treewide: remove remaining `with_linearized_managed_bytes` uses
memtable, row_cache: remove `with_linearized_managed_bytes` uses
utils: managed_bytes: remove linearizing accessors
keys, compound: switch from bytes_view to managed_bytes_view
sstables: writer: add write_* helpers for managed_bytes_view
compound_compat: transition legacy_compound_view from bytes_view to managed_bytes_view
types: change equal() to accept managed_bytes_view
types: add parallel interfaces for managed_bytes_view
types: add to_managed_bytes(const sstring&)
serializer_impl: handle managed_bytes without linearizing
utils: managed_bytes: add managed_bytes_view::operator[]
utils: managed_bytes: introduce managed_bytes_view
utils: fragment_range: add serialization helpers for FragmentedMutableView
bytes: implement std::hash using appending_hash
utils: mutable_view: add substr()
utils: fragment_range: add compare_unsigned
utils: managed_bytes: make the constructors from bytes and bytes_view explicit
utils: managed_bytes: introduce with_linearized()
utils: managed_bytes: constrain with_linearized_managed_bytes()
utils: managed_bytes: avoid internal uses of managed_bytes::data()
utils: managed_bytes: extract do_linearize_pure()
thrift: do not depend on implicit conversion of keys to bytes_view
clustering_bounds_comparator: do not depend on implicit conversion of keys to bytes_view
cql3: expression: linearize get_value_from_mutation() eariler
bytes: add to_bytes(bytes)
cql3: expression: mark do_get_value() as static
This test is a sanity check. It verifies that our wrappers over well known
hashes (xxhash, md5, sha256) actually calculate exactly those hashes.
It also checks that the `update()` methods of used hashers are linear with
respect to concatenation: that is, `update(a + b)` must be equivalent to
`update(a); update(b)`. This wasn't relied on before, but now we need to
confirm that hashing fragmented keys without linearizing them won't break
backward compatibility.
managed_bytes has a small overhead per each fragment. Due to that, managed_bytes
containing the same data can have different total memory usage in different
allocators. The smaller the preferred max allocation size setting is, the more
fragments are needed and the greater total per-fragment overhead is.
In particular, managed_bytes allocated in the LSA could grow in
memory usage when copied to the standard allocator, if the standard allocator
had a preferred max allocation setting smaller than the LSA.
partition_snapshot_accounter calculates the amount of memory used by
mutation fragments in the memtable (where they are allocated with LSA) based
on the memory usage after they are copied to the standard allocator.
This could result in an overestimation, as explained above.
But partition_snapshot_accounter must not overestimate the amount of freed
memory, as doing otherwise might result in OOM situations.
This patch prevents the overaccounting by adding minimal_external_memory_usage():
a new version of external_memory_usage(), which ignores allocator-dependent
overhead. In particular, it includes the per-fragment overhead in managed_bytes
only once, no matter how many fragments there are.
The comments were outdated after the latest changes (bytes_view vs
managed_bytes_view).
compound_view_wrapper::get_component() is unused, so we remove it.
Since we introduced relocatable package and offline installer, scylla binary itself can run almost any distributions.
However, setup scripts are not designed to run in unsupported distributions, it causes error on such environment.
This PR adds minimal support to run offline installation on unsupported distributions, tested on SLES, Arch Linux and Gentoo.
Closes#7858
* github.com:scylladb/scylla:
dist: use sysconfig_parser to parse gentoo config file
dist: add package name translation
dist: support SLES/OpenSUSE
install.sh: add systemd existance check
install.sh: ignore error missing sysctl entries
dist: show warning on unsupported distributions
dist: drop Ubuntu 14.04 code
dist: move back is_amzn2() to scylla_util.py
dist: rename is_gentoo_variant() to is_gentoo()
dist: support Arch Linux
dist: make sysconfig directory detectable
Flush is facing stalls because partition_snapshot_flat_reader::fill_buffer()
generates mutation fragment until buffer is full[1] without yielding.
this is the code path:
flush_reader::fill_buffer() <---------|
flat_mutation_reader::consume_pausable() <--------|
partition_snapshot_flat_reader::fill_buffer() -|
[1]: https://github.com/scylladb/scylla/blob/6cfc949e/partition_snapshot_reader.hh#L261
This is fixed by breaking the loop in do_fill_buffer() if preemption is needed,
allowing do_until() to yield in sequence, and when it resumes, continue from
where it left off, until buffer is full.
Fixes#7885.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210114141417.285175-1-raphaelsc@scylladb.com>
implicit revert of 6322293263
sshd previosly was used by the scylla manager 1.0.
new version does not need it. there is no point of
having it currently. it also confuses everyone.
Signed-off-by: Ivan Prisyazhnyy <ivan@scylladb.com>
Closes#7921
The client_state::check_access() calls for global storage service
to get the features from it and check if the CDC feature is on.
The latter is needed to perform CDC-specific checks.
However it was noticed, that the check for the feature is excessive
as all the guarded if-s will resolve to false in case CDC is off
and the check_access will effectively work as it would with the
feature check.
With that observation, it's possible to ditch one more global storage
service reference.
tests: unit(dev), dtest(dev, auth)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210105063651.7081-1-xemul@scylladb.com>
offline installer can run in non-systemd distributions, but it won't
work since we only have systemd units.
So check systemd existance and print error message.
is_redhat_variant() is the function to detect RHEL/CentOS/Fedora/OEL,
and is_debian_variant() is the function to detect Debian/Ubuntu.
Unlike these functions, is_gentoo_variant() does not detect "Gentoo variants",
we should rename it to is_gentoo().
Currently, install.sh provide a way to customize sysconfig directory,
but sysconfig directory is hardcoded on script.
Also, /etc/sysconfig seems correct to use default value, but current
code specify /etc/default as non-redhat distributions.
Instead of hardcoding, generate generate python script in install.sh
to save specified sysconfig directory path in python code.
The reply to a /column_family/ GET request contains info about all
column families. Currently, all this info is stored in a single
string when replying, and this string may require a big allocation
when there are many column families.
To avoid that allocation, instead of a single string, use a
body_writer function, which writes chunks of the message content
to the output stream.
Fixes#7916
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
Closes#7917
After these changes the generated code deserializes the stream into a chunked vector, instead of an contiguous one, so even if there are many fields in it, there won't be any big allocations.
I haven't run the scylla cluster test with it yet but it passes the unit tests.
Closes#7919
* github.com:scylladb/scylla:
idl: change the type of mutation_partition_view::rows() to a chunked_vector
idl-compiler: allow fields of type utils::chunked_vector
Numbers in JSON are not limited in range, so when the fromJson() function
converts a number to a limited-range integer column in Scylla, this
conversion can overflow. The following tests check that this conversion
should result in an error (FunctionFailure), not silent trunction.
Scylla today does silently wrap around the number, so these tests
xfail. They pass on Cassandra.
Refs #7914.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210112151041.3940361-1-nyh@scylladb.com>
This patch adds more (failing) tests for issue #7911, where fromJson()
failures should be reported as a clean FunctionFailure error, not an
internal server error.
The previous tests we had were about JSON parse failures, but a
different type of error we should support is valid JSON which returned
the wrong type - e.g., the JSON returning a string when an integer
was expected, or the JSON returning a string with non-ASCII characters
when ASCII was expected. So this patch adds more such tests. All of
them xfail on Scylla, and pass on Cassandra.
Refs #7911.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210112122211.3932201-1-nyh@scylladb.com>
This patch adds a reproducer test for issue #7912, which is about passing
a null parameter to the fromJson() function supposed to be legal (and
return a null value), and is legal in Cassandra, but isn't allowed in
Scylla.
There are two tests - for a prepared and unprepared statement - which
fail in different ways. The issue is still open so the tests xfail on
Scylla - and pass on Cassandra.
Refs #7912.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210112114254.3927671-1-nyh@scylladb.com>
Related issue scylladb/sphinx-scylladb-theme#88
Once this commit is merged, the docs will be published under the new domain name https://scylla.docs.scylladb.com
Frequently asked questions:
Should we change the links in the README/docs folder?
GitHub automatically handles the redirections. For example, https://scylladb.github.io/sphinx-scylladb-theme/stable/examples/index.html redirects to https://sphinx-theme.scylladb.com/stable/examples/index.html
Nevertheless, it would be great to change URLs progressively to avoid the 301 redirections.
Do I need to add this new domain in the custom dns domain section on GitHub settings?
It is not necessary. We have already edited the DNS for this domain and the theme creates programmatically the required CNAME file. If everything goes well, GitHub should detect the new URL after this PR is merged.
The DNS doesn't seem to have the right SSL certificates
GitHub handles the certificate provisioning but is not aware of the subdomain for this repo yet. make multi-version will create a new file "CNAME". This is published in gh-pages branch, therefore GitHub should create the missing cert.
Closes#7877
Use the thread_local seastar::testing::local_random_engine
in all seastar tests so they can be reproduced using
the --random-seed option.
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210112103713.578301-2-bhalevy@scylladb.com>
The min/max aggregators use aggregate_type_for comparators, and the
aggregate_type_for<timeuuid> is regular uuid. But that yields wrong
results; timeuuids should be compared as timestamps.
Fix it by changing aggregate_type_for<timeuuid> from uuid to timeuuid,
so aggregators can distinguish betwen the two. Then specialize the
aggregation utilities for timeuuid.
Add a cql-pytest and change some unit tests, which relied on naive
uuid comparators.
Fixes#7729.
Tests: unit (dev, debug)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes#7910
"
Without interposer consumer on flush, it could happen that a new sstable,
produced by memtable flush, will not conform to the strategy invariant.
For example, with TWCS, this new sstable could span multiple time windows,
making it hard for the strategy to purge expired data. If interposer is
enabled, the data will be correctly segregated into different sstables,
each one spanning a single window.
Fixes#4617.
tests:
- mode(dev).
- manually tested it by forcing a flush of memtable spanning many windows
"
* 'segregation_on_flush_v2' of github.com:raphaelsc/scylla:
test: Add test for TWCS interposer on memtable flush
table: Wire interposer consumer for memtable flush
table: Add write_memtable_to_sstable variant which accepts flat_mutation_reader
table: Allow sstable write permit to be shared across monitors
memtable: Track min timestamp
table: Extend cache update to operate a memtable split into multiple sstables
This patch adds a reproducer test for issue #7911, which is about a parse
error in JSON string passed to the fromJson() function causing an
internal error instead of the expected FunctionFailure error.
The issue is still open so the test xfails on Scylla (and passes on
Cassandra).
Refs #7911.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210112094629.3920472-1-nyh@scylladb.com>
The option can only take integer values >= 0, since negative
TTL is meaningless and is expected to fail the query when used
with `USING TTL` clause.
It's better to fail early on `CREATE TABLE` and `ALTER TABLE`
statement with a descriptive message rather than catch the
error during the first lwt `INSERT` or `UPDATE` while trying
to insert to system.paxos table with the desired TTL.
Tests: unit(dev)
Fixes: #7906
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210111202942.69778-1-pa.solodovnikov@scylladb.com>
Unfortunately snapshot checking still does not work in the presence of
log entries reordering. It is impossible to know when exactly the
snapshot will be taken and if it is taken before all smaller than
snapshot idx entries are applied the check will fail since it assumes
that.
This patch disabled snapshot checking for SUM state machine that is used
in backpressure test.
Message-Id: <20201126122349.GE1655743@scylladb.com>
The value of mutation_partition_view::rows() may be very large, but is
used almost exclusively for iteration, so in order to avoid a big allocation
for an std::vector, we change its type to an utils::chunked_vector.
Fixes#7918
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
The utils::chunked_vector has practically the same methods
as a std::vector, so the same code can be generated for it.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
We have recently seen a suspected corrupt mutation fragment stream to get
into an sstable undetected, causing permanent corruption. One of the
suspected ways this could happen is the compaction sstable write path not
being covered with a validator. To prevent events like this in the future
make sure all sstable write paths are validated by embedding the validator
right into the sstable writer itself.
Refs: #7623
Refs: #7640
Tests: unit(release)
* https://github.com/denesb/scylla.git sstable-writer-fragment-stream-validation/v2:
sstable_writer: add validation
test/boost/sstable_datafile_test: sstable_scrub_test: disable key validation
mutation_fragment_stream_validator: make it easier to validate concrete fragment types
flat_mutation_reader: extract fragment stream validator into its own header
Cassandra constructs `QueryOptions.SpecificOptions` in the same
way that we do (by not providing `serial_constency`), but they
do have a user-defined constructor which does the following thing:
this.serialConsistency = serialConsistency == null ? ConsistencyLevel.SERIAL : serialConsistency;
This effectively means that DEFAULT `SpecificOptions` always
have `SerialConsistency` set to `SERIAL`, while we leave this
`std::nullopt`, since we don't have a constructor for
`specific_options` which does this.
Supply `db::consistency_level::SERIAL` explicitly to the
`specific_options::DEFAULT` value.
Tests: unit(dev)
Fixes: #7850
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20201231104018.362270-1-pa.solodovnikov@scylladb.com>
This adds a simple reproducer for a bug involving a CONTAINS relation on
frozen collection clustering columns when the query is restricted to a
single partition - resulting in a strange "marshalling error".
This bug still exists, so the test is marked xfail.
Refs #7888.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210107191417.3775319-1-nyh@scylladb.com>
We add a reproducer for issues #7868 and #7875 which are about bugs when
a table has a frozen collection as its clustering key, and it is sorted
in *reverse order*: If we tried to insert an item to such a table using an
unprepared statement, it failed with a wrong error ("invalid set literal"),
but if we try to set up a prepared statement, the result is even worse -
an assertion failure and a crash.
Interestingly, neither of these problems happen without reversed sort order
(WITH CLUSTERING ORDER BY (b DESC)), and we also add a test which
demonstrates that with default (increasing) order, everything works fine.
All tests pass successfully when run against Cassandra.
The fix for both issues was already committed, so I verified these tests
reproduced the bug before that commit, and pass now.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210110232312.3844408-1-nyh@scylladb.com>
In this patch, we port validation/entities/frozen_collections_test.java,
containing 33 tests for frozen collections of all types, including
nesting collections.
In porting these tests, I uncovered four previously unknown bugs in Scylla:
Refs #7852: Inserting a row with a null key column should be forbidden.
Refs #7868: Assertion failure (crash) when clustering key is a frozen
collection and reverse order.
Refs #7888: Certain combination of filtering, index, and frozen collection,
causes "marshalling error" failure.
Refs #7902: Failed SELECT with tuple of reversed-ordered frozen collections.
These tests also provide two more reproducers for an already known bug:
Refs #7745: Length of map keys and set items are incorrectly limited to
64K in unprepared CQL.
Due to these bugs, 7 out of the 33 tests here currently xfail. We actually
had more failing tests, but we fixed issue #7868 before this patch went in,
so its tests are passing at the time of this submission.
As usual in these sort of tests, all 33 pass when running against Cassandra.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210110231350.3843686-1-nyh@scylladb.com>
In test_streams.py we had some code to get a list of shards and iterators
duplicated three times. Put it in a function, shards_and_latest_iterators(),
to reduce this duplication.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201006112421.426096-1-nyh@scylladb.com>
Add a mutation_fragment_stream_validating_filter to
sstables::writer_impl and use it in sstable_writer to validate the
fragment stream passed down to the writer implementation. This ensures
that all fragment streams written to disk are validated, and we don't
have to worry about validating each source separately.
The current validator from sstable::write_components() is removed. This
covers only part of the write paths. Ad-hoc validations in the reader
implementations are removed as well as they are now redundant.
The test violates clustering key order on purpose to produce a corrupt
sstable (to test scrub). Disable key validation so when we move the
validator into the writer itself in the next patch it doesn't abort the
test.
The current API is tailored to the `mutation_fragment` type. In
the next patch we will want to use the validator from a context where
the mutation fragments are already decomposed into their respective
concrete types, e.g. static_row, clustering_row, etc. To avoid having to
reconstruct a mutation fragment type just to use the validator, add an
API which allows validating these concrete types conveniently too.
Replace two methods for unreversal (`as` and `self_or_reversed`) with
a new one (`without_reversed`). More flexible and better named.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes#7889
Currently, frozen mutations, that contain partitions with out-of-order
or duplicate rows will trigger (if they even do) an assert in
`row::append_cell()`. However, this results in poor diagnostics (if at
all) as the context doesn't contain enough information on what exactly
went wrong. This results in a cryptic error message and an investigation
that can only start after looking at a coredump.
This series remedies this problem by explicitly checking for
out-of-order and duplicate rows, as early as possible, when the
supposedly empty row is created. If the row already existed (is a
duplicate) or it is not the last row in the partition (out-of-order row)
an exception is thrown and the deserialization is aborted. To further
improve diagnostics, the partition context is also added to the
exception.
Tests: unit(release)
* botond/frozen-mutation-bad-row-diagnostics/v3:
frozen_mutation: add partition context to errors coming from deserializing
partition_builder: accept_row(): use append_clustering_row()
mutation_partition: add append_clustered_row()
measuring_allocator is a wrapper around standard_allocator, but it exposed
the default preferred_max_contiguous_allocation, not the one from
standard_allocator. Thus managed_bytes allocated in those two allocators
had fragments of different size, and their total memory usage differed,
causing test_external_memory_usage to fail if
standard_allocator::preferred_max_contiguous_allocation was changed from the
default. Fix that.
Remove the following bits of `managed_bytes` since they are unused:
* `with_linearized_managed_bytes` function template
* `linearization_context_guard` RAII wrapper class for managing
`linearization_context` instances.
* `do_linearize` function
* `linearization_context` class
Since there is no more public or private methods in `managed_class`
to linearize the value except for explicit `with_linearized()`,
which doesn't use any of aforementioned parts, we can safely remove
these.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
The patch fixes indentation issues introduced in previous patches
related to removing `with_linearized_managed_bytes` uses from the
code tree.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
There is no point in calling the wrapper since linearization code
is private in `managed_bytes` class and there is no one to call
`managed_bytes::data` because it was deleted recently.
This patch is a prerequisite for removing
`with_linearized_managed_bytes` function completely, alongside with
the corresponding parts of implementation in `managed_bytes`.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Since `managed_bytes::data()` is deleted as well as other public
APIs of `managed_bytes` which would linearize stored values except
for explicit `with_linearized`, there is no point
invoking `with_linearized_managed_bytes` hack which would trigger
automatic linearization under the hood of managed_bytes.
Remove useless `with_linearized_managed_bytes` wrapper from
memtable and row_cache code.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
The keys classes (partition_key et al) already use managed_bytes,
but they assume the data is not fragmented and make liberal use
of that by casting to bytes_view. The view classes use bytes_view.
Change that to managed_bytes_view, and adjust return values
to managed_bytes/managed_bytes_view.
The callers are adjusted. In some places linearization (to_bytes())
is needed, but this isn't too bad as keys are always <= 64k and thus
will not be fragmented when out of LSA. We can remove this
linearization later.
The serialize_value() template is called from a long chain, and
can be reached with either bytes_view or managed_bytes_view.
Rather than trace and adjust all the callers, we patch it now
with constexpr if.
operator bytes_view (in keys) is converted to operator
managed_bytes_view, allowing callers to defer or avoid
linearization.
bytes_view can convert to managed_bytes_view, so the change
is compatible with the existing representation and the next
patches, which change compound types to use managed_bytes_view.
This operator has a single purpose: an easier port of legacy_compound_view
from bytes_view to managed_bytes_view.
It is inefficient and should be removed as soon as legacy_compound_view stops
using operator[].
managed_bytes_view is a non-owning view into managed_bytes.
It can also be implicitly constructed from bytes_view.
It conforms to the FragmentedView concept and is mainly used through that
interface.
It will be used as a replacement for bytes_view occurrences currently
obtained by linearizing managed_bytes.
This is a preparation for the upcoming introduction of managed_bytes_view,
intended as a fragmented replacement for bytes_view.
To ease the transition, we want both types to give equal hashes for equal
contents.
Unset values for key and value were not handled. Handle them in a
manner matching Cassandra.
This fixes all cases in testMapWithUnsetValues, so re-enable it (and
fix a comment typo in it).
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
When the right-hand side of IN is an unset value, we must report an
error, like Cassandra does.
This fixes testListWithUnsetValues, so re-enable it.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Make the bind() operation of the scalar marker handle the unset-value
case (which it previously didn't).
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Avoid crash described in #7740 by ignoring the update when the
element-to-remove is UNSET_VALUE.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Since we haven't implemented parse error on redis protocol parser,
reply message is broken at parse error.
Implemented parse error, reply error message correctly.
Fixes#7861Fixes#7114Closes#7862
When the clustering order is reversed on a map column, the column type
is reversed_type_impl, not map_type_impl. Therefore, we have to check
for both reversed type and map type in some places.
This patch handles reverse types in enough places to make
test_clustering_key_reverse_frozen_map pass. However, it leaves
other places (invocations of is_map() and *_cast<map_type_impl>())
as they currently are; some are protected by callers from being
invoked on reverse types, but some are quite possibly bugs untriggered
by existing tests.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
When the clustering order is reversed on a list column, the column type
is reversed_type_impl, not list_type_impl. Therefore, we have to check
for both reversed type and list type in some places.
This patch handles reverse types in enough places to make
test_clustering_key_reverse_frozen_list pass. However, it leaves
other places (invocations of is_list() and *_cast<list_type_impl>())
as they currently are; some are protected by callers from being
invoked on reverse types, but some are quite possibly bugs untriggered
by existing tests.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
When the clustering order is reversed on a set column, the column type
is reversed_type_impl, not set_type_impl. Therefore, we have to check
for both reversed type and set type in some places.
To make such checks easier, add convenience methods self_or_reversed()
and as() to abstract_type. Invoke those methods (instead of is_set()
and casts) enough to make test_clustering_key_reverse_frozen_set pass.
Leave other invocations of is_set() and *_cast<set_type_impl>() as
they are; some are protected by callers from being invoked on reverse
types, but some are quite possibly bugs untriggered by existing tests.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
This patch enables select cql statements where collection columns are
selected columns in queries where clustering column is restricted by
"IN" cql operator. Such queries are accepted by cassandra since v4.0.
The internals actually provide correct support for this feature already,
this patch simply removes relevant cql query check.
Tests: cql-pytest (testInRestrictionWithCollection)
Fixes#7743Fixes#4251
Signed-off-by: Vojtech Havel <vojtahavel@gmail.com>
Message-Id: <20210104223422.81519-1-vojtahavel@gmail.com>
* seastar d1b5d41b...a2fc9d72 (6):
> perftune.py: support passing multiple --nic options to tune multiple interfaces at once
> perftune.py recognize and sort IRQs for Mellanox NICs
> perftune.py: refactor getting of driver name into __get_driver_name()
Fixes#6266
> install-dependencies: support Manjaro
> append_challenged_posix_file_impl: optimize_queue: use max of sloppy_size_hint and speculative_size
> future: do_until: handle exception in stop condition
"
The size_estimates_mutation_reader call for global proxy
to get database from. The database is used to find keyspaces
to work with. However, it's safe to keep the local database
refernece on the reader itself.
tests: unit(debug)
"
* 'br-no-proxy-in-size-estimate-reader' of https://github.com/xemul/scylla:
size_estimate_reader: Use local db reference not global
size_estimate_reader: Keep database reference on mutation reader
size_estimate_reader: Keep database reference on virtual_reader
Conversions from views to owners have no business being implicit.
Besides, they would also cause various ambiguity problems when adding
managed_bytes_view.
From now on, memtable flush will use the strategy's interposer consumer
iff split_during_flush is enabled (disabled by default).
It has effect only for TWCS users as TWCS it's the only strategy that
goes on to implement this interposer consumer, which consists of
segregating data according to the window configuration.
Fixes#4617.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
As a preparation for interposer on flush, let's allow database write monitor
to store a shared sstable write permit, which will be released as soon as
any of the sstable writers reach the sealing stage.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
In this patch, we port validation/entities/collection_test.java, containing
7 tests for CQL counters. Happily, these tests did not uncover any bugs in
Scylla and all pass on both Cassandra and Scylla.
There is one small difference that I decided to ignore instead of reporting
a bug. If you try a CREATE TABLE with both counter and non-counter columns,
Scylla gives a ConfigurationException error, while Cassandra gives a more
reasonable InvalidRequest. The ported test currently allows both.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201223181325.3148928-1-nyh@scylladb.com>
In issue #7843 there were questions raised on how much does Scylla
support the notion of Unicode Equivalence, a.k.a. Unicode normalization.
Consider the Spanish letter ñ - it can be represented by a single Unicode
character 00F1, but can also be represented as a 006E (lowercase "n")
followed by a 0303 ("combining tilde"). Unicode specifies that these
two representations should be considered "equivalent" for purposes of
sorting or searching. But the following tests demonstrates that this
is not, in fact, supported in Scylla or Cassandra:
1. If you use one representation as the key, then looking up the other one
will not find the row. Scylla (and Cassandra) do *not* consider
the two strings equivalent.
2. The LIKE operator (a Scylla-only extension) doesn't know that
the single-character ñ begins with an n, or that the two-character
ñ is just a single character.
This is despite the thinking on #7843 which by using ICU in the
implementation of LIKE, we somehow got support for this. We didn't.
Refs #7843
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201229125330.3401954-1-nyh@scylladb.com>
This patch adds a reproducer for issue #7856, which is about frozen sets
and how we can in Scylla (but not in Cassandra), insert one in the "wrong"
order, but only in very specific circumstances which this reproducer
demonstrates: The bug can only be reproduced in a nested frozen collection,
and using prepared statements.
Refs #7856
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201231085500.3514263-1-nyh@scylladb.com>
Tracking both min and max timestamp will be required for memtable flush
to short-circuit interposer consumer if needed.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This extension is needed for future work where a memtable will be segregated
during flush into one sstable or more. So now multiple sstables can be added
to the set after a memtable flush, and compaction is only triggered at the
end.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
- the mystical `accept_predicate` is renamed to `accept_keyspace`
to be more self-descriptive
- a short comment is added to the original calculate_schema_digest
function header, mentioning that it computes schema digest
for non-system keyspaces
Refs #7854
Message-Id: <04f1435952940c64afd223bd10a315c3681b1bef.1609763443.git.sarna@scylladb.com>
In scylla-jmx, we fixed a hardcode sysconfdir in EnvironmentFile path,
realpath was used to convert the path. This patch changed to use
realpath in scylla repo to make it consistent with scylla-jmx.
Suggested-by: Pekka Enberg <penberg@scylladb.com>
Signed-off-by: Amos Kong <amos@scylladb.com>
Closes#7860
The original idea for `schema_change_test` was to ensure that **if** schema hasn't changed, the digest also remained unchanged. However, a cumbersome side effect of adding an internal distributed table (or altering one) is that all digests in `schema_change_test` are immediately invalid, because the schema changed.
Until now, each time a distributed system table was added/amended, a new test case for `schema_change_test` was generated, but this effort is not worth the effect - when a distributed system table is added, it will always propagate on its own, so generating a new test case does not bring any tangible new test coverage - it's just a pain.
To avoid this pain, `schema_change_test` now explicitly skips all internal keyspaces - which includes internal distributed tables - when calculating schema digest. That way, patches which change the way of computing the digest itself will still require adding a new test case, which is good, but, at the same time, changes to distributed tables will not force the developers to introduce needless schema features just for the sake of this test.
Tests:
* unit(dev)
* manual(rebasing on top of a change which adds two distributed system tables - all tests still passed)
Refs #7617Closes#7854
* github.com:scylladb/scylla:
schema_change_test: skip distributed system tables in digest
schema_tables: allow custom predicates in schema digest calc
alternator: drop unneeded sstring creation
system_keyspace: migrate helper functions to string_view
database: migrate find_keyspace to string views
With previous design of the schema change test, a regeneration
was necessary each time a new distributed system table was added.
It was not the original purpose of the test to keep track of new
distributed tables which simply propagate on their own,
so the test case is now modified: internal distributed tables
are not part of the schema digest anymore, which means that
changes inside them will not cause mismatches.
This change involves a one-shot regeneration of all digests,
which due to historical reasons included internal distributed
tables in the digest, but no further regenerations should ever
be necessary when a new internal distributed table is added.
For testing purposes it would be useful to be able to skip computing
schema for certain tables (namely, internal distributed tables).
In order to allow that, a function which accepts a custom predicate
is added.
It's now possible to use string views to check if a particular
table is a system table, so it's no longer needed to explicitly
create an sstring instance.
Functions for checking if the keyspace is system/internal were based
on sstring references, which is impractical compared to string views
and may lead to unnecessary creation of sstring instances.
It looks like the history of the flag begins in Cassandra's
https://issues.apache.org/jira/browse/CASSANDRA-7327 where it is
introduced to speedup tests by not needing to start the gossiper.
The thing is we always start gossiper in our cql tests, so the flag only
introduce noise. And, of course, since we want to move schema to use raft
it goes against the nature of the raft to be able to apply modification only
locally, so we better get rid of the capability ASAP.
Tests: units(dev, debug)
Message-Id: <20201230111101.4037543-2-gleb@scylladb.com>
When a node notice that it uses legacy SI tables it converts them to use
new format, but it update only local schema. It will only cause schema
discrepancy between nodes, there schema change should propagate
globally.
Fixes#7857.
Message-Id: <20201230111101.4037543-1-gleb@scylladb.com>
After 13fa2bec4c, every compaction will be performed through a filtering
reader because consumers cannot do the filtering if interposer consumer is
enabled.
It turns out that filtering_reader is adding significant overhead when regular
compactions are running. As no other compaction type need to actually do
any filtering, let's limit filtering_reader to cleanup compaction.
Alternatively, we could disable interposer consumer on behalf of cleanup,
or add support for the consumers to do the filtering themselves but that
would add lots of complexity.
Fixes#7748.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20201230194516.848347-2-raphaelsc@scylladb.com>
This filter is used to discard data that doesn't belong to current
shard, but scylla will only make a sstable available to regular
compaction after it was resharded on either boot or refresh.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20201230194516.848347-1-raphaelsc@scylladb.com>
This reverts commit ceb67e7728. The
"epel-release" package is needed to install the "supervisord"
package, which I somehow missed in testing...
Fixes#7851
This patch adds two simple tests for what happens when a user tries to
insert a row with one of the key column missing. The first tests confirms
that if the column is completely missing, we correctly print an error
(this was issue #3665, that was already marked fixed).
However, the second test demonstrates that we still have a bug when
the key column appears on the command, but with a null value.
In this case, instead of failing the insert (as Cassandra does),
we silently ignore it. This is the proper behavior for UNSET_VALUE,
but not for null. So the second test is marked xfail, and I opened
issue #7852 about it.
Refs #3665
Refs #7852
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201230132350.3463906-1-nyh@scylladb.com>
In a previous version of test_using_timeout.py, we had tables pre-filled
with some content labled "everything". The current version of the tests
don't use it, so drop it completely.
One test, test_per_query_timeout_large_enough, still had code that did
res = list(cql.execute(f"SELECT * FROM {table} USING TIMEOUT 24h"))
assert res == everything
this was a bug - it only works as expected if this test is run before
anything other test is run, and will fail if we ever reorder or parallelize
these tests. So drop these two lines.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201229145435.3421185-1-nyh@scylladb.com>
This miniseries rewrites two alternator request handlers from seastar threads to coroutines - since these handlers are not on a hot path and using seastar threads is way too heavy for such a simple routine.
NOTE: this pull request obviously has to wait until coroutines are fully supported in Seastar/Scylla.
Closes#7453
* github.com:scylladb/scylla:
alternator: coroutinize untagging a resource
alternator: coroutinize tagging a resource
node_exporter had been added to scylla-server package by commit
95197a09c9.
So we can enable it by default for offline installation.
Closes#7832
* github.com:scylladb/scylla:
scylla_setup: cleanup if judgments
scylla_setup: enable node_exporter for offline installation
On every compaction completion, sstable set is rebuilt from scratch.
With LCS and ~160G of data per shard, it means we'll have to create
a new sstable set with ~1000 entries whenever compaction completes,
which will likely result in reactor stalling for a significant
amount of time.
Fixes#7758.
Closes#7842
* github.com:scylladb/scylla:
table: Fix potential reactor stall on LCS compaction completion
table: decouple preparation from execution when updating sstable set
table: change rebuild_sstable_list to return new sstable set
row_cache: allow external updater to decouple preparation from execution
The range_tombstone_list always (unless misused?) contains de-overlapped
entries. There's a test_add_random that checks this, but it suffers from
several problems:
- generated "random" ranges are sequential and may only overlap on
their borders
- test uses the keys of the same prefix length
Enhance the generator part to produce a purely random sequence of ranges
with bound keys of arbitrary length. Just pay attention to generate the
"valid" individual ranges, whose start is not ahead of the end.
Also -- rename the test to reflect what it's doing and increase the
number of iterations.
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20201228115525.20327-1-xemul@scylladb.com>
On every compaction completion, sstable set is rebuilt from scratch.
With LCS and ~160G of data per shard, it means we'll have to create
a new sstable set with ~1000 entries whenever compaction completes,
which will likely result in reactor stalling for a significant
amount of time.
This is fixed by futurizing build_new_sstable_list(), so it will
yield whenever needed.
Fixes#7758.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
row cache now allows updater to first prepare the work, and then execute
the update atomically as the last step. let's do that when rebuilding
the set, so now new set is created in the preparation phase, and the
new set replaces the old one in the execution phase, satisfying the
atomicity requirement of row cache.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
procedure is changed to return the new set, so caller will be responsible
for replacing the old set with the new one. this will allow our future
work where building new set and enabling it will be decoupled.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
External updater may do some preparatory work like constructing a new sstable list,
and at the end atomically replace the old list by the new one.
Decoupling the preparation from execution will give us the following benefits:
- the preparation step can now yield if needed to avoid reactor stalls, as it's
been futurized.
- the execution step will now be able to provide strong exception guarantees, as
it's now decoupled from the preparation step which can be non-exception-safe.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The CQL tests in test/cql-pytest use the Python CQL driver's default
timeout for execute(), which is 10 seconds. This usually more than
enough. However, in extreme cases like noted in issue #7838, 10
seconds may not be enough. In that issue, we run a very slow debug
build on a very slow test machine, and encounter a very slow request
(a DROP KEYSPACE that needs to drop multiple tables).
So this patch increases the default timeout to an even larger
120 seconds. We don't care that this timeout is ridiculously
large - under normal operations it will never be reached, there
is no code which loops for this amount of time for example.
Tested that this patch fixes#7838 by choosing a much lower timeout
(1 second) and reproducing test failures caused by timeouts.
Fixes#7838.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201228090847.3234862-1-nyh@scylladb.com>
The MAX_LEVELS is the levels count, but sstable level (index) starts
from 0. So the maximum and valid level is MAX_LEVELS - 1.
Signed-off-by: Amos Kong <amos@scylladb.com>
Closes#7833
sstable_writer may depend on the sstable throughout its whole lifecycle.
If the sstable is freed before the sstable_writer we might hit use-after-free
as in the follwing case:
```
std::_Deque_iterator<sstables::compression::segmented_offsets::bucket, sstables::compression::segmented_offsets::bucket&, sstables::compression::segmented_offsets::bucket*>::operator+=(long) at /usr/include/c++/10/bits/stl_deque.h:240
(inlined by) std::operator+(std::_Deque_iterator<sstables::compression::segmented_offsets::bucket, sstables::compression::segmented_offsets::bucket&, sstables::compression::segmented_offsets::bucket*> const&, long) at /usr/include/c++/10/bits/stl_deque.h:378
(inlined by) std::_Deque_iterator<sstables::compression::segmented_offsets::bucket, sstables::compression::segmented_offsets::bucket&, sstables::compression::segmented_offsets::bucket*>::operator[](long) const at /usr/include/c++/10/bits/stl_deque.h:252
(inlined by) std::deque<sstables::compression::segmented_offsets::bucket, std::allocator<sstables::compression::segmented_offsets::bucket> >::operator[](unsigned long) at /usr/include/c++/10/bits/stl_deque.h:1327
(inlined by) sstables::compression::segmented_offsets::push_back(unsigned long, sstables::compression::segmented_offsets::state&) at ./sstables/compress.cc:214
sstables::compression::segmented_offsets::writer::push_back(unsigned long) at ./sstables/compress.hh:123
(inlined by) compressed_file_data_sink_impl<crc32_utils, (compressed_checksum_mode)1>::put(seastar::temporary_buffer<char>) at ./sstables/compress.cc:519
seastar::output_stream<char>::put(seastar::temporary_buffer<char>) at table.cc:?
(inlined by) seastar::output_stream<char>::put(seastar::temporary_buffer<char>) at ././seastar/include/seastar/core/iostream-impl.hh:432
seastar::output_stream<char>::flush() at table.cc:?
seastar::output_stream<char>::close() at table.cc:?
sstables::file_writer::close() at sstables.cc:?
sstables::mc::writer::~writer() at writer.cc:?
(inlined by) sstables::mc::writer::~writer() at ./sstables/mx/writer.cc:790
sstables::mc::writer::~writer() at writer.cc:?
flat_mutation_reader::impl::consumer_adapter<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> > >::~consumer_adapter() at compaction.cc:?
(inlined by) std::_Optional_payload_base<sstables::compaction_writer>::_M_destroy() at /usr/include/c++/10/optional:260
(inlined by) std::_Optional_payload_base<sstables::compaction_writer>::_M_reset() at /usr/include/c++/10/optional:280
(inlined by) std::_Optional_payload<sstables::compaction_writer, false, false, false>::~_Optional_payload() at /usr/include/c++/10/optional:401
(inlined by) std::_Optional_base<sstables::compaction_writer, false, false>::~_Optional_base() at /usr/include/c++/10/optional:474
(inlined by) std::optional<sstables::compaction_writer>::~optional() at /usr/include/c++/10/optional:659
(inlined by) sstables::compacting_sstable_writer::~compacting_sstable_writer() at ./sstables/compaction.cc:229
(inlined by) compact_mutation<(emit_only_live_rows)0, (compact_for_sstables)1, sstables::compacting_sstable_writer, noop_compacted_fragments_consumer>::~compact_mutation() at ././mutation_compactor.hh:468
(inlined by) compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer>::~compact_for_compaction() at ././mutation_compactor.hh:538
(inlined by) std::default_delete<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >::operator()(compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer>*) const at /usr/include/c++/10/bits/unique_ptr.h:85
(inlined by) std::unique_ptr<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer>, std::default_delete<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> > >::~unique_ptr() at /usr/include/c++/10/bits/unique_ptr.h:361
(inlined by) stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >::~stable_flattened_mutations_consumer() at ././mutation_reader.hh:342
(inlined by) flat_mutation_reader::impl::consumer_adapter<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> > >::~consumer_adapter() at ././flat_mutation_reader.hh:201
auto flat_mutation_reader::impl::consume_in_thread<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, flat_mutation_reader::no_filter>(stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, flat_mutation_reader::no_filter, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ././flat_mutation_reader.hh:272
(inlined by) auto flat_mutation_reader::consume_in_thread<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, flat_mutation_reader::no_filter>(stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, flat_mutation_reader::no_filter, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ././flat_mutation_reader.hh:383
(inlined by) auto flat_mutation_reader::consume_in_thread<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> > >(stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ././flat_mutation_reader.hh:389
(inlined by) seastar::future<void> sstables::compaction::setup<noop_compacted_fragments_consumer>(noop_compacted_fragments_consumer)::{lambda(flat_mutation_reader)#1}::operator()(flat_mutation_reader)::{lambda()#1}::operator()() at ./sstables/compaction.cc:612
```
What happens here is that:
compressed_file_data_sink_impl(output_stream<char> out, sstables::compression* cm, sstables::local_compression lc)
: _out(std::move(out))
, _compression_metadata(cm)
, _offsets(_compression_metadata->offsets.get_writer())
, _compression(lc)
, _full_checksum(ChecksumType::init_checksum())
_compression_metadata points to a buffer held by the sstable object.
and _compression_metadata->offsets.get_writer returns a writer that keeps
a reference to the segmented_offsets in the sstables::compression
that is used in the ~writer -> close path.
Fixes#7821
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201227145726.33319-1-bhalevy@scylladb.com>
When applying a mutation partition to another if a dummy entry
from the source falls into a destination continuous range, it
can be just dropped. However, current implementation still
inserts it and then instantly removes.
Relax this code-flow by dropping the unwanted entry without
tossing it.
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20201224130438.11389-1-xemul@scylladb.com>
node_exporter had been added to scylla-server package by commit
95197a09c9.
So we can enable it by default for offline installation.
Signed-off-by: Amos Kong <amos@scylladb.com>
When a rows_entry is added to row_cache it's constructed from
clustering_row by unpacking all its internals and putting
them into the rows_entry's deletable_row. There's a shorter
way -- the clustering_row already has the deletale_row onboard
from which rows_entry can copy-construct its.
This lets keeping the rows_entry and deletable_row set of
constructors a bit shorter.
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20201224161112.20394-1-xemul@scylladb.com>
"
This series does a lot of cleanups, dead code removal, and most
importantly fixes the following things in IDL compiler tool:
* The grammar now rejects invalid identifiers, which,
in some cases, allowed to write things like `std:vector`.
* Error reporting is improved significantly and failures are
now pointing to the place of failure much more accurately.
This is done by restricting rule backtracing on those rules
which don't need it.
"
* 'idl-compiler-minor-fixes-v4' of https://github.com/ManManson/scylla:
idl: move enum and class serializer code writers to the corresponding AST classes
idl: extract writer functions for `write`, `read` and `skip` impls for classes and enums
idl: minor fixes and code simplification
idl: change argument name from `hout` to `cout` in all dependencies of `add_visitors` fn
idl: fix parsing of basic types and discard unneeded terminals
idl: remove unused functions
idl: improve error tracing in the grammar and tighten-up some grammar rules
idl: remove redundant `set_namespace` function
idl: remove unused `declare_class` function
idl: slightly change `str` and `repr` for AST types
idl: place directly executed init code into if __name__=="__main__"
To connection-less environment, we need to add node_exporter binary
to scylla-server package, not downloading it from internet.
Related #7765Fixes#2190Closes#7796
It turns out that `cql_table_large_data_handler::record_large_rows`
and `cql_table_large_data_handler::record_large_cells` were broken
for reporting static cells and static rows from the very beginning:
In case a large static cell or a large static row is encountered,
it tries to execute `db::try_record` with `nullptr` additional values,
denoting that there is no clustering key to be recorded.
These values are next passed to `qctx.execute_cql()`, which
creates `data_value` instances for each statement parameter,
hence invoking `data_value(nullptr)`.
This uses `const char*` overload which delegates to
`std::string_view` ctor overload. It is UB to pass `nullptr`
pointer to `std::string_view` ctor. Hence leading to
segmentation faults in the aforementioned large data reporting
code.
What we want here is to make a null `data_value` instead, so
just add an overload specifically for `std::nullptr_t`, which
will create a null `data_value` with `text` type.
A regression test is provided for the issue (written in
`cql-pytest` framework).
Tests: test/cql-pytest/test_large_cells_rows.py
Fixes: #6780
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20201223204552.61081-1-pa.solodovnikov@scylladb.com>
In Alternator's expression parser in alternator/expressions.g, a list can be
indexed by a '[' INTEGER ']'. I had doubts whether maybe a value-reference
for the index, e.g., "something[:xyz]", should also work. So this patch adds
a test that checks whether "something[:xyz]" works, and confirms that both
DynamoDB and Alternator don't accept it and consider it a syntax error.
So Alternator's parser is correct to insist that the index be a literal
integer.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201214100302.2807647-1-nyh@scylladb.com>
* seastar 2bd8c8d088...1f5e3d3419 (5):
> Merge "Avoid fair-queue rovers overflow if not configured" from Pavel E
> doc: add a coroutines section to the tutorial
> Merge "tests/perf: add random-seed config option" from Benny
> iotune: Print parameters affecting the measurement results
> cook: Add patch cmd for ragel build (signed char confusion on aarch64)
Expand the role of AST classes to also supply methods for actually
generating the code. More changes will follow eventually until
all generation code is handled by these classes.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
"
We've encountered a number of reactor stalls
related to token_metadata that were fixed
in 052a8d036d.
This is a follow-up series that adds a clear_gently
method to token_metadata that uses continuations
to prevent reactor stalls when destroying token_metadata
objects.
Test: unit(dev), {network_topology_strategy,storage_proxy}_test(debug)
"
* tag 'token_metadata_clear_gently-v3' of github.com:bhalevy/scylla:
token_metadata: add clear_gently
token_metadata: shared_token_metadata: add mutate_token_metadata
token_metdata: futurize update_normal_tokens
abstract_replication_strategy: get_pending_address_ranges: invoke clone_only_token_map if can_yield
repair: replace_with_repair: convert to coroutine
A previous patch added test/cql-pytest/cassandra_tests - a framework for
porting Cassandra's unit tests to Python - but only ported two tiny test
files with just 3 tests. In this patch, we finally port a much larger
test file validation/entities/collection_test.java. This file includes
50 separate tests, which cover a lot of aspects of collection support,
as well as how other stuff interact with collections.
As of now, 23 (!) of these 50 tests fail, and exposed six new issues
in Scylla which I carefully documented:
Refs #7735: CQL parser missing support for Cassandra 3.10's new "+=" syntax
Refs #7740: CQL prepared statements incomplete support for "unset" values
Refs #7743: Restrictions missing support for "IN" on tables with
collections, added in Cassandra 4.0
Refs #7745: Length of map keys and set items are incorrectly limited to 64K
in unprepared CQL
Refs #7747: Handling of multiple list updates in a single request differs
from recent Cassandra
Refs #7751: Allow selecting map values and set elements, like in
Cassandra 4.0
These issues vary in severity - some are simply new Cassandra 4.0 features
that Scylla never implemented, but one (#7740) is an old Cassandra 2.2
feature which it seems we did not implement correctly in some cases that
involve collections.
Note that there are some things that the ported tests do not include.
In a handful of places there are things which the Python driver checks,
before sending a request - not giving us an opportunity to check how
the server handles such errors. Another notable change in this port is
that the original tests repeated a lot of tests with and without a
"nodetool flush". In this port I chose to stub the flush() function -
it does NOT flush. I think the point of these tests is to check the
correctness of the CQL features - *not* to verify that memtable flush
works correctly. Doing a real memtable flush is not only slow, it also
doesn't really check much (Scylla may still serve data from cache,
not sstables). So I decided it is pointless.
An important goal of this patch is that all 50 tests (except three
skipped tests because Python has client-side checking), pass when
run on Cassandra (with test/cql-pytest/run-cassandra). This is very
important: It was very easy to make mistakes while porting the tests,
and I did make many such mistakes; But running the against Cassandra
allowed me to fix those mistakes - because the correct tests should
pass on Cassandra. And now they do.
Unfortunately, the new tests are significantly slower than what we've
been accustomed in Alternator/CQL tests. The 50 tests create more than a
hundred tables, udfs, udts, and similar slow operations - they do not
reuse anything via fixtures. The total time for these 50 tests (in dev
build mode) is around 18 seconds. Just one test - testMapWithLargePartition
is responsibe for almost half (!) of that time - we should consider in
the future whether it's worth it or can be made smaller.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201215155802.2867386-1-nyh@scylladb.com>
clear_gently gently clears the token_metadata members.
It uses continuations to allow yielding if needed
to prevent reactor stalls.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
mutate_token_metadata acquires the shared_token_metadata lock,
clones the token_metadata (using clone_async)
and calls an asynchronous functor on
the cloned copy of the token_metadata to mutate it.
If the functor is successful, the mutated clone
is set back to to the shared_token_metadata,
otherwise, the clone is destroyed.
With that, get rid of shared_token_metadata::clone
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The function complexity if O(#tokens) in the worst case
as for each endpoint token to traverses _token_to_endpoint_map
lineraly to erase the endpoint mapping if it exists.
This change renames the current implementation of
update_normal_tokens to update_normal_tokens_sync
and clones the code as a coroutine that returns a future
and may yield if needed.
Eventually we should futurize the whole token_metadata
and abstract_replication_strategy interface and get rid
of the synchronous functions. Until then the sync
version is still required from call sites that
are neither returning a future nor run in a seastar thread.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add the seastar-cpu-map.sh to the SBINFILES variable, which is used to
create symbolic links to scripts so that they appear in $PATH.
Please note that there are additional Python scripts (like perftune.py),
which are not in $PATH. That's because Python scripts are handled
separately in "install.sh" and no Python script has a "sbin" symlink. We
might want to change this in the future, though.
Fixes#6731Closes#7809
This is a temporary scaffold for weaning ourselves off
linearization. It differs from with_linearized_managed_bytes in
that it does not rely on the environment (linearization_context)
and so is easier to remove.
We use managed_bytes::data() in a few places when we know the
data is non-fragmented (such as when the small buffer optimization
is in use). We'd like to remove managed_bytes::data() as linearization
is bad, so in preparation for that, replace internal uses of data()
with the equivalent direct access.
do_linearize() is an impure function as it changes state
in linearization_context. Extract the pure parts into a new
do_linearize_pure(). This will be used to linearize managed_bytes
without a linearization_context, during the transition period where
fragmented and non-fragmented values coexist.
do_get_value() is careful to return a fragmented view, but
its only caller get_value_from_mutation() linearizes it immediately
afterwards. Linearize it sooner; this prevents mixing in
fragmented values from cells (now via IMR) and fragmented values
from partition/clustering keys. It only works now because
keys are not fragmented outside LSA, and value_view has a special
case for single-fragment values.
This helps when keys become fragmented.
Converting from bytes to bytes is nonsensical, but it helps
when transitioning to other types (managed_bytes/managed_bytes_view),
and these types will have to_bytes() conversions.
We introduce a new single-key sstable reader for sstables created by `TimeWindowCompactionStrategy`.
The reader uses the fact that sstables created by TWCS are mostly disjoint with respect to the contained `position_in_partition`s in order to avoid having multiple sstable readers opened at the same time unnecessarily. In case there are overlapping ranges (for example, in the current time-window), it performs the necessary merging (it uses `clustering_order_reader_merger`, introduced recently).
The reader uses min/max clustering key metadata present in `md` sstables in order to decide when to open or close a sstable reader.
The following experiment was performed:
1. create a TWCS table with 1 minute windows
2. fill the table with 8 equal windows of data
(each window flushed to a separate sstable)
3. perform `select * from ks.t where pk = 0 limit 1` query
with and without the change
The expectation is that with the commit, only one sstable will be opened
to fetch that one row; without the commit all 8 sstables would be opened at once.
The difference in the value of `scylla_reactor_aio_bytes_read` was measured
(value after the query minus value before the query), both with and without the commit.
With the commit, the difference was 67584.
Without the commit, the difference was 528384.
528384 / 67584 ~= 7.8.
Fixes#6418.
Closes#7437
* github.com:scylladb/scylla:
sstables: gather clustering key filtering statistics in TWCS single key reader
sstables: use time_series_sstable_set in time_window_compaction_strategy
sstable_set: new reader for TWCS single partition queries
mutation_reader_test: test clustering_order_reader_merger with time_series_sstable_set
sstable_set: introduce min_position_reader_queue
sstable_set: introduce time_series_sstable_set
sstables: add min_position and max_position accessors
sstable_set: make create_single_key_sstable_reader a virtual method
clustering_order_reader_merger: fix the 0 readers case
The following experiment was performed:
1. create a TWCS table with 1 minute windows
2. fill the table with 8 windows of data
(each window flushed to a separate sstable)
3. perform `select * from ks.t where pk = 0 limit 1` query
with and without the change
The expectation is that with the commit, only one sstable will be opened
to fetch that one row; without the commit all 8 sstables would be opened at once.
The difference in the value of `scylla_reactor_aio_bytes_read` was measured
(value after the query minus value before the query), both with and without the commit.
With the commit, the difference was 67584.
Without the commit, the difference was 528384.
528384 / 67584 ~= 7.8.
Fixes https://github.com/scylladb/scylla/issues/6418.
This commit introduces a new implementation of `create_single_key_sstable_reader`
in `time_series_sstable_set` dedicated for TWCS-created sstables.
It uses the fact that such sstables are mostly disjoint with respect to
contained `position_in_partition`s in order to decrease the number of
sstable readers that are opened at the same time.
The implementation uses `clustering_order_reader_merger` under the hood.
The reader assumes that the schema does not have static columns and none
of the queried sstable contain partition tombstones; also, it assumes
that the sstables have the min/max clustering key metadata in order for
the implementation to be efficient. Thus, if we detect that some of
these assumptions aren't true, we fall back to the old implementation.
This is a queue of readers of sstables in a time_series_sstable_set,
returning the readers in order of the smallest position_in_partition
that the sstables have. It uses the min/max clustering key sstable
metadata.
The readers are opened lazily, at the moment of being returned.
At this moment it is a slightly less efficient version of
bag_sstable_set, but in following commits we will use the new data
structures to gain advantage in single partition queries
for sstables created by TimeWindowCompactionStrategy.
... of sstable_set_impl.
Soon we shall provide a specialized implementation in one of the
`sstable_set_impl` derived classes.
The existing implementation is used as the default one.
There were two problems with handling conflicting equalities on the same PK column (eg, c=1 AND c=0):
1. When the column is indexed, Scylla crashed (#7772)
2. Computing ranges and slices was throwing an exception
This series fixes them both; it also happens to resolve some old TODOs from restriction_test.
Tests: unit (dev, debug)
Closes#7804
* github.com:scylladb/scylla:
cql3: Fix value_for when restriction is impossible
cql3: Fix range computation for p=1 AND p=1
Previously, single_column_restrictions::value_for() assumed that a
column's restriction specifies exactly one value for the column. But
since 37ebe521e3, multiple equalities on the same column are allowed,
so the restriction could be a conjunction of conflicting
equalities (eg, c=1 AND c=0). That violates an assert and crashes
Scylla.
This patch fixes value_for() by gracefully handling the
impossible-restriction case.
Fixes#7772
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Previously compute_bounds was assuming that primary-key columns are
restricted by exactly one equality, resulting in the following error:
query 'select p from t where p=1 and p=1' failed:
std::bad_variant_access (std::get: wrong index for variant)
This patch removes that assumption and deals correctly with the
multiple-equalities case. As a byproduct, it also stops raising
"invalid null value" exceptions for null RHS values.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Split `write`, `read` and `skip` serializer function writers to
separate functions in `handle_class` and `handle_enum` functions,
which slightly improves readability.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
* Introduce `ns_qualified_name` and `template_params_str` functions
to simplify code a little bit in `handle_enum` and `handle_class`
functions.
* Previously each serializer had a separate namespace open-close
statements, unify them into a single namespace scope.
* Fix a few more `hout` -> `cout` argument names.
* Rename `template` pattern to `template_decl` to improve clarity.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Prior to the patch all functions that are called from `add_visitors`
and this function itself declared the argument denoting the output
file as `hout`. Though, this was quite misleading since `hout`
is meant to be header file with declarations, while `cout` is an
implementation file.
These functions write to implmentation file hence `hout` should
be changed to `cout` to avoid confusion.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Prior to the patch `btype` production was using `with_colon`
rule, which accidentally supported parsing both numbers and
identifiers (along with other invalid inputs, such as "123asd").
It was changed to use `ns_qualified_ident` and those places
which can accept numeric constants, are explicitly listing
it as an alternative, e.g. template parameter list.
Unfortunately, I had to make TemplateType to explicitly construct
`BasicType` instances from numeric constants in template arguments
list. This is exactly the way it was handled before, though.
But nonetheless, this should be addressed sometime later.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Remove the following functions since they are not used:
* `open_namespaces`
* `close_namespaces`
* `flat_template`
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
This patch replaces use of some handwritten rules to use their
alternatives already defined in `pyparsing.pyparsing_common`
class, i.e.: `number`, `identifier` productions.
Changed ignore patterns for comments to use pre-defined
`pp.cppStyleComment` instead of hand-written combination of
'//'-style and C-style comment rules.
Operator '-' is now used whenever possible to improve debugging
experience: it disables default backtracking for productions
so that compiler fails earlier and can now point more precisely
to a place in the input string where it failed instead of
backtracking to the top-level rule and reporting error there.
Template names and class names now use `ns_qualified_ident`
rule instead of `with_colon` which prevents grammar from
matching invalid identifiers, such as `std:vector`.
Many places are using the updated `identifier` production, which
is working correctly unlike its predecessor: now inputs
such as `1ident` are considered invalid.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Surround string representation with angle brackets. This improves
readability when printing debug output.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Since idl compiler is not intended to be used as a module to other
python build scripts, move initialization code under an if checking
that current module name is "__main__".
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
When recycling a segment in O_DSYNC mode if the size of the segment
is neither shrunk nor grown, avoid calling file::truncate() or
file::allocate().
Message-Id: <20201215182332.1017339-2-kostja@scylladb.com>
"
This patch series consists of the following patches:
1. The first one turned out to be a massive rewrite of almost
everything in `idl-compiler.py`. It aims to decouple parser
structures from the internal representation which is used
in the code-generation itself.
Prior to the patch everything was working with raw token lists and
the code was extremely fragile and hard to understand and modify.
Moreover, every change in the parser code caused a cascade effect
of breaking things at many different places, since they were relying
on the exact format of output produced by parsing rules.
Now there is a bunch of supplementary AST structures which provide
hierarchical and strongly typed structure as the output of parsing
routine.
It is much easier to verify (by the means of `isinstance`, for example)
and extend since the internal structures used in code-generation are
decoupled from the structure of parsing rules, which are now controlled
by custom parse actions providing high-level abstractions.
It is tested manually by checking that the old code produces exactly
the same autogenerated sources for all Scylla IDLs as the new one.
2 and 3. Cosmetics changes only: fixed a few typos and moved from
old-fashioned `string.Template` to python f-strings.
This improves readability of the idl-compiler code by a lot.
Only one non-functional whitespace change introduced.
4. This patch adds a very basic support for the parser to
understand `const` specifier in case it's used with a template
parameter for a data member in a class, e.g.
struct my_struct {
std::vector<const raft::log_entry> entries;
};
It actually does two things:
* Adjusts `static_asserts` in corresponding serializer methods
to match const-ness of fields.
* Defines a second serializer specialization for const type in
`.dist.hh` right next to non-const one.
This seems to be sufficient for raft-related uses for now.
Please note there is no support for the following cases, though:
const std::vector<raft::log_entry> entries;
const raft::term_t term;
None of the existing IDLs are affected by the change, so that
we can gradually improve on the feature and write the idl
unit-tests to increase test coverage with time.
5. A basic unit-test that writes a test struct with an
`std::vector<S<const T>>` field and reads it back to verify
that serialization works correctly.
6. Basic documentation for AST classes.
TODO: should also update the docs in `docs/IDL.md`. But it is already
quite outdated, and some changes would even be out of scope for this
patch set.
"
* 'idl-compiler-refactor-v5' of https://github.com/ManManson/scylla:
idl: add docstrings for AST classes
idl: add unit-test for `const` specifiers feature
idl: allow to parse `const` specifiers for template arguments
idl: fix a few typos in idl-compiler
idl: switch from `string.Template` to python f-strings and format string in idl-compiler
idl: Decouple idl-compiler data structures from grammar structure
feed_writer() eats exception and transforms it into an end of stream
instead. Downstream validators hate when this happens.
Fixes#7482
Message-Id: <20201216090038.GB3244976@scylladb.com>
A tool which lists all partitions contained in an sstable index. As all
partitions in an sstable are indexed, this tool can be used to find out
what partitions are contained in a given sstable.
The printout has the following format:
$pos: $human_readable_value (pk{$raw_hex_value})
Where:
* $pos: the position of the partition in the (decompressed) data file
* $human_readable_value: the human readable partition key
* $raw_hex_value: the raw hexadecimal value of the binary representation
of the partition key
For now the tool requires the types making up the partition key to be
specified on the command line, using the `--type|-t` command line
argument, using the Cassandra type class name notation for types.
As these are not assumed to be widely known, this patch includes a
document mapping all cql3 types to their Cassandra type class name
equivalent (but not just).
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20201208092323.101349-1-bdenes@scylladb.com>
Fixes#7732
When truncating with auto_snapshot on, we try to verify the low rp mark
from the CF against the sstables discarded by the truncation timestamp.
However, in a scenario like:
Fill memtables
Flush
Truncate with snapshot A
Fill memtables some more
Truncate
Move snapshot A to upload + refresh (load old tables)
Truncate
The last op will assert, because while we have sstables loaded, which
will be discarded now, we did not in fact generate any _new_ ones
(since memtables are empty), and the RP we get back from discard is
one from an earlier generation set.
(Any permutation of events that create the situation "empty memtable" +
"non-empty sstables with only old tables" will generate the same error).
Added a check that before flushing checks if we actually have any
data, and if not, does not uphold the RP relation assert.
Closes#7799
This series makes sure that before the table is dropped, all pending memtable flushes related to its memtables would finish.
Normally, flushes are not problematic in Scylla, because all tables are by default `auto_snapshot=true`, which also implies that a table is flushed before being dropped. However, with `auto_snapshot=false` the flush is not attempted at all. It leads to the following race:
1. Run a node with `auto_snapshot=false`
2. Schedule a memtable flush (e.g. via nodetool)
3. Get preempted in the middle of the flush
4. Drop the table
5. The flush that already started wakes up and starts operating on freed memory, which causes a segfault
Tests: manual(artificially preempting for a long time in bullet point 2. to ensure that the race occurs; segfaults were 100% reproducible before the series and do not happen anymore after the series is applied)
Fixes#7792Closes#7798
* github.com:scylladb/scylla:
database: add flushes to waiting for pending operations
table: unify waiting for pending operations
database: add a phaser for flush operations
database: add waiting for pending streams on table drop
This patch introduces very limited support for declaring `const`
template parameters in data members.
It's not covering all the cases, e.g.
`const type member_variable` and `const template_def<T1, T2, ...>`
syntax is not supported at the moment.
Though the changes are enough for raft-related use: this makes it
possible to declare `std::vector<raft::log_entries_ptr>` (aka
`std::vector<lw_shared_ptr<const raft::log_entry>>`) in the IDL.
Existing IDL files are not affected in any way.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Move to a modern and lightweight syntax of f-strings
introduced in python 3.6. It improves readability and provides
greater flexibility.
A few places are now using format strings instead, though.
In case when multiline substitution variable is used, the template
string should be first re-indented and only after that the
formatting should be applied, or we can end up with screwed
indentation the in generated sources.
This change introduces one invisible whitespace change
in `query.dist.impl.hh`, otherwise all generated code is exactly
the same.
Tests: build(dev) and diff genetated IDL sources by hand
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Instead of operating on the raw lists of tokens, transform them into
typed structures representation, which makes the code by many orders of
magnitude simpler to read, understand and extend.
This includes sweeping changes throughout the whole source code of the
tool, because almost every function was tightly coupled to the way
data was passed down from the parser right to the code generation
routines.
Tested manually by checking that old generated sources are precisely
the same as the new generated sources.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Pending flushes can participate in races when a table
with auto_snapshot==false is dropped. The race is as follows:
1. A flush of table T is initiated
2. The flush operation is preempted
3. Table T is dropped without flushing, because it has auto_snapshot off
4. The flush operation from (2.) wakes up and continues
working on table T, which is already dropped
5. Segfault/memory corruption
To prevent such races, a phaser for pending flushes is introduced
We already wait for pending reads and writes, so for completeness
we should also wait for all pending stream operations to finish
before dropping the table to avoid inconsistencies.
Download node_exporter in frozen image to prepare adding node_exporter
to relocatable pacakge.
Related #2190Closes#7765
[avi: updated toolchain, x86_64/aarch64/s390x]
Alternator tracing tests require the cluster to have the 'always'
isolation level configured to work properly. If that's not the case,
the tests will fail due to not having CAS-related traces present
in the logs. In order to help the users fix their configuration,
a helper message is printed before the test case is performed.
Automatic tests do not need this, because they are all ran with
matching isolation level, but this message could greatly improve
the user experience for manual tests.
Message-Id: <62bcbf60e674f57a55c9573852b6a28f99cbf408.1607949754.git.sarna@scylladb.com>
The outcome of alternator tracing tests was that tracing probability
was always set to 0 after the test was finished. That makes sense
for most test runs, but manual tests can work on existing clusters
with tracing probability set to some other value. Due to preserve
previous trace probability, the value is now extracted and stored,
so that it can be restored after the test is done.
Message-Id: <94f829b63f92847b4abb3b16f228bf9870f90c2e.1607949754.git.sarna@scylladb.com>
Normally a file size should be aligned around block size, since
we never write to it any unaligned size. However, we're not
protected against partial writes.
Just to be safe, align up the amount of bytes to zerofill
when recycling a segment.
Message-Id: <20201211142628.608269-4-kostja@scylladb.com>
Three tests in test_streams.py run update_table() on a table without
waiting for it to complete, and then call update_table() on the same
table or delete it. This always works in Scylla, and usually works in
AWS, but if we reach the second call, it may fail because the previous
update_table() did not take effect yet. We sometimes see these failures
when running the Alternator test suite against AWS.
So in this patch, after an each update_table() we wait for the table
to return from UPDATING to ACTIVE status.
The entire Alternator test suite now passes (or skipped) on AWS,
so: Fixes#7778.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201213164931.2767236-1-nyh@scylladb.com>
The test test_query_filter.py::test_query_filter_paging fails on AWS
and shouldn't fail, so this patch fixes the test. Note that this is
only a test problem - no fix is needed for Alternator itself.
The test reads 20 results with 1-result pages, and assumed that
21 pages are returned. The 21st page may happen because when the
server returns the 20th, it might not yet know there will be no
additional results, so another page is needed - and will be empty.
Still a different implementation might notice that the last page
completed the iteration, and not return an extra empty page. This is
perfectly fine, and this is what AWS DynamoDB does today - and should
not be considered an error.
Refs #7778
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201213143612.2761943-1-nyh@scylladb.com>
When request signature checking is enabled in Alternator, each request
should come with the appropriate Authorization header. Most errors in
this preparing this header will result in an InvalidSignatureException
response; But DynamoDB returns a more specific error when this header is
completely missing: MissingAuthenticationTokenException. We should do the
same, but before this patch we return InvalidSignatureException also for
a missing header.
The test test_authorization.py::test_no_authorization_header used to
enshrine our wrong error message, and failed when run against AWS.
After this patch, we fix the error message and the test - which now
passes against both Alternator and AWS.
Refs #7778.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201213133825.2759357-1-nyh@scylladb.com>
This series allows setting per-query timeout via CQL. It's possible via the existing `USING` clause, which is extended to be available for `SELECT` statement as well. This parameter accepts a duration and can also be provided as a marker.
The parameter acts as a regular part of the `USING` clause, which means that it can be used along with `USING TIMESTAMP` and `USING TTL` without issues.
The series comes with a pytest test suite.
Examples:
```cql
SELECT * FROM t USING TIMEOUT 200ms;
```
```cql
INSERT INTO t(a,b,c) VALUES (1,2,3) USING TIMESTAMP 42 AND TIMEOUT 50ms;
```
Working with prepared statements works as usual - the timeout parameter can be
explicitly defined or provided as a marker:
```cql
SELECT * FROM t USING TIMEOUT ?;
```
```cql
INSERT INTO t(a,b,c) VALUES (?,?,?) USING TIMESTAMP 42 AND TIMEOUT 50ms;
```
Tests: unit(dev)
Fixes#7777Closes#7781
* github.com:scylladb/scylla:
test: add prepared statement tests to USING TIMEOUT suite
docs: add an entry about USING TIMEOUT
test: add a test suite for USING TIMEOUT
storage_proxy: start propagating local timeouts as timeouts
cql3: allow USING clause for SELECT statement
cql3: add TIMEOUT attribute to the parser
cql3: add per-query timeout to select statement
cql3: add per-query timeout to batch statement
cql3: add per-query timeout to modification statement
cql3: add timeout to cql attributes
First of all, select statement is extended with an 'attrs' field,
which keeps the per-query attributes. Currently, only TIMEOUT
parameter is legal to use, since TIMESTAMP and TTL bear no meaning
for reads.
Secondly, if TIMEOUT attribute is set, it will be used as the effective
timeout for a particular query.
1. It's unused since cbe510d1b8
2. It's unsafe to keep a reference to token_metadata&
potentially across yield points.
The higher-level motivation is to make
storage_service::get_token_metadata() private so we
can control better how it's used.
For cdc, if the token_metadata is going to be needed
to the future, it'd be better get it from
db_context::_proxy.get_token_metadata_ptr().
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201213162351.52224-2-bhalevy@scylladb.com>
"
This series fixes use-after-free via token_metadata&
We may currently get a token_metadata& via get_token_metadata() and
use it across yield points in a couple of sites:
- do_decommission_removenode_with_repair
- get_new_source_ranges
To fix that, get_token_metadata_ptr and hold on to it
across yielding.
Fixes#7790
Dtest: update_cluster_layout_tests:TestUpdateClusterLayout.simple_removenode_2_test(debug)
Test: unit(dev)
"
* tag 'storage_service-token_metadata_ptr-v2' of github.com:bhalevy/scylla:
storage_service: get_new_source_ranges: don't hold token_metadata& across yield point
storage_service: get_changed_ranges_for_leaving: no need to maybe_yield for each token_range
storage_service: get_changed_ranges_for_leaving: release token_metadata_ptr sooner
storage_service: get_changed_ranges_for_leaving: don't hold token_metadata& across yield
Provide the token_metadata& to get_new_source_ranges by the caller,
who keeps it valid throughout the call.
Note that there is no need to clone_only_token_map
since the token_metadata_ptr is immutable and can be
used just as well for calling strat.get_range_addresses.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When yielding in clone_only_token_map or clone_after_all_left
the token_metadata got with get_token_metadata() may go away.
Use get_token_metadata_ptr() instead to hold on to it.
And with that, we don't need to clone_only_token_map.
`metadata` is not modified by calculate_natural_endpoints, so we
can just refer to the immutable copy retrieved with
get_token_metadata_ptr.
Fixes#7790
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
"
The validate_column_family() helper uses the global proxy
reference to get database from. Fortunatelly, all the callers
of it can provide one via argument.
tests: unit(dev)
"
* 'br-no-proxy-in-validate' of https://github.com/xemul/scylla:
validation: Remove get_local_storage_proxy call
client_state: Call validate_column_family() with database arg
client_state: Add database& arg to has_column_family_access
storage_proxy: Add .local_db() getters
validate: Mark database argument const
"
The initial intent was to remove call for global storage service from
secondary index manager's create_view_for_index(), but while fixing it
one of intermediate schema table's helper managed to benefit from it
by re-using the database reference flying by.
The cleanup is done by simply pushing the database reference along the
stack from the code that already has it down the create_view_for_index().
tests: unit(dev)
"
* 'br-no-storages-in-index-and-schema' of https://github.com/xemul/scylla:
schema-tables: Use db from make_update_table_mutations in make_update_indices_mutations
schema-tables: Add database argument to make_update_table_mutations
schema-tables: Factor out calls getting database instance
index-manager: Move feature evaluation one level up
`ops` might be passed as a disengaged shared_ptr when called
from `decommission_with_repair`.
In this case we need to propagate to sync_data_using_repair a
disengaged std::optional<utils::UUID>.
Fixes#7788
DTest: update_cluster_layout_tests:TestUpdateClusterLayout.verify_latest_copy_decommission_node_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201213073743.331253-1-bhalevy@scylladb.com>
* seastar 8b400c7b45...2de43eb6bf (3):
> core: show span free sizes correctly in diagnostics
> Merge "IO queues to share capacities" from Pavel E
> file: make_file_impl: determine blockdev using st_mode
The switch to clang disabled the clang-specific -Wunused-value
since it generated some harmless warnings. Unfortunately, that also
prevent [[nodiscard]] violations from warning.
Fix by clearing all instances of the warning (including [[nodiscard]]
violations that crept in while it was disabled) and reinstating the warning.
Closes#7767
* github.com:scylladb/scylla:
build: reinstate -Wunused-value warning for [[nodiscard]]
test: lib: don't ignore future in compare_readers()
test: mutation_test: check both ranges when comparing summaries
serialializer: silence unused value warning in variant deserializer
tuned 2.11.0-9 and later writes to kerned.sched_wakeup_granularity_ns
and other sysctl tunables that we so laboriously tuned, dropping
performance by a factor of 5 (due to increased latency). Fix by
obsoleting tuned during install (in effect, we are a better tuned,
at least for us).
Not needed for .deb, since debian/ubunto do not install tuned by
default.
Fixes#7696Closes#7776
Two halves of the tunnel finally connect -- the
latter helper needs the local database instance and
is only called by the former one which already has it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are 3 callers of this helper (cdc, migration manager and tests)
and all of them already have the database object at hands.
The argument will be used by next patch to remove call for global
storage proxy instance from make_update_indices_mutations.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The make_update_indices_mutations gets database instance
for two things -- to find the cf to work with and to get
the value of a feature for index view creation.
To suit both and to remove calls for global storage proxy
and service instances get the database once in the
function entrance. Next patch will clean this further.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The create_view_for_index needs to know the state of the
correct-idx-token-in-secondary-index feature. To get one
it takes quite a long route through global storage service
instance.
Since there's only one caller of the method in question,
and the method is called in a loop, it's a bit faster to
get the feature value in caller and pass it in argument.
This will also help to get rid of the call for global
storage service.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The get_next_partition uses global proxy instance to get
the local database reference. Now it's available in the
reader object itself, so it's possible to remove this
call for global storage proxy.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This reader uses local databse instance in its get_next_partition
method to find keyspaces to work with
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It is used in validate_column_family. The last caller of it was removed by
previous patch, so we may kill the helper itself
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The previous patch brought the databse reference arg. And since
the currently called validate_column_family() overload _just_
gets the database from global proxy, it's better to shortcut.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It is called from cql3/statements' check_access methods and from thrift
handlers. The former have proxy argument from which they can get the
database. The latter already have the database itself on board.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
A sequel to #7692.
This series gets rid of linearization when validating collections and tuple types. (Other types were already validated without linearizing).
The necessary helpers for reading from fragmented buffers were introduced in #7692. All this series does is put them to use in `validate()`.
Refs: #6138Closes#7770
* github.com:scylladb/scylla:
types: add single-fragment optimization in validate()
utils: fragment_range: add with_simplified()
cql3: statements: select_statement: remove unnecessary use of with_linearized
cql3: maps: remove unnecessary use of with_linearized
cql3: lists: remove unnecessary use of with_linearized
cql3: tuples: remove unnecessary use of with_linearized
cql3: sets: remove unnecessary use of with_linearized
cql3: tuples: remove unnecessary use of with_linearized
cql3: attributes: remove unnecessary uses of with_linearized
types: validate lists without linearizing
types: validate tuples without linearizing
types: validate sets without linearizing
types: validate maps without linearizing
types: template abstract_type::validate on FragmentedView
types: validate_visitor: transition from FragmentRange to FragmentedView
utils: fragmented_temporary_buffer: add empty() to FragmentedView
utils: fragmented_temporary_buffer: don't add to null pointer
Manipulating fragmented views is costlier that manipulating contiguous views,
so let's detect the common situation when the fragmented view is actually
contiguous underneath, and make use of that.
Note: this optimization is only useful for big types. For trivial types,
validation usually only checks the size of the view.
Reading from contiguous memory (bytes_view) is significantly simpler
runtime-wise than reading from a fragmented view, due to less state and less
branching, so we often want to convert a fragmented view to a simple view before
processing it, if the fragmented view contains at most one fragment, which is
common. with_simplified() does just that.
This is primarily a stylistic change. It makes the interface more consistent
with deserialize(). It will also allow us to call `validate()` for collection
elements in `validate_aux()`.
This will allow us to easily get rid of linearizations when validating
collections and tuples, because the helpers used in validate_aux() already
have FragmentedView overloads.
When fragmented_temporary_buffer::view is created from a bytes_view,
_current is null. In that case, in remove_current(), null pointer offset
happens, and ubsan complains. Fix that.
The heuristic of STCS reshape is correct, and it built the compaction
descriptor correctly, but forgot to return it to the caller, so no
reshape was ever done on behalf of STCS even when the strategy
needed it.
Fixes#7774.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20201209175044.1609102-1-raphaelsc@scylladb.com>
Currently removenode works like below:
- The coordinator node advertises the node to be removed in
REMOVING_TOKEN status in gossip
- Existing nodes learn the node in REMOVING_TOKEN status
- Existing nodes sync data for the range it owns
- Existing nodes send notification to the coordinator
- The coordinator node waits for notification and announce the node in
REMOVED_TOKEN
Current problems:
- Existing nodes do not tell the coordinator if the data sync is ok or failed.
- The coordinator can not abort the removenode operation in case of error
- Failed removenode operation will make the node to be removed in
REMOVING_TOKEN forever.
- The removenode runs in best effort mode which may cause data
consistency issues.
It means if a node that owns the range after the removenode
operation is down during the operation, the removenode node operation
will continue to succeed without requiring that node to perform data
syncing. This can cause data consistency issues.
For example, Five nodes in the cluster, RF = 3, for a range, n1, n2,
n3 is the old replicas, n2 is being removed, after the removenode
operation, the new replicas are n1, n5, n3. If n3 is down during the
removenode operation, only n1 will be used to sync data with the new
owner n5. This will break QUORUM read consistency if n1 happens to
miss some writes.
Improvements in this patch:
- This patch makes the removenode safe by default.
We require all nodes in the cluster to participate in the removenode operation and
sync data if needed. We fail the removenode operation if any of them is down or
fails.
If the user want the removenode operation to succeed even if some of the nodes
are not available, the user has to explicitly pass a list of nodes that can be
skipped for the operation.
$ nodetool removenode --ignore-dead-nodes <list_of_dead_nodes_to_ignore> <host_id>
Example restful api:
$ curl -X POST "http://127.0.0.1:10000/storage_service/remove_node/?host_id=7bd303e9-4c7b-4915-84f6-343d0dbd9a49&ignore_nodes=127.0.0.3,127.0.0.5"
- The coordinator can abort data sync on existing nodes
For example, if one of the nodes fails to sync data. It makes no sense for
other nodes to continue to sync data because the whole operation will
fail anyway.
- The coordinator can decide which nodes to ignore and pass the decision
to other nodes
Previously, there is no way for the coordinator to tell existing nodes
to run in strict mode or best effort mode. Users will have to modify
config file or run a restful api cmd on all the nodes to select strict
or best effort mode. With this patch, the cluster wide configuration is
eliminated.
Fixes#7359Closes#7626
Verify that the input types are iterators and their value types are compatible
with the compare function.
Because some of the inputs were not actually valid iterators, they are adjusted
too.
Closes#7631
* github.com:scylladb/scylla:
types: add constraint on lexicographical_tri_compare()
composite: make composite::iterator a real input_iterator
compound: make compount_type::iterator a real input_iterator
UpdateItem's "ADD" operation usually adds elements to an existing set
or adds a number to an existing counter. But it can *also* be used
to create a new set or counter (as if adding to an empty set or zero).
We unfortunately did not have a test for this case (creating a new set
or counter), and when I wrote such a test now, I discovered the
implementation was missing. So this patch adds both the test and the
implementation. The new test used to fail before this patch, and passes
with it - and passes on DynamoDB.
Note that we only had this bug for the newer UpdateItem syntax.
For the old AttributeUpdates syntax, we already support ADD actions
on missing attributes, and already tested it in test_update_item_add().
I just forgot to test the same thing for the newer syntax, so I missed
this bug :-(
Fixes#7763.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201207085135.2551845-1-nyh@scylladb.com>
Snitch name needs to be exchanged within cluster once, on shadow
round, so joining nodes cannot use wrong snitch. The snitch names
are compared on bootstrap and on normal node start.
If the cluster already used mixed snitches, the upgrade to this
version will fail. In this case customer needs to add a node with
correct snitch for every node with the wrong snitch, then put
down the nodes with the wrong snitch and only then do the upgrade.
Fixes#6832Closes#7739
Whereas in CQL the client can pass a timeout parameter to the server, in
the DynamoDB API there is no such feature; The server needs to choose
reasonable timeouts for its own internal operations - e.g., writes to disk,
querying other replicas, etc.
Until now, Alternator had a fixed timeout of 10 seconds for its
requests. This choice was reasonable - it is much higher than we expect
during normal operations, and still lower than the client-side timeouts
that some DynamoDB libraries have (boto3 has a one-minute timeout).
However, there's nothing holy about this number of 10 seconds, some
installations might want to change this default.
So this patch adds a configuration option, "--alternator-timeout-in-ms",
to choose this timeout. As before, it defaults to 10 seconds (10,000ms).
In particular, some test runs are unusually slow - consider for example
testing a debug build (which is already very slow) in an extremely
over-comitted test host. In some cases (see issue #7706) we noticed
the 10 second timeout was not enough. So in this patch we increase the
default timeout chosen in the "test/alternator/run" script to 30 seconds.
Please note that as the code is structured today, this timeout only
applies to some operations, such as GetItem, UpdateItem or Scan, but
does not apply to CreateTable, for example. This is a pre-existing
issue that this patch does not change.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201207122758.2570332-1-nyh@scylladb.com>
This reverts commit dc77d128e9. It was reverted
due to a strange and unexplained diff, which is now explained. The
HEAD on the working directory being pulled from was set back, so git
thought it was merging the intended commits, plus all the work that was
committed from HEAD to master. So it is safe to restore it.
"
The multishard_mutation_query test is toooo slow when built
with clang in dev mode. By reducing the number of scans it's
possible to shrink the full suite run time from half an hour
down to ~3 minutes.
tests: unit(dev)
"
* 'br-devel-mode-tests' of https://github.com/xemul/scylla:
test: Make multishard_mutation_query test do less scans
configure: Add -DDEVEL to dev build flags
When built by clang this dev-mode test takes ~30 minutes to
complete. Let's reduce this time by reducing the scale of
the test if DEVEL is set.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
scylla_setup command suggestion does not shows an argument of --io-setup,
because we mistakely stores bool value on it (recognized as 'store_true').
We always need to print '--io-setup X' on the suggestion instead.
Also, --nic is currently ignored by command suggestion, need to print it just like other options.
Related #7395Closes#7724
* github.com:scylladb/scylla:
scylla_setup: print --swap-directory and --swap-size on command suggestion
scylla_setup: print --nic on command suggestion
scylla_setup: fix wrong command suggestion on --io-setup scylla_setup command suggestion does not shows an argument of --io-setup, because we mistakely stores bool value on it (recognized as 'store_true'). We always need to print '--io-setup X' on the suggestion instead.
A sequel to #7692.
This series gets rid of linearization in `serialize_for_cql`, which serializes collections and user types from `collection_mutation_view` to CQL. We switch from `bytes` to `bytes_ostream` as the intermediate buffer type.
The only user of of `serialize_for_cql` immediately copies the result to another `bytes_ostream`. We could avoid some copies and allocations by writing to the final `bytes_ostream` directly, but it's currently hidden behind a template.
Before this series, `serialize_for_cql_aux()` delegated the actual writing to `collection_type_impl::pack` and `tuple_type_impl::build_value`, by passing them an intermediate `vector`. After this patch, the writing is done directly in `serialize_for_cql_aux()`. Pros: we avoid the overhead of creating an intermediate vector, without bloating the source code (because creating that intermediate vector requires just as much code as serializing the values right away). Cons: we duplicate the CQL collection format knowledge contained in `collection_type_impl::pack` and `tuple_type_impl::build_value`.
Refs: #6138Closes#7771
* github.com:scylladb/scylla:
types: switch serialize_for_cql from bytes to bytes_ostream
types: switch serialize_for_cql_aux from bytes to bytes_ostream
types: serialize user types to bytes_ostream
types: serialize lists to bytes_ostream
types: serialize sets to bytes_ostream
types: serialize maps to bytes_ostream
utils: fragment_range: use range-based for loop instead of boost::for_each
types: add write_collection_value() overload for bytes_ostream and value_view
Increase accepted disk-to-RAM ratio to 105 to accomodate even 7.5GB of
RAM for one NVMe log various reasons for not recommending the instance
type.
Fixes#7587Closes#7600
This change enhances the toppartitions api to also return
the cardinality of the read and write sample sets. It now uses
the size() method of space_saving_top_k class, counting the unique
operations in the sampled set for up to the given capacity.
Fixes#4089Closes#7766
When an Alternator table has partition keys or sort keys of type "bytes"
(blobs), a Scan or Query which required paging used to fail - we used
an incorrect function to output LastEvaluatedKey (which tells the user
where to continue at the next page), and this incorrect function was
correct for strings and numbers - but NOT for bytes (for bytes, we
need to encode them as base-64).
This patch also includes two tests - for bytes partition key and
for bytes sort key - that failed before this patch and now pass.
The test test_fetch_from_system_tables also used to fail after a
Limit was added to it, because one of the tables it scans had a bytes
key. That test is also fixed by this patch.
Fixes#7768
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201207175957.2585456-1-nyh@scylladb.com>
Up until now, Scylla's debian packages dependencies versions were
unspecified. This was due to a technical difficulty to determine
the version of the dependent upon packages (such as scylla-python3
or scylla-jmx). Now, when those packages are also built as part of
this repo and are built with a version identical to the server package
itself we can depend all of our packages with explicit versions.
The motivation for this change is that if a user tries to install
a specific Scylla version by installing a specific meta package,
it will silently drag in the latest components instead of the ones
of the requested versions.
The expected change in behavior is that after this change an attempt
to install a metapackage with version which is not the latest will fail
with an explicit error hinting the user what other packages of the same
version should be explicitly included in the command line.
Fixes#5514Closes#7727
The switch to clang disabled the clang-specific -Wunused-value
since it generated some harmless warnings. Unfortunately, that also
prevent [[nodiscard]] violations from warning.
Fix by reinstating the warning, now that all instances of the warning
have been fixed.
A copy/paste error means we ignore the termination of one of the
ranges. Change the comma expression to a disjunction to avoid
the unused value warning from clang.
The code is not perfect, since if the two ranges are not the same
size we'll invoke undefined behavior, but it is no worse than before
(where we ignored the comparison completely).
The variant deserializer uses a fold expression to implement
an if-tree with a short-circuit, producing an intermediate boolean
value to terminate evaluation. This intermediate value is unneeded,
but evokes a warning from clang when -Wunused-value is enabled.
Since we want to enable the warning, add a cast to void to ignore
the intermediate value.
We want to pass bytes_ostream to this loop in later commits.
bytes_ostream does not conform to some boost concepts required by
boost::for_each, so let's just use C++'s native loop.
When getting local ranges, an assumption is made that
if a range does not contain an end or when its end is a maximum token,
then it must contain a start. This assumption proven not true
during manual tests, so it's now fortified with an additional check.
Here's a gdb output for a set of local ranges which causes an assertion
failure when calling `get_local_ranges` on it:
(gdb) p ranges
$1 = std::vector of length 2, capacity 2 = {{_interval = {_start = std::optional<interval_bound<dht::token>> = {[contained value] = {_value = {_kind = dht::token_kind::before_all_keys,
_data = 0}, _inclusive = false}}, _end = std::optional<interval_bound<dht::token>> [no contained value], _singular = false}}, {_interval = {
_start = std::optional<interval_bound<dht::token>> [no contained value], _end = std::optional<interval_bound<dht::token>> = {[contained value] = {_value = {
_kind = dht::token_kind::before_all_keys, _data = 0}, _inclusive = true}}, _singular = false}}}
Closes#7764
The test test_fetch_from_system_tables tests Alternator's system-table
feature by reading from all system tables. The intention was to confirm
we don't crash reading any of them - as they have different schemas and
can run into different problems (we had such problems in the initial
implementation). The intention was not to read *a lot* from each table -
we only make a single "Scan" call on each, to read one page of data.
However, the Scan call did not set a Limit, so the single page can get
pretty big.
This is not normally a problem, but in extremely slow runs - such as when
running the debug build on an extremely overcommitted test machine (e.g.,
issue #7706) reading this large page may take longer than our default
timeout. I'll send a separate patch for the timeout issue, but for now,
there is really no reason why we need to read a big page. It is good
enough to just read 50 rows (with Limit=50). This will still read all
the different types and make the test faster.
As an example, in the debug run on my laptop, this test spent 2.4
seconds to read the "compaction_history" table before this patch,
and only 0.1 seconds after this patch. 2.4 seconds is close to our
default timeout (10 seconds), 0.1 is very far.
Fixes#7706
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201207075112.2548178-1-nyh@scylladb.com>
The original goal of this patch was to replace the two single-node dtests
allow_filtering_test and allow_filtering_secondary_indexes_test, which
recently caused us problems when we wanted to change the ALLOW FILTERING
behavior but the tests were outside the tree. I'm hoping that after this
patch, those two tests could be removed from dtest.
But this patch actually tests more cases then those original dtest, and
moreover tests not just whether ALLOW FILTERING is required or not, but
also that the results of the filtering is correct.
Currently, four of the included tests are expected to fail ("xfail") on
Scylla, reproducing two issues:
1. Refs #5545:
"WHERE x IN ..." on indexed column x wrongly requires ALLOW FILTERING
2. Refs #7608:
"WHERE c=1" on clustering key c should require ALLOW FILTERING, but
doesn't.
All tests, except the one for issue #5545, pass on Cassandra. That one
fails on Cassandra because doesn't support IN on an indexed column at all
(regardless of whether ALLOW FILTERING is used or not).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201115124631.1224888-1-nyh@scylladb.com>
There is a typo in schema.cql of snapshot, lack of comma after
compaction strategy. It will fail to restore schema by the file.
AND compaction = {'class': 'SizeTieredCompactionStrategy''max_compaction_threshold': '32'}
map_as_cql_param() function has a `first` parameter to smartly add
comma, the compaction_strategy_options is always not the first.
Fixes#7741
Signed-off-by: Amos Kong <amos@scylladb.com>
Closes#7734
"
The storage service is called there to get the cached value
of db::system_keyspace::get_local_host_id(). Keeping the value
on database decouples it from storage service and kills one
more global storage service reference.
tests: unit(dev)
"
* 'br-remove-storage-service-from-counters-2' of https://github.com/xemul/scylla:
counters: Drop call to get_local_storage_service and related
counters: Use local id arg in transform_counter_update_to_shards
database: Have local id arg in transform_counter_updates_to_shards()
storage_service: Keep local host id to database
This PR adds the Sphinx documentation generator and the custom theme ``sphinx-scylladb-theme``. Once merged, the GitHub Actions workflow should automatically publish the developer notes stored under ``docs`` directory on http://scylladb.github.io/scylla
1. Run the command ``make preview`` from the ``docs`` directory.
3. Check the terminal where you have executed the previous command. It should not raise warnings.
3. Open in a new browser tab http://127.0.0.1:5500/ to see the generated documentation pages.
The table of contents displays the files sorted as they appear on GitHub. In a subsequent iteration, @lauranovich and I will submit an additional PR proposing a new folder organization structure.
Closes#7752
* github.com:scylladb/scylla:
docs: fixed warnings
docs: added theme
The previous way of deleting records based on the whole
sstatble data_size causes overzealous deletions (#7668)
and inefficiency in the rows cache due to the large number
of range tombstones created.
Therefore we'd be better of by juts letting the
records expire using he 30 days TTL.
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201206083725.1386249-1-bhalevy@scylladb.com>
This reverts commit 0aa1f7c70a, reversing
changes made to 72c59e8000. The diff is
strange, including unrelated commits. There is no understanding of the
cause, so to be safe, revert and try again.
The local host id is now passed by argument, so we don't
need the counter_id::local() and some other methods that
call or are called by it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Only few places in it need the uuid. And since it's only 16 bytes
it's possibvle to safely capture it by value in the called lambdas.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two places that call it -- database code itself and
tests. The former already has the local host id, so just pass
one.
The latter are a bit trickier. Currently they use the value from
storage_service created by storage_service_for_tests, but since
this version of service doesn't pass through prepare_to_join()
the local_host_id value there is default-initialized, so just
default-initialize the needed argument in place.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The value in question is cached from db::system_keyspace
for places that want to have it without waiting for
futures. So far the only place is database counters code,
so keep the value on database itself. Next patches will
make use of it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Citing #6138: > In the past few years we have converted most of our codebase to
work in terms of fragmented buffers, instead of linearised ones, to help avoid
large allocations that put large pressure on the memory allocator. > One
prominent component that still works exclusively in terms of linearised buffers
is the types hierarchy, more specifically the de/serialization code to/from CQL
format. Note that for most types, this is the same as our internal format,
notable exceptions are non-frozen collections and user types. > > Most types
are expected to contain reasonably small values, but texts, blobs and especially
collections can get very large. Since the entire hierarchy shares a common
interface we can either transition all or none to work with fragmented buffers.
This series gets rid of intermediate linearizations in deserialization. The next
steps are removing linearizations from serialization, validation and comparison
code.
Series summary:
- Fix a bug in `fragmented_temporary_buffer::view::remove_prefix`. (Discovered
while testing. Since it wasn't discovered earlier, I guess it doesn't occur in
any code path in master.)
- Add a `FragmentedView` concept to allow uniform handling of various types of
fragmented buffers (`bytes_view`, `temporary_fragmented_buffer::view`,
`ser::buffer_view` and likely `managed_bytes_view` in the future).
- Implement `FragmentedView` for relevant fragmented buffer types.
- Add helper functions for reading from `FragmentedView`.
- Switch `deserialize()` and all its helpers from `bytes_view` to
`FragmentedView`.
- Remove `with_linearized()` calls which just became unnecessary.
- Add an optimization for single-fragment cases.
The addition of `FragmentedView` might be controversial, because another concept
meant for the same purpose - `FragmentRange` - is already used. Unfortunately,
it lacks the functionality we need. The main (only?) thing we want to do with a
fragmented buffer is to extract a prefix from it and `FragmentRange` gives us no
way to do that, because it's immutable by design. We can work around that by
wrapping it into a mutable view which will track the offset into the immutable
`FragmentRange`, and that's exactly what `linearizing_input_stream` is. But it's
wasteful. `linearizing_input_stream` is a heavy type, unsuitable for passing
around as a view - it stores a pair of fragment iterators, a fragment view and a
size (11 words) to conform to the iterator-based design of `FragmentRange`, when
one fragment iterator (4 words) already contains all needed state, just hidden.
I suggest we replace `FragmentRange` with `FragmentedView` (or something
similar) altogether.
Refs: #6138Closes#7692
* github.com:scylladb/scylla:
types: collection: add an optimization for single-fragment buffers in deserialize
types: add an optimization for single-fragment buffers in deserialize
cql3: tuples: don't linearize in in_value::from_serialized
cql3: expr: expression: replace with_linearize with linearized
cql3: constants: remove unneeded uses of with_linearized
cql3: update_parameters: don't linearize in prefetch_data_builder::add_cell
cql3: lists: remove unneeded use of with_linearized
query-result-set: don't linearize in result_set_builder::deserialize
types: remove unneeded collection deserialization overloads
types: switch collection_type_impl::deserialize from bytes_view to FragmentedView
cql3: sets: don't linearize in value::from_serialized
cql3: lists: don't linearize in value::from_serialized
cql3: maps: don't linearize in value::from_serialized
types: remove unused deserialize_aux
types: deserialize: don't linearize tuple elements
types: deserialize: don't linearize collection elements
types: switch deserialize from bytes_view to FragmentedView
types: deserialize tuple types from FragmentedView
types: deserialize set type from FragmentedView
types: deserialize map type from FragmentedView
types: deserialize list type from FragmentedView
types: add FragmentedView versions of read_collection_size and read_collection_value
types: deserialize varint type from FragmentedView
types: deserialize floating point types from FragmentedView
types: deserialize decimal type from FragmentedView
types: deserialize duration type from FragmentedView
types: deserialize IP address types from FragmentedView
types: deserialize uuid types from FragmentedView
types: deserialize timestamp type from FragmentedView
types: deserialize simple date type from FragmentedView
types: deserialize time type from FragmentedView
types: deserialize boolean type from FragmentedView
types: deserialize integer types from FragmentedView
types: deserialize string types from FragmentedView
types: remove unused read_simple_opt
types: implement read_simple* versions for FragmentedView
utils: fragmented_temporary_buffer: implement FragmentedView for view
utils: fragment_range: add single_fragmented_view
serializer: implement FragmentedView for buffer_view
utils: fragment_range: add linearized and with_linearized for FragmentedView
utils: fragment_range: add FragmentedView
utils: fragmented_temporary_buffer: fix view::remove_prefix
Values usually come in a single fragment, but we pay the cost of fragmented
deserialization nevertheless: bigger view objects (4 words instead of 2 words)
more state to keep updated (i.e. total view size in addition to current fragment
size) and more branches.
This patch adds a special case for single-fragment buffers to
abstract_type::deserialize. They are converted to a single_fragmented_view
before doing anything else. Templates instantiated with single_fragmented_view
should compile to better code than their multi-fragmented counterparts. If
abstract_type::deserialize is inlined, this patch should completely prevent any
performance penalties for switching from with_linearized to fragmented
deserialization.
with_linearized creates an additional internal `bytes` when the input is
fragmented. linearized copies the data directly to the output `bytes`, so it's
more efficient.
Devirtualizes collection_type_impl::deserialize (so it can be templated) and
adds a FragmentedView overload. This will allow us to deserialize collections
with explicit cql_serialization_format directly from fragmented buffers.
The final part of the transition of deserialize from bytes_view to
FragmentedView.
Adds a FragmentedView overload to abstract_type::deserialize and
switches deserialize_visitor from bytes_view to FragmentedView, allowing
deserialization of all types with no intermediate linearization.
The partition builder doesn't expect the looked-up row to exist. In fact
it already existing is a sign of a bug. Currently bugs resulting in
duplicate rows will manifest by tripping an assert in
`row::append_cell()`. This however results in poor diagnostics, so we
want to catch these errors sooner to be able to provide higher level
diagnostics. To this end, switch to the freshly introduced
`append_clustering_row()` so that duplicate rows are found early and in
a context where their identity is known.
This abstraction is used to merge the output of multiple readers, each
opened for a single partition query, into a non-decreasing stream
of mutation_fragments.
It is similar to `mutation_reader_merger`,
but an important difference is that the new merger may select new readers
in the middle of a partition after it already returned some fragments
from that partition. It uses the new `position_reader_queue` abstraction
to select new readers. It doesn't support multi-partition (ring range) queries.
The new merger will be later used when reading from sstable sets created
by TimeWindowCompactionStrategy. This strategy creates many sstables
that are mostly disjoint w.r.t the contained clustering keys, so we can
delay opening sstable readers when querying a partition until after we have
processed all mutation fragments with positions before the keys
contained by these sstables.
A microbenchmark was added that compares the existing combining reader
(which uses `mutation_reader_merger` underneath) with a new combining reader
built using the new `clustering_order_reader_merger` and a simple queue of readers
that returns readers from some supplied set. The used set of readers is built from the following
ranges of keys (each range corresponds to a single reader):
`[0, 31]`, `[30, 61]`, `[60, 91]`, `[90, 121]`, `[120, 151]`.
The microbenchmark runs the reader and divides the result by the number of mutation fragments.
The results on my laptop were:
```
$ build/release/test/perf/perf_mutation_readers -t clustering_combined.* -r 10
single run iterations: 0
single run duration: 1.000s
number of runs: 10
test iterations median mad min max
clustering_combined.ranges_generic 2911678 117.598ns 0.685ns 116.175ns 119.482ns
clustering_combined.ranges_specialized 3005618 111.015ns 0.349ns 110.063ns 111.840ns
```
`ranges_generic` denotes the existing combining reader, `ranges_specialized` denotes the new reader.
Split from https://github.com/scylladb/scylla/pull/7437.
Closes#7688
* github.com:scylladb/scylla:
tests: mutation_source_test for clustering_order_reader_merger
perf: microbenchmark for clustering_order_reader_merger
mutation_reader_test: test clustering_order_reader_merger in memory
test: generalize `random_subset` and move to header
mutation_reader: introduce clustering_order_reader_merger
In issue #7722, it was suggested that we should port Cassandra's CQL unit
tests into our own repository, by translating the Java tests into Python
using the new cql-pytest framework. Cassandra's CQL unit test framework is
orders of magnitude faster than dtest, and in-tree, so Cassandra have been
moving many CQL correctness tests there, and we can also benefit from their
test cases.
In this patch, we take the first step in a long journey:
1. I created a subdirectory, test/cql-pytest/cassandra_tests, where all the
translated Cassandra tests will reside. The structure of this directory
will mirror that of the test/unit/org/apache/cassandra/cql3 directory in
the Cassandra repository.
pytest conveniently looks for test files recursively, so when all the
cql-pytest are run, the cassandra_tests files will be run as well.
As usual, one can also run only a subset of all the tests, e.g.,
"test/cql-pytest/run -vs cassandra_tests" runs only the tests in the
cassandra_tests subdirectory (and its subdirectories).
2. I translated into Python two of the smallest test files -
validation/entities/{TimeuuidTest,DataTypeTest}.java - containing just
three test functions.
The plan is to translate entire Java test files one by one, and to mirror
their original location in our own repository, so it will be easier
to remember what we already translated and what remains to be done.
3. I created a small library, porting.py, of functions which resemble the
common functions of the Java tests (CQLTester.java). These functions aim
to make porting the tests easier. Despite the resemblence, the ported code
is not 100% identical (of course) and some effort is still required in
this porting. As we continue this porting effort, we'll probably need
more of these functions, can can also continue to improve them to reduce
the porting effort.
Refs #7722.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201201192142.2285582-1-nyh@scylladb.com>
This series introduces a `large_data_counters` element to `scylla_metadata` component to explicitly count the number of `large_{partitions,rows,cells}` and `too_many_rows` in the sstable. These are accounted for in the sstable writer whenever the respective large data entry is encountered.
It is taken into account in `large_data_handler::maybe_delete_large_data_entries`, when engaged.
Otherwise, if deleting a legacy sstable that has no such entry in `scylla_metadata`, just revert to using the current method of comparing the sstable's `data_size` to the various thresholds.
Fixes#7668
Test: unit(dev)
Dtest: wide_rows_test.py (in progress)
Closes#7669
* github.com:scylladb/scylla:
docs: sstable-scylla-format: add large_data_stats subcomponent
large_data_handler: maybe_delete_large_data_entries: use sstable large data stats
large_data_handler: maybe_delete_large_data_entries: accept shared_sstable
large_data_handler: maybe_delete_large_data_entries: move out of line
sstables: load large_data_stats from scylla_metadata
sstables: store large_data_stats in scylla_metadata
sstables: writer: keep track of large data stats
large_data_handler: expose methods to get threshold
sstables: kl/writer: never record too many rows
large_data_handler: indicate recording of large data entries
large_data_handler: move constructor out of line
In test/cql-pytest/run.py we have a 200 second timeout to boot Scylla.
I never expected to reach this timeout - it normally takes (in dev
build mode) around 2 seconds, but in one run on Jenkins we did reach it.
It turns out that the code does not recognize this timeout correctly,
thought that Scylla booted correctly - and then failed all the
subtests when they fail to connect to Scylla.
This patch fixes the timeout logic. After the timeout, if Scylla's
CQL port is still not responsive, the test run is failed - without
trying to run many individual tests.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201201150927.2272077-1-nyh@scylladb.com>
When a row was inserted into a table with no regular columns, and no
such row existed in the first place, postimage would not be produced.
Fix this.
Fixes#7716.
Closes#7723
If the sstable has scylla_metadata::large_data_stats use them
to determine whether to delete the corresponding large data records.
Otherwise, defer to the current method of comparing the sstable
data_size to the respective thresholds.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Since the actual deletion if the large data entries
is done in the background, and we don't captures the shared_sstable,
we can safely pass it to maybe_delete_large_data_entries when
deleting the sstable in sstable::unlink and it will be release
as soon as maybe_delete_large_data_entries returns.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Load the large data stats from the scylla_metadata component
if they are present. Otherwise, if we're opening a legacy sstable
that has scylla_metadata_type::LargeDataStats, leave
sstable::_large_data_stats disengaged.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Store the large data statistics in the scylla_metadata component.
These will be retrieved when loading the sstable and be
used for determining whether to delete the corresponding
large data entries upon sstable deletion.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Previously, statement_restrictions::find_idx() would happily return an
index for a non-EQ restriction (because it checked only the column
name, not the operator). This is incorrect: when the selected index
is for a non-EQ restriction, it is impossible to query that index
table.
Fixes#7659.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes#7665
* seastar 010fb0df1e...8b400c7b45 (6):
> append_challenged_posix_file_impl::read_dma: allow iovec to cross _logical_size
> Merge "Extend per task-queue timing statistics" from Pavel E
> tls_test: Create test certs at build time
> cook: upgrade hwloc version
> memory: rate-limit diagnostics messages
> util/log: add rate-limited version of writer version of log()
Currently, if the user provides a cell name with too many components,
we will accept it and construct an invalid clusterin key. This may
result in undefined behavior down the stream.
It was caught by ASAN in a debug build when executing dtest
cql_tests.py:MiscellaneousCQLTester.cql3_insert_thrift_test with
nodetool flush manually added after the write. Triggered during
sstable writing to an MC-format sstable:
seastar::shared_ptr<abstract_type const>::operator*() const at ././seastar/include/seastar/core/shared_ptr.hh:577
sstables::mc::clustering_blocks_input_range::next() const at ./sstables/mx/writer.cc:180
To prevent corrupting the state in this way, we should fail
early. This patch addds validation which will fail thrift requests
which attempt to create invalid clustering keys.
Fixes#7568.
Example error:
Internal server error: Cell name of ks.test has too many components, expected 1 got 2 in 0x0004000000040000017600
Message-Id: <1605550477-24810-1-git-send-email-tgrabiec@scylladb.com>
This patch adds an option to scylla_setup to configure an rsyslog destination.
The monitoring stack has an option to get information from rsyslog it
requires that rsyslog on the scylla machines will send the trace line to
it.
The configuration will be in a Scylla configuration file, so it is safe to run it multiple times.
Fixes#7589
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Closes#7634
* github.com:scylladb/scylla:
dist/common/scripts/scylla_setup: Optionally config rsyslog destination
Adding dist/common/scripts/scylla_rsyslog_setup utility
This patch adds an option to scylla_setup to configure an rsyslog
destination.
The monitoring stack has an option to get information from rsyslog, it
requires that rsyslog on the scylla machines will send the trace line to
it.
If the /etc/rsyslog.d/ directory exists (that means the current system
runs rsyslog) it will ask if to add rsyslog configuration and if yes, it
would run scylla_rsyslog_setup.
Fixes#7589
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
scylla_setup command suggestion does not shows an argument of --io-setup,
because we mistakely stores bool value on it (recognized as 'store_true').
We always need to print '--io-setup X' on the suggestion instead.
Related #7395
* scylla-dev/snapshot_fixes_v1:
raft: ignore append_reply from a peer in SNAPSHOT state
raft: Ignore outdated snapshots
raft: set next_idx to correct value after snapshot transfer
This abstraction is used to merge the output of multiple readers, each
opened for a single partition query, into a non-decreasing stream
of mutation_fragments.
It is similar to `mutation_reader_merger`,
an important difference is that the new merger may select new readers
in the middle of a partition after it already returned some fragments
from that partition. It uses the new `position_reader_queue` abstraction
to select new readers. It doesn't support multi-partition (ring range) queries.
The new merger will be later used when reading from sstable sets created
by TimeWindowCompactionStrategy. This strategy creates many sstables
that are mostly disjoint w.r.t the contained clustering keys, so we can
delay opening sstable readers when querying a partition until after we have
processed all mutation fragments with positions before the keys
contained by these sstables.
Fix#7680 by never using secondary index for multi-column restrictions.
Modify expr::is_supported_by() to handle multi-column correctly.
Tests: unit (dev)
Closes#7699
* github.com:scylladb/scylla:
cql3/expr: Clarify multi-column doesn't use indexing
cql3: Don't use index for multi-column restrictions
test: Add eventually_require_rows
The first two patches in this series are small improvements to cql-pytest to prepare for the third and main patch. This third patch adds cql-pytest tests which check that we fail CQL queries that try to inject non-ASCII and non-UTF-8 strings for ascii and text columns, respectively.
The tests do not discover any unknown bug in Scylla, however, they do show that Scylla is more strict in its definition of "valid UTF-8" compared to Cassandra.
Closes#7719
* github.com:scylladb/scylla:
test/cql-pytest: add tests for validation of inserted strings
test/cql-pytest: add "scylla_only" fixture
test/cpy-pytest: enable experimental features
This change adds tracking of all the CQL errors that can be
raised in response to a CQL message from a client, as described
in the CQL v4 protocol and with Scylla's CDC_WRITE_FAILUREs
included.
Fixes#5859Closes#7604
We have "Conflicts: kernel < 3.10.0-514" on rpm package to make sure
the environment is running newer kernel.
However, user may use non-standard kernel which has different package name,
like kernel-ml or kernel-uek.
On such environment Conflicts tag does not works correctly.
Even the system running with newer kernel, rpm only checks "kernel" package
version number.
To avoid such issue, we need to drop Conflicts tag.
Fixes#7675
This patch adds comprehensive cql-pytest tests for checking the validation
of strings - ASCII or UTF-8 - in CQL. Strings can be represented in CQL
using several methods - a strings can be a string literal as
part of the statement, can be encoded as a blob (0x...), or
can be a binding parameter for a prepared statement, or returned
by user-defined functions - and these tests check all of them.
We already have low-level unit tests for UTF-8 parsing in
test/boost/utf8_test.cc, but the new tests here confirms that we really
call these low-level functions in the correct way. Moreover, since these
are CQL tests, they can also be run against Cassandra, and doing that
demonstrated that Scylla's UTF-8 parsing is *stricter* than Cassandra's -
Scylla's UTF-8 parser rejects the following sequences which Cassandra's
accepts:
1. \xC0\x80 as another non-minimal representation of null. Note that other
non-minimal encodings are rejected by Cassandra, as expected.
2. Characters beyond the official Unicode range (or what Scylla considers
the end of the range).
3. UTF-16 surrogates - these are not considered valid UTF-8, but Cassandra
accepts them, and Scylla does not.
In the future, we should consider whether Scylla is more correct than
Cassandra here (so we're fine), or whether compatibility is more important
than correctness (so this exposed a bug).
The ASCII tests reproduces issue #5421 - that trying to insert a
non-ASCII string into an "ascii" column should produce an error on
insert - not later when fetching the string. This test now passes,
because issue 5421 was already fixed.
These tests did not exposed any bug in Scylla (other than the differences
with Cassandra mentioned a bug), so all of them pass on Scylla. Two
of the tests fail on Cassandra, because Cassandra does not recognize
some invalid UTF-8 (according to Scylla's definition) as invalid.
Refs #5421.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Reject the previously accepted case where the multi-column restriction
applied to just a single column, as it causes a crash downstream. The
user can drop the parentheses to avoid the rejection.
Fixes#7710
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes#7712
"
This series adds maybe_yield called from
cleanup_compaction::get_ranges_for_invalidation
to avoid reactor stalls.
To achieve that, we first extract bool_class can_yield
to utils/maybe_yield.hh, and add a convience helper:
utils::maybe_yield(can_yield) that conditionally calls
seastar::thread::maybe_yield if it can (when called in a
seastar thread).
With that, we add a can_yield parameter to dht::to_partition_ranges
and dht::partition_range::deoverlap (defaults to false), and
use it from cleanup_compaction::get_ranges_for_invalidation,
as the latter is always called from `consume_in_thread`.
Fixes#7674
Test: unit(dev)
"
* tag 'unstall-get_ranges_for_invalidation-v2' of github.com:bhalevy/scylla:
compaction: cleanup_compaction: get_ranges_for_invalidation: add yield points
dht/i_partitioner: to_partition_ranges: support yielding
locator: extract can_yield to utils/maybe_yield.hh
It is used to force remove a node from gossip membership if something
goes wrong.
Note: run the force_remove_endpoint api at the same time on _all_ the
nodes in the cluster in order to prevent the removed nodes come back.
Becasue nodes without running the force_remove_endpoint api cmd can
gossip around the removed node information to other nodes in 2 *
ring_delay (2 * 30 seconds by default) time.
For instance, in a 3 nodes cluster, node 3 is decommissioned, to remove
node 3 from gossip membership prior the auto removal (3 days by
default), run the api cmd on both node 1 and node 2 at the same time.
$ curl -X POST --header "Accept: application/json"
"http://127.0.0.1:10000/gossiper/force_remove_endpoint/127.0.0.3"
$ curl -X POST --header "Accept: application/json"
"http://127.0.0.2:10000/gossiper/force_remove_endpoint/127.0.0.3"
Then run 'nodetool gossipinfo' on all the nodes to check the removed nodes
are not present.
Fixes#2134Closes#5436
This patch adds a fixture "scylla_only" which can be used to mark tests
for Scylla-specific features. These tests are skipped when running against
other CQL servers - like Apache Cassandra.
We recognize Scylla by looking at whether any system table exists with
the name "scylla" in its name - Scylla has several of those, and Cassandra
has none.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
bytes_view is one of the types we want to deserialize from (at least for now),
so we want to be able to pass it to deserialize() after it's transitioned to
FragmentView.
single_fragmented_view is a wrapper implementing FragmentedView for bytes_view.
It's constructed from bytes_view explicitly, because it's typically used in
context where we want to phase linearization (and by extension, bytes_view) out.
This patch introduces FragmentedView - a concept intented as a general-purpose
interface for fragmented buffers.
Another concept made for this purpose, FragmentedRange, already exists in the
codebase. However, it's unwieldy. The iterator-based design of FragmentRange is
harder to implement and requires more code, but more importantly it makes
FragmentRange immutable.
Usually we want to read the beginning of the buffer and pass the rest of it
elsewhere. This is impossible with FragmentRange.
FragmentedView can do everything FragmentRange can do and more, except for
playing nicely with iterator-based collection methods, but those are useless for
fragmented buffers anyway.
disk parsing expects output from recursive listing of GCP
metadata REST call, the method used to do it by default,
but now it requires a boolean flag to run in recursive mode
Fixes#7684Closes#7685
Since f3bcd4d205 ("Merge 'Support SSL Certificate Hot
Reloading' from Calle"), we reload certificates as they are
modified on disk. This uses inotify, which is limited by a
sysctl fs.inotify.max_user_instances, with a default of 128.
This is enough for 64 shards only, if both rpc and cql are
encrypted; above that startup fails.
Increase to 1200, which is enough for 6 instances * 200 shards.
Fixes#7700.
Closes#7701
When we introduced dependencies.conf, we mistakenly added it on rpm as %ghost,
but it should be normal file, should be installed normally on package installation.
Fixes#7703Closes#7704
Fixes#7211
If we start a sharded<> object, then proceed to do potentially
exceptional stuff, we should destroy it on said exception.
Otherwise, the exception propagation will abort on RAII
destruction of the sharded<>. And we get no exception logging.
Closes#7697
* github.com:scylladb/scylla:
redis::service: Shut down sharded<> subobject on startup exception
transport::controller: Shut down distributed object on startup exception
Refs #7211
If we start a sharded<> object, then proceed to do potentially
exceptional stuff, we should destroy it on said exception.
Otherwise, the exception propagation will abort on RAII
destruction of the sharded<>. And we get no exception logging.
Fixes#7211
If we start a sharded<> object, then proceed to do potentially
exceptional stuff, we should destroy it on said exception.
Otherwise, the exception propagation will abort on RAII
destruction of the sharded<>. And we get no exception logging.
The downstream code expects a single-column restriction when using an
index. We could fix it, but we'd still have to filter the rows
fetched from the index table, unlike the code that queries the base
table directly. For instance, WHERE (c1,c2,c3) = (1,2,3) with an
index on c3 can fetch just the right rows from the base table but all
the c3=3 rows from the index table.
Fixes#7680
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
After snapshot is transferred progress::next_idx is set to its index,
but the code uses current snapshot to set it instead of the snapshot
that was transferred. Those can be different snapshots.
After a node becomes leader it needs to do two things: send an append
message to establish its leadership and commit one entry to make sure
all previous entries with smaller terms are committed as well.
Snapshot index cannot be used to check snapshot correctness since some
entries may not be command and thus do not affect snapshot value. Lest
use applied entries count instead.
Move the definition of bool_class can_yield to a standalone
header file and define there a maybe_yield(can_yield) helper.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
In commit 9b28162f88 (repair: Use label
for node ops metrics), we switched to use label for different node
operations. We should use the same description for the same metric name.
Fixes#7681Closes#7682
1. sstables: move `sstable_set` implementations to a separate module
All the implementations were kept in sstables/compaction_strategy.cc
which is quite large even without them. `sstable_set` already had its
own header file, now it gets its own implementation file.
The declarations of implementation classes and interfaces (`sstable_set_impl`,
`bag_sstable_set`, and so on) were also exposed in a header file,
sstable_set_impl.hh, for the purposes of potential unit testing.
2. mutation_reader: move `mutation_reader::forwarding` to flat_mutation_reader.hh
Files which need this definition won't have to include
mutation_reader.hh, only flat_mutation_reader.hh (so the inclusions are
in total smaller; mutation_reader.hh includes flat_mutation_reader.hh).
3. sstables: move sstable reader creation functions to `sstable_set`
Lower level functions such as `create_single_key_sstable_reader`
were made methods of `sstable_set`.
The motivation is that each concrete sstable_set
may decide to use a better sstable reading algorithm specific to the
data structures used by this sstable_set. For this it needs to access
the set's internals.
A nice side effect is that we moved some code out of table.cc
and database.hh which are huge files.
4. sstables: pass `ring_position` to `create_single_key_sstable_reader`
instead of `partition_range`.
It would be best to pass `partition_key` or `decorated_key` here.
However, the implementation of this function needs a `partition_range`
to pass into `sstable_set::select`, and `partition_range` must be
constructed from `ring_position`s. We could create the `ring_position`
internally from the key but that would involve a copy which we want to
avoid.
5. sstable_set: refactor `filter_sstable_for_reader_by_pk`
Introduce a `make_pk_filter` function, which given a ring position,
returns a boolean function (a filter) that given a sstable, tells
whether the sstable may contain rows with the given position.
The logic has been extracted from `filter_sstable_for_reader_by_pk`.
Split from #7437.
Closes#7655
* github.com:scylladb/scylla:
sstable_set: refactor filter_sstable_for_reader_by_pk
sstables: pass ring_position to create_single_key_sstable_reader
sstables: move sstable reader creation functions to `sstable_set`
mutation_reader: move mutation_reader::forwarding to flat_mutation_reader.hh
sstables: move sstable_set implementations to a separate module
This piece of logic was wrong for two unrelated reasons:
1. When fragmented_temporary_buffer::view is constructed from bytes_view,
_current is null. When remove_prefix was used on such view, null pointer
dereference happened.
2. It only worked for the first remove_prefix call. A second call would put a
wrong value in _current_position.
For sstable versions greater or equal than md, the `min_max_column_names`
sstable metadata gives a range of position-in-partitions such that all
clustering rows stored in this sstable have positions in this range.
Partition tombstones in this context are understood as covering the
entire range of clustering keys; thus, if the sstable contains at least
one partition tombstone, the sstable position range is set to be the
range of all clustered rows.
Therefore, by checking that the position range is *not* the range of all
clustered rows we know that the sstable cannot have any partition tombstones.
Closes#7678
It is not legal to fast forward a reader before it enters a partition.
One must ensure that there even is a partition in the first place. For
this one must fetch a `partition_start` fragment.
Closes#7679
Fixes, features needed for testing, snapshot testing.
Free election after partitioning (replication test) .
* https://github.com/alecco/scylla/tree/raft-ale-tests-05e:
raft: replication test: partitioning with leader
raft: replication test: run free election after partitioning
raft: expose fsm tick() to server for testing
raft: expose is_leader() for testing
raft: replication test: test take and load snapshot
raft: fix a bug in leader election
raft: fix default randomized timeout
raft: replication test: fix custom next leader
raft: replication test: custom next leader noop for same
raft: replication test: fix failure detector for disconnected
Introduce a `make_pk_filter` function, which given a ring position,
returns a boolean function (a filter) that given a sstable, tells
whether the sstable may contain rows with the given position.
The logic has been extracted from `filter_sstable_for_reader_by_pk`.
instead of partition_range.
It would be best to pass `partition_key` or `decorated_key` here.
However, the implementation of this function needs a `partition_range`
to pass into `sstable_set::select`, and `partition_range` must be
constructed from `ring_position`s. We could create the `ring_position`
internally from the key but that would involve a copy which we want to
avoid.
Currently, each internal page fetched during aggregating
gets a timeout based on the time the page fetch was started,
rather than the query start time. This means the query can
continue processing long after the client has abandoned it
due to its own timeout, which is based on the query start time.
Fix by establishing the timeout once when the query starts, and
not advancing it.
Test: manual (SELECT count(*) FROM a large table).
Fixes#1175.
Closes#7662
The C and C++ sub-builds were placed in submodule_pool to
reduce concurrency, as they are memory intensive (well, at least
the C++ jobs are), and we choose build concurrency based on memory.
But the other submodules are not memory intensives, and certainly
the packaging jobs are not (and they are single-threaded too).
To allow these simple jobs to utilize multicores more efficiently,
remove them from submodule_pool so they can run in parallel.
Closes#7671
The unified package is quite large (1GB compressed), and it
is the last step in the build so its build time cannot be
parallized with other tasks. Compress it with pigz to take
advantage of multiple cores and speed up the build a little.
Closes#7670
We initially implemented run() and out() functions because we couldn't use
subprocess.run() since we were on Python 3.4.
But since we moved to relocatable python3, we don't need to implement it ourselves.
Why we keep using these functions are, because we needed to set environemnt variable to set PATH.
Since we recently moved away these codes to python thunk, we finally able to
drop run() and out(), switch to subprocess.run().
When partitioning without keeping the existing leader, run an election
without forcing a particular leader.
To force a leader after partitioning, a test can just set it with new_leader{X}.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
For tests to advance servers they need to invoke tick().
This is needed to advance free elections.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Through configuration trigger automatic snapshotting.
For now, handle expected log index within the test's state machine and
pass it with snapshot_value (within the test file).
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
If a server responds favourably to RequestVote RPC, it should
reset its election timer, otherwise it has very high chances of becoming
a candidate with an even newer term, despite successful elections.
A candidate with a term larger than the leader rejects AppendEntries
RPCs and can not become a leader itself (because of protection
against of disruptive leaders), so is stuck in this state.
Range after election timeout should start at +1.
This matches existing update_current_term() code adding dist(1, 2*n).
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Adjustments after changes due to free election in partitioning and changes in
the code.
Elapse previous leader after isolating it.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
scylla_rsyslog_setup adds a configuration file to rsyslog to forward the
trances to a remote server.
It will override any existing file, so it is safe to run it multiple
times.
It takes an ip, or ip and port from the users for that configuration, if
no port is provided, the default port of Scylla-Monitoring promtail is
used.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Follow-up to https://github.com/scylladb/scylla/pull/6916.
- Fixes wrong usage of `resource_manager::prepare_per_device_limits`,
- Improves locking in `resource_manager` so that it is more safe to call its methods concurrently,
- Adds comments around `resource_manager::register_manager` so that it's more clear what this method does and why.
Closes#7660
* github.com:scylladb/scylla:
hints/resource_manager: add comments to register_manager
hints/resource_manager: fix indentation
hints/resource_manager: improve mutual exclusion
hints/resource_manager: correct prepare_per_device_limits usage
The "ninja dist-server-tar" command is a full replacement for
"build_reloc.sh" script. We release engineering infrastructure has been
switched to ninja, so let's remove "build_reloc.sh" as obsolete.
Now that CDC is GA, it should be enabled in all the tests by default.
To achieve that the PR adds a special db::config::add_cdc_extension()
helper which is used in cql_test_envm to make sure CDC is usable in
all the tests that use cql_test_env.m As a result, cdc_tests can be
simplified.
Finally, some trailing whitespaces are removed from cdc_tests.
Tests: unit(dev)
Closes#7657
* github.com:scylladb/scylla:
cdc: Remove trailing whitespaces from cdc_tests
cdc: Remove mk_cdc_test_config from tests
config: Add add_cdc_extension function for testing
cdc: Add missing includes to cdc_extension.hh
The patch which introduces build-dependent testing
has a regression: it quietly filters out all tests
which are not part of ninja output. Since ninja
doesn't build any CQL tests (including CQL-pytest),
all such tests were quietly disabled.
Fix the regression by only doing the filtering
in unit and boost test suites.
test: dev (unit), dev + --build-raft
Message-Id: <20201119224008.185250-1-kostja@scylladb.com>
Some systems (at least, Centos 7, aarch64) block the membarrier()
syscall via seccomp. This causes Scylla or unit tests to burn cpu
instead of sleeping when there is nothing to do.
Fix by instructing podman/docker not to block any syscalls. I
tested this with podman, and it appears [1] to be supported on
docker.
[1] https://docs.docker.com/engine/security/seccomp/#run-without-the-default-seccomp-profileCloses#7661
Lower level functions such as `create_single_key_sstable_reader`
were made methods of `sstable_set`.
The motivation is that each concrete sstable_set
may decide to use a better sstable reading algorithm specific to the
data structures used by this sstable_set. For this it needs to access
the set's internals.
A nice side effect is that we moved some code out of table.cc
and database.hh which are huge files.
Files which need this definition won't have to include
mutation_reader.hh, only flat_mutation_reader.hh (so the inclusions are
in total smaller; mutation_reader.hh includes flat_mutation_reader.hh).
All the implementations were kept in sstables/compaction_strategy.cc
which is quite large even without them. `sstable_set` already had its
own header file, now it gets its own implementation file.
The declarations of implementation classes and interfaces (`sstable_set_impl`,
`bag_sstable_set`, and so on) were also exposed in a header file,
sstable_set_impl.hh, for the purposes of potential unit testing.
"
The qctx is global object that references query processor and
database to let the rest of the code query system keyspace.
As the first step of de-globalizing it -- remove the database
reference from it. After the set the qctx remains a simple
wrapper over the query processor (which is already de-globalized)
and the query processor in turn is mostly needed only to parse
the query string into prepared statement only. This, in turn,
makes it possible to remove the qctx later by parsing the
query strings on boot and carrying _them_ around, not the qctx
itself.
tests: unit(dev), dtest(simple_cluster_driver_test:dev), manual start/stop
"
* 'br-remove-database-from-qctx' of https://github.com/xemul/scylla:
query-context: Remove database from qctx
schema-tables: Use query processor referece in save_system(_keyspace)?_schema
system-keyspace: Rewrite force_blocking_flush
system-keyspace: Use cluster_name string in check_health
system-keyspace: Use db::config in setup_version
query-context: Kill global helpers
test: Use cql_test_env::evecute_cql instead of qctx version
code: Use qctx::evecute_cql methods, not global ones
system-keyspace: Do not call minimal_setup for the 2nd time
system-keyspace: Fix indentation after previous patch
system-keyspace: Do not do invoke_on_all by hands
system-keyspace: Remove dead code
The save_system_schema and save_system_keyspace_schema are both
called on start and can the needed get query processor reference
from arguments.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method is called after query_processor::execute_internal
to flush the cf. Encapsulating this flush inside database and
getting the database from query_processor lets removing
database reference from global qctx object.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The check_help needs global qctx to get db.config.cluster_name,
which is already available at the caller side.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is the beginning of de-globalizing global qctx thing.
The setup_version() needs global qctx to get config from.
It's possible to get the config from the caller instead.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Similar to previous patch, but for tests. Since cql_test_env
does't have qctx on board, the patch makes one step forward
and calls what is called by qctx::execute_cql.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are global db::execute_cql() helpers that just forward
the args into qctx::execute_cql(). The former are going away,
so patch all callers to use qctx themselves.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
THe system_keyspace::minimal_setup is called by main.cc by hands
already, some steps before the regular ::setup().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The cache_truncation_record needs to run cf.cache_truncation_record
on each shard's DB, so the invoke_on_all can be used.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This commit causes start, stop and register_manager methods of the
resource_manager to be serialized with respect to each other using the
_operation_lock.
Those function modify internal state, so it's best if they are
protected with a semaphore. Additionally, those function are not going
to be used frequently, therefore it's perfectly fine to protect them in
such a coarse manner.
Now, space_watchdog has a dedicated lock for serializing its on_timer
logic with resource_manager::register_manager. The reason for separate
lock is that resource_manager::stop cannot use the same lock as the
space_watchdog - otherwise a situation could occur in which
space_watchdog waits for semaphore units held by
resource_manager::stop(), and resource_manager::stop() waits until the
space_watchdog stops its asynchronous event loop.
The resource_manager::prepare_per_device_limits function calculates disk
quota for registered hints managers, and creates an association map:
from a storage device id to those hints manager which store hints on
that device (_per_device_limits_map)
This function was used with an assumption that it is idempotent - which
is a wrong assumption. In resource_manager::register_manager, if the
resource_manager is already started, prepare_per_device_limits would be
called, and those hints managers which were previously added to the
_per_device_limits_map would be added again. This would cause the space
used by those managers to be calculated twice, which would artificially
lower the limit which we impose on the space hints are allowed to occupy
on disk.
This patch fixes this problem by changing the prepare_per_device_limits
function to operate on a hints manager passed by argument. Now, we make
sure that this function is called on each hints manager only once.
Now that CDC is GA and enabled by default, there's no longer a need
for a specific config in CDC tests.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
It is now called `merging_reader`, and is used to change a `FragmentProducer`
that produces a non-decreasing stream of mutation fragments batches into
a `flat_mutation_reader` producing a non-decreasing stream of fragments.
The resulting stream of fragments is increasing except for places where
we encounter range tombstones (multiple range tombstones may be produced
with the same position_in_partition)
`merging_reader` is a simple adapter over `mutation_fragment_merger`.
The old `combined_mutation_reader` is simply a specialization of `merging_reader`
where the used `FragmentProducer` is `mutation_reader_merger`, an abstraction that
merges the output of multiple readers into one non-decreasing stream of fragment
batches.
There is no separate class for `combined_mutation_reader` now. Instead,
`make_combined_reader` works directly with `merging_reader`.
The PR also improves some comments.
Split from https://github.com/scylladb/scylla/pull/7437.
Closes#7656
* github.com:scylladb/scylla:
mutation_reader: `generalize combined_mutation_reader`
mutation_reader: fix description of mutation_fragment_merger
After the concept of the seed nodes was removed we can distinguish
whether the node is the first node in the cluster or not.
Thanks to this we can avoid adding delay to the timestamp of the first
CDC generation.
The delay is added to the timestamp to make sure that all the nodes
in the cluster manage to learn about it before the timestamp becomes in the past.
It is safe to not add the delay for the first node because we know it's the only node
in the cluster and no one else has to learn about the timestamp.
Fixes#7645
Tests: unit(dev)
Closes#7654
* github.com:scylladb/scylla:
cdc: Don't add delay to the timestamp of the first generation
cdc: Change for_testing to add_delay in make_new_cdc_generation
It is now called `merging_reader`, and is used to change a `FragmentProducer`
that produces a non-decreasing stream of mutation fragments batches into
a `flat_mutation_reader` producing a non-decreasing stream of fragments.
The resulting stream of fragments is increasing except for places where
we encounter range tombstones (multiple range tombstones may be produced
with the same position_in_partition)
`merging_reader` is a simple adapter over `mutation_fragment_merger`.
The old `combined_mutation_reader` is simply a specialization of `merging_reader`
where the used `FragmentProducer` is `mutation_reader_merger`, an abstraction that
merges the output of multiple readers into one non-decreasing stream of fragment
batches.
There is no separate class for `combined_mutation_reader` now. Instead,
`make_combined_reader` works directly with `merging_reader`.
"
We've recently seen failures in this unit test as follows:
```
test/boost/network_topology_strategy_test.cc(0): Entering test case "testCalculateEndpoints"
unknown location(0): fatal error: in "testCalculateEndpoints": std::out_of_range: _Map_base::at
./seastar/src/testing/seastar_test.cc(43): last checkpoint
test/boost/network_topology_strategy_test.cc(0): Leaving test case "testCalculateEndpoints"; testing time: 15192us
test/boost/network_topology_strategy_test.cc(0): Entering test case "test_invalid_dcs"
network_topology_strategy_test: ./seastar/include/seastar/core/future.hh:634: void seastar::future_state<seastar::internal::monostate>::set(A &&...) [T = seastar::internal::monostate, A = <>]: Assertion `_u.st == state::future' failed.
Aborting on shard 0.
```
This series fixes 2 issues in this test:
1. The core issue where std::out_of_range exception
is not handled in calculate_natural_endpoints().
2. A secondary issue where the static `snitch_inst` isn't
stopped when the first exception is hit, failing
the next time the snitch is started, as it wasn't
stopped properly.
Test: network_topology_strategy_test(release)
"
* tag 'nts_test-harden-calculate_natural_endpoints-v1' of github.com:bhalevy/scylla:
test: network_topology_strategy_test: has_sufficient_replicas: handle empty dc endpoints case
test: network_topology_strategy_test: fixup indentation
test: network_topology_strategy_test: always stop_snitch after create_snitch
After the concept of the seed nodes was removed we can distinguish
whether the node is the first node in the cluster or not.
Thanks to this we can avoid adding delay to the timestamp of the first
CDC generation.
Fixes#7645
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
The meaning of the parameter changes from defining whether the function
is called in testing environment to deciding whether a delay should be
added to a timestamp of a newly created CDC generation.
This is a preparation for improvement in the following patch that does
not always add delay to every node but only to non-first node.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Asias He reports that git on Windows filesystem is unhappy about the
colon character (":") present in dist-check files:
$ git reset --hard origin/master
error: invalid path 'tools/testing/dist-check/docker.io/centos:7.sh'
fatal: Could not reset index file to revision 'origin/master'.
Rename the script to use a dash instead.
Closes#7648
Current tests uses hash state machine that checks for specific order of
entries application. The order is not always guaranty though.
Backpressure may delay some entires to be submitted and when they are
released together they may be reordered in the debug mode due to
SEASTAR_SHUFFLE_TASK_QUEUE. Introduce an ability for test to choose
state machine type and implement commutative state machine that does
not care about ordering.
To prevent the log to take too much memory introduce a mechanism that
limits the log to a certain size. If the size is reached no new log
entries can be submitted until previous entries are committed and
snapshotted.
If scylla_raid_setup script called without --raiddev argument
then try to use any of /dev/md[0-9] devices instead of only
one /dev/md0. Do it in this way because on Ubuntu 20.04
/dev/md0 used by OS already.
Closes#7628
gcc fails to compile current master like this
In file included from ./service/client_state.hh:44,
from ./cql3/cql_statement.hh:44,
from ./cql3/statements/prepared_statement.hh:47,
from ./cql3/statements/raw/select_statement.hh:45,
from build/dev/gen/cql3/CqlParser.hpp:64,
from build/dev/gen/cql3/CqlParser.cpp:44:
./auth/service.hh:188:21: error: declaration of ‘const auth::resource& auth::command_desc::resource’ changes meaning of ‘resource’ [-fpermissive]
188 | const resource& resource; ///< Resource impacted by this command.
| ^~~~~~~~
In file included from ./auth/authenticator.hh:57,
from ./auth/service.hh:33,
from ./service/client_state.hh:44,
from ./cql3/cql_statement.hh:44,
from ./cql3/statements/prepared_statement.hh:47,
from ./cql3/statements/raw/select_statement.hh:45,
from build/dev/gen/cql3/CqlParser.hpp:64,
from build/dev/gen/cql3/CqlParser.cpp:44:
./auth/resource.hh:98:7: note: ‘resource’ declared here as ‘class auth::resource’
98 | class resource final {
| ^~~~~~~~
clang doesn't fail
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20201118155905.14447-1-xemul@scylladb.com>
If a list of target endpoints for sending view updates contains
duplicates, it results in benign (but annoying) broken promise
errors happening due to duplicated write response handlers being
instantiated for a single endpoint.
In order to avoid such errors, target remote endpoints are deduplicated
from the list of pending endpoints.
A similar issue (#5459) solved the case for duplicated local endpoints,
but that didn't solve the general case.
Fixes#7572Closes#7641
This PR allows changing the hinted_handoff_enabled option in runtime, either by modifying and reloading YAML configuration, or through HTTP API.
This PR also introduces an important change in semantics of hinted_handoff_enabled:
- Previously, hinted_handoff_enabled controlled whether _both writing and sending_ hints is allowed at all, or to particular DCs,
- Now, hinted_handoff_enabled only controls whether _writing hints_ is enabled. Sending hints from disk is now always enabled.
Fixes: #5634
Tests:
- unit(dev) for each commit of the PR
- unit(debug) for the last commit of the PR
Closes#6916
* github.com:scylladb/scylla:
api: allow changing hinted handoff configuration
storage_proxy: fix wrong return type in swagger
hints_manager: implement change_host_filter
storage_proxy: always create hints manager
config: plug in hints::host_filter object into configuration
db/hints: introduce host_filter
hints/resource_manager: allow registering managers after start
hints: introduce db::hints::directory_initializer
directories.cc: prepare for use outside main.cc
Fixes#7064
Iff broadcast address is set to ipv6 from main (meaning prefer
ipv6), determine the "public" ipv6 address (which should be
the same, but might not be), via aws metadata query.
Closes#7633
available_memory is used to seed many caches and controllers. Usually
it's detected from the environment, but unit tests configure it
on their own with fake values. If they forget, then the undefined
behavior sanitizer will kick in in random places (see 8aa842614a
("test: gossip_test: configure database memory allocation correctly")
for an example.
Prevent this early by asserting that available_memory is nonzero.
Closes#7612
std::iterator is deprecated since C++17 so define all the required iterator_traits directly and stop using std::iterator at all.
More context: https://www.fluentcpp.com/2018/05/08/std-iterator-deprecated
Tests: unit(dev)
Closes#7635
* github.com:scylladb/scylla:
log_heap: Remove std::iterator from hist_iterator
types: Remove std::iterator from tuple_deserializing_iterator
types: Remove std::iterator from listlike_partial_deserializing_iterator
sstables: remove std::iterator from const_iterator
token_metadata: Remove std::iterator from tokens_iterator
size_estimates_virtual_reader: Remove std::iterator
token_metadata: Remove std::iterator from tokens_iterator_impl
counters: Remove std::iterator from iterators
compound_compat: Remove std::iterator from iterators
compound: Remove std::iterator from iterator
clustering_interval_set: Remove std::iterator from position_range_iterator
cdc: Remove std::iterator from collection_iterator
cartesian_product: Remove std::iterator from iterator
bytes_ostream: Remove std::iterator from fragment_iterator
We saw this intermittent failure in testCalculateEndpoints:
```
unknown location(0): fatal error: in "testCalculateEndpoints": std::out_of_range: _Map_base::at
```
It turns out that there are no endpoints associated with the dc passed
to has_sufficient_replicas in the `all_endpoints` map.
Handle this case by returning true.
The dc is still required to appear in `dc_replicas`,
so if it's not found there, fail the test gracefully.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently stop_snitch is not called if the test fails on exception.
This causes a failure in create_snitch where snitch_inst fails to start
since it wasn't stopped earlier.
For example:
```
test/boost/network_topology_strategy_test.cc(0): Entering test case "testCalculateEndpoints"
unknown location(0): fatal error: in "testCalculateEndpoints": std::out_of_range: _Map_base::at
./seastar/src/testing/seastar_test.cc(43): last checkpoint
test/boost/network_topology_strategy_test.cc(0): Leaving test case "testCalculateEndpoints"; testing time: 15192us
test/boost/network_topology_strategy_test.cc(0): Entering test case "test_invalid_dcs"
network_topology_strategy_test: ./seastar/include/seastar/core/future.hh:634: void seastar::future_state<seastar::internal::monostate>::set(A &&...) [T = seastar::internal::monostate, A = <>]: Assertion `_u.st == state::future' failed.
Aborting on shard 0.
Backtrace:
0x0000000002825e94
0x000000000282ffa9
0x00007fd065f971df
/lib64/libc.so.6+0x000000000003dbc4
/lib64/libc.so.6+0x00000000000268a3
/lib64/libc.so.6+0x0000000000026788
/lib64/libc.so.6+0x0000000000035fc5
0x0000000000b484cf
0x0000000002a7c69f
0x0000000002a7c62f
0x0000000000b47b9e
0x0000000002595da2
0x0000000002595913
0x0000000002a83a31
```
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
scoped_critical_alloc_section was recently introduced to replace
disable_failure_guard and made the old class deprecated.
This patch replaces all occurences of disable_failure_guard with
scoped_critical_alloc_section.
Without this patch the build prints many warnings like:
warning: 'disable_failure_guard' is deprecated: Use scoped_critical_section instead [-Wdeprecated-declarations]
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <ca2a91aaf48b0f6ed762a6aa687e6ac5e936355d.1605621284.git.piotr@scylladb.com>
As requested in #7057, allow certain alterations of system_auth tables. Potentially destructive alterations are still rejected.
Tests: unit (dev)
Closes#7606
* github.com:scylladb/scylla:
auth: Permit ALTER options on system_auth tables
auth: Add command_desc
auth: Add tests for resource protections
And since now there is no danger of them filling the logs, the log-level
is promoted to info, so users can see the diagnostics messages by
default.
The rate-limit chosen is 1/30s.
Refs: #7398
Tests: manual
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20201117091253.238739-1-bdenes@scylladb.com>
This commit makes it possible to change hints manager's configuration at
runtime through HTTP API.
To preserve backwards compatibility, we keep the old behavior of not
creating and checking hints directories if they are not enabled at
startup. Instead, hint directories are lazily initialized when hints are
enabled for the first time through HTTP API.
The GET `hinted_handoff_enabled_by_dc` endpoint had an incorrect return
type specified. Although it does not have an implementation, yet, it was
supposed to return a list of strings with DC names for which generating
hints is enabled - not a list of string pairs. Such return type is
expected by the JMX.
Implements a function which is responsible for changing hints manager
configuration while it is running.
It first starts new endpoint managers for endpoints which weren't
allowed by previous filter but are now, and then stops endpoint managers
which are rejected by the new filter.
The function is blocking and waits until all relevant ep managers are
started or stopped.
Now, the hints manager object for regular hints is always created, even
if hints are disabled in configuration. Please note that the behavior of
hints will be unchanged - no hints will be sent when they are disabled.
The intent of this change is to make enabling and disabling hints in
runtime easier to implement.
Uses db::hints::host_filter as the type of hinted_handoff_enabled
configuration option.
Previously, hinted_handoff_enabled used to be a string option, and it
was parsed later in a separate function during startup. The function
returned a std::optional<std::unordered_set<sstring>>, whose meaning in
the context of hints is rather enigmatic for an observer not familiar
with hints.
Now, hinted_handoff_enabled has type of db::hints::host_filter, and it
is plugged into the config parsing framework, so there is no need for
later post-processing.
Adds a db::hints::host_filter structure, which determines if generating
hints towards a given target is currently allowed. It supports
serialization and deserialization between the hinted_handoff_enabled
configuration/cli option.
This patch only introduces this structure, but does not make other code
use it. It will be plugged into the configuration architecture in the
following commits.
This change modifies db::hints::resource_manager so that it is now
possible to add hints::managers after it was started.
This change will make it possible to register the regular hints manager
later in runtime, if it wasn't enabled at boot time.
Introduces a db::hints::directory_initializer object, which encapsulates
the logic of initializing directories for hints (creating/validating
directories, segment rebalancing). It will be useful for lazy
initialization of hints manager.
Currently, the `directories` class is used exclusively during
initialization, in the main() function. This commit refactors this class
so that it is possible to use it to initialize directories much later
after startup.
The intent of this change is to make it possible for hints manager to
create directories for hints lazily. Currently, when Scylla is booted
with hinted handoff disabled, the `hints_directory` config parameter is
ignored and directories for hints are neither created nor verified.
Because we would like to preserve this behavior and introduce
possibility to switch hinted handoff on in runtime, the hints
directories will have to be created lazily the first time hinted handoff
is enabled.
* seastar 043ecec7...c861dbfb (3):
> Merge "memory: allow configuring when to dump memory diagnostics on allocation failures" from Botond
> perftune.py: support kvm-clock on tune-clock
> execution_stage: inheriting_concrete_execution_stage: add get_stats()
These alterations cannot break the database irreparably, so allow
them.
Expand command_desc as required.
Add a type (rather than command_desc) parameter to
has_column_family_access() to minimize code changes.
Fixes#7057
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
When a node bootstraps or upgrades from a pre-CDC version, it creates a
new CDC generation, writes it to a distributed table
(system_distributed.cdc_generation_descriptions), and starts gossiping
its timestamp. When other nodes see the timestamp being gossiped, they
retrieve the generation from the table.
The bootstrapping/upgrading node therefore assumes that the generation
is made durable and other nodes will be able to retrieve it from the
table. This assumption could be invalidated if periodic commitlog mode
was used: replicas would acknowledge the write and then immediately
crash, losing the write if they were unlucky (i.e. commitlog wasn't
synced to disk before the write was acknowledged).
This commit enforces all writes to the generations table to be
synced to commitlog immediately. It does not matter for performance as
these writes are very rare.
Fixes https://github.com/scylladb/scylla/issues/7610.
Closes#7619
An entry can be snapshotted, before the outgoing message is sent, so the
message has to hold to it to avoid use after free.
Message-Id: <20201116113323.GA1024423@scylladb.com>
Materialized view updates participate in a retirement program,
which makes sure that they are immediately taken down once their
target node is down, without having to wait for timeout (since
views are a background operation and it's wasteful to wait in the
background for minutes). However, this mechanism has very delicate
lifetime issues, and it already caused problems more than once,
most recently in #5459.
In order to make another bug in this area less likely, the two
implementations of the mechanism, in on_down() and drain_on_shutdown(),
are unified.
Possibly refs #7572Closes#7624
Commit e5be3352cf ("database, streaming, messaging: drop
streaming memtables") removed streaming memtables; this removes
the mechanisms to synchronize them: _streaming_flush_gate and
_streaming_flush_phaser. The memory manager for streaming is removed,
and its 10% reserve is evenly distributed between memtables and
general use (e.g. cache).
Note that _streaming_flush_phaser and _streaming_flush_date are
no longer used to syncrhonize anything - the gate is only used
to protect the phaser, and the phaser isn't used for anything.
Closes#7454
DEBIAN_FRONTEND environment variable was added just for prevent opening
dialog when running 'apt-get install mdadm', no other program depends on it.
So we can move it inside of apt_install()/apt_uninstall() and drop scylla_env,
since we don't have any other environment variables.
To passing the variable, added env argument on run()/out().
"
This is a follow-up on 052a8d036d
"Avoid stalls in token_metadata and replication strategy"
The added mutate_token_metadata helper combines:
- with_token_metadata_lock
- get_mutable_token_metadata_ptr
- replicate_to_all_cores
Test: unit(dev)
"
* tag 'mutate_token_metadata-v1' of github.com:bhalevy/scylla:
storage_service: fixup indentation
storage_service: mutate_token_metadata: do replicate_to_all_cores
storage_service: add mutate_token_metadata helper
Replicate the mutated token_metadata to all cores on success.
This moves replication out of update_pending_ranges(mutable_token_metadata_ptr, sstring),
so add explicit call to replicate_to_all_cores where it is called outside
of mutate_token_metadata.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Replace a repeating pattern of:
with_token_metadata_lock([] {
return get_mutable_token_metadata_ptr([] (mutable_token_metadata_ptr tmptr) {
// mutate token_metadata via tmptr
});
});
With a call to mutate_token_metadata that does both
and calls the function with then mutable_token_metadata_ptr.
A following patch will also move the replication to all
cores to mutate_token_metadata.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The "dist" target fails as follows:
$ ./tools/toolchain/dbuild ninja dist
ninja: error: 'build/dev/scylla-unified-package-..tar.gz', needed by 'dist-unified-tar', missing and no known rule to make it
Fix two issues:
- Fix Python variable references to "scylla_version" and
"scylla_release", broken by commit bec0c15ee9 ("configure.py: Add
version to unified tarball filename"). The breakage went unnoticed
because ninja default target does not call into dist...
- Remove dependencies to build/<mode>/scylla-unified-package.tar.gz. The
file is now in build/<mode>/dist/tar/ directory and contains version
and release in the filename.
Message-Id: <20201113110706.150533-1-penberg@scylladb.com>
To test handling of connectivity issues and recovery add support for
disconnecting servers.
This is not full partitioning yet as it doesn't allow connectivity
across the disconnected servers (having multiple active partitions.
* https://github.com/alecco/scylla/pull/new/raft-ale-partition-simple-v3:
raft: replication test: connectivity partitioning support
raft: replication test: block rpc calls to disconnected servers
raft: replication test: add is_disconnected helper
raft: replication test: rename global variable
raft: replication test: relocate global connection state map
We currently keep a copy of scylla-package.tar.gz in "build/<mode>" for
compatibility. However, we've long since switched our CI system over to
the new location, so let's remove the duplicate and use the one from
"build/<mode>/dist/tar" instead.
Message-Id: <20201113075146.67265-1-penberg@scylladb.com>
Add to the DynamoDB compatibility document, docs/alternator/compatibility.md,
a mention that Alternator streams are still an experimental features, and
how to turn it on (at this point CDC is no longer an experimental feature,
but Alternator Streams are).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201112184436.940497-1-nyh@scylladb.com>
Drop the adjective "experimental" used to describe Alternator in
docs/alternator/getting-started.md.
In Scylla, the word "experimental" carries a specific meaning - no support
for upgrades, not enough QA, not ready for general use) and Alternator is
no longer experimental in that sense.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201112185249.941484-1-nyh@scylladb.com>
This adds some test cases for ALTER KEYSPACE:
- ALTER KEYSPACE happy path
- ALTER KEYSPACE wit invalid options
- ALTER KEYSPACE for non-existing keyspace
- CREATE and ALTER KEYSPACE using NetworkTopologyStrategy with
non-existing data center in configuration, which triggers a bug in
Scylla:
https://github.com/scylladb/scylla/issues/7595
Message-Id: <20201112073110.39475-1-penberg@scylladb.com>
Introduce partition update command consisting of nodes still seeing
each other. Nodes not included are disconnected from everything else.
If the previous leader is not part of the new partition, the first node
specified in the partition will become leader.
For other nodes to accept a new leader it has to have a committed log.
For example, if the desired leader is being re-connected and it missed
entries other nodes saw it will not win the election. Example A B C:
partition{A,C},entries{2},partition{B,C}
In this case node C won't accept B as a new leader as it's missing 2
entries.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
In main.cc, we spawn a future which starts the hints manager, but we
don't wait for it to complete. This can have the following consequences:
- The hints manager does some asynchronous operations during startup,
so it can take some time to start. If it is started after we start
handling requests, and we admit some requests which would result in
hints being generated, those hints will be dropped instead because we
check if hints manager is started before writing them.
- Initialization of hints manager may fail, and Scylla won't be stopped
because of it (e.g. we don't have permissions to create hints
directories). The consequence of this is that hints manager won't be
started, and hints will be dropped instead of being written. This may
affect both regular hints manager, and the view hints manager.
This commit causes us to wait until hints manager start and see if there
were any errors during initialization.
Fixes#7598Closes#7599
CDC is ready to be a non-experimental feature so remove the experimental flag for it.
Also, guard Alternator Streams with their own experimental flag. Previously, they were using CDC experimental flag as they depend on CDC.
Tests: unit(dev)
Closes#7539
* github.com:scylladb/scylla:
alternator: guard streams with an experimental flag
Mark CDC as GA
cdc: Make it possible for CDC generation creation to fail
Add new alternator-streams experimental flag for
alternator streams control.
CDC becomes GA and won't be guarded by an experimental flag any more.
Alternator Streams stay experimental so now they need to be controlled
by their own experimental flag.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Following patch enables CDC by default and this means CDC has to work
will all the clusters now.
There is a problematic case when existing cluster with no CDC support
is stopped, all the binaries are updated to newer version with
CDC enabled by default. In such case, nodes know that they are already
members of the cluster but they can't find any CDC generation so they
will try to create one. This creation may fail due to lack of QUORUM
for the write.
Before this patch such situation would lead to node failing to start.
After the change, the node will start but CDC generation will be
missing. This will mean CDC won't be able to work on such cluster before
nodetool checkAndRepairCdcStreams is run to fix the CDC generation.
We still fail to bootstrap if the creation of CDC generation fails.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
When a missing base column happens to be named `idx_token`,
an additional helper message is printed in logs.
This additional message does not need to have `error` severity,
since the previous, generic message is already marked as `error`.
This patch simply makes it easier to write tests, because in case
this error is expected, only one message needs to be explicitly
ignored instead of two.
Closes#7597
This miniseries adds metrics which can help the users detect potential overloads:
* due to having too many in-flight hints
* due to exceeding the capacity of the read admission queue, on replica side
Closes#7584
* github.com:scylladb/scylla:
reader_concurrency_semaphore: add metrics for shed reads
storage_proxy: add metrics for too many in-flight hints failures
If interposer consumer is enabled, partition filtering will be done by the
consumer instead, but that's not possible because only the producer is able
to skip to the next partition if the current one is filtered out, so scylla
crashes when that happens with a bad function call in queue_reader.
This is a regression which started here: 55a8b6e3c9
To fix this problem, let's make sure that partition filtering will only
happen on the producer side.
Fixes#7590.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20201111221513.312283-1-raphaelsc@scylladb.com>
"
This series is a rebased version of 3 patchsets that were sent
separately before:
1. [PATCH v4 00/17] Cleanup storage_service::update_pending_ranges et al.
This patchset cleansup service/storage_service use of
update_pending_ranges and replicate_to_all_cores.
It also moves some functionality from gossiping_property_file_snitch::reload_configuration
into a new method - storage_service::update_topology.
This prepares storage_service for using a shared ptr to token_metadata,
updating a copy out of line under a semaphore that serializes writers,
and eventually replicating to updated copy to all shards and releasing
the lock. This is a follow up to #7044.
2. [PATCH v8 00/20] token_metadata versioned shared ptr
Rather than keeping references on token_metadata use a shared_token_metadata
containing a lw_shared_ptr<token_metadata> (a.k.a token_metadata_ptr)
to keep track of the token_metadata.
Get token_metadata_ptr for a read-only snapshot of the token_metadata
or clone one for a mutable snapshot that is later used to safely update
the base versioned_shared_object.
token_metadata_ptr is used to modify token_metadata out of line, possibly with
multiple calls, that could be preeempted in-between so that readers can keep a consistent
snapshot of it while writers prepare an updated version.
Introduce a token_metadata_lock used to serialize mutators of token_metadata_ptr.
It's taken by the storage_service before cloning token_metadata_ptr and held
until the updated copy is replicated on all shards.
In addition, this series introduces token_metadata::clone_async() method
to copy the tokne_metadata class using a asynchronous function with
continuations to avoid reactor stalls as seen in #7220.
Fixes#7044
3. [PATCH v3 00/17] Avoid stalls in token_metadata and replication strategy
This series uses the shared_token_metadata infrastructure.
First patches in the series deal wth cloning token_metadata
using continuations to allow preemption while cloning (See #7220).
Then, the rest of the series makes sure to always run
`update_pending_ranges` and `calculate_pending_ranges_for_*` in a thread,
it then adds a `can_yield` parameter to the token_metadata and abstract_replication_strategy
`get_pending_ranges` and friends, and finally it adds `maybe_yield` calls
in potentially long loops.
Fixes#7313Fixes#7220
Test: unit (dev)
Dtest: gating(dev)
"
* tag 'replication_strategy_can_yield-v4' of github.com:bhalevy/scylla: (54 commits)
token_metadata_impl: set_pending_ranges: add can_yield_param
abstract_replication_strategy: get rid of get_ranges_in_thread
repair: call get_ranges_in_thread where possible
abstract_replication_strategy: add can_yield param to get_pending_ranges and friends
abstract_replication_strategy: define can_yield bool_class
token_metadata_impl: calculate_pending_ranges_for_* reindent
token_metadata_impl: calculate_pending_ranges_for_* pass new_pending_ranges by ref
token_metadata_impl: calculate_pending_ranges_for_* call in thread
token_metadata: update_pending_ranges: create seastar thread
abstract_replication_strategy: add get_address_ranges method for specific endpoint
token_metadata_impl: clone_after_all_left: sort tokens only once
token_metadata: futurize clone_after_all_left
token_metadata: futurize clone_only_token_map
token_metadata: use mutable_token_metadata_ptr in calculate_pending_ranges_for_*
repair: replace_with_repair: use token_metadata::clone_async
storage_service: reindent token_metadata blocks
token_metadata: add clone_async
abstract_replication_strategy: accept a token_metadata_ptr in get_pending_address_ranges methods
abstract_replication_strategy: accept a token_metadata_ptr in get_ranges methods
boot_strapper: get_*_tokens: use token_metadata_ptr
...
Add a test that better clarifies what StartingSequenceNumber returned by
DescribeStream really guarantees (this question was raised in a review
of a different patch). The main thing we can guarantee is that reading a
shard from that position returns all the information in that shard -
similar to TRIM_HORIZON. This test verifies this, and it passes.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201112081250.862119-1-nyh@scylladb.com>
When the admission queue capacity reaches its limits, excessive
reads are shed in order to avoid overload. Each such operation
now bumps the metrics, which can help the user judge if a replica
is overloaded.
This change adds metrics for counting request message types
listed in the CQL v.4 spec under section 4.1
(https://github.com/apache/cassandra/blob/trunk/doc/native_protocol_v4.spec).
To organize things properly, we introduce a new cql_server::transport_stats
object type for aggregating the message and server statistics.
Fixes#4888Closes#7574
Rewrite in a more readable way that will later allow us to split the WHERE expression in two: a storage-reading part and a post-read filtering part.
Tests: unit (dev,debug)
Closes#7591
* github.com:scylladb/scylla:
cql3: Rewrite need_filtering() from scratch
cql3: Store index info in statement_restrictions
The name of the utility function test_object_name() is confusing - by
starting with the word "test", pytest can think (if it's imported to the
top-level namespace) that it is a test... So this patch gives it a better
name - unique_name().
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201111140638.809189-1-nyh@scylladb.com>
This patch adds a new document, docs/alternator/compatibility.md,
which focuses on what users switching from DynamoDB to Alternator
need to know about where Alternator differs from DynamoDB and which
features are missing.
The compatibility information in the old alternator.md is not deleted
yet. It probably should.
Fixes#7556
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201110180242.716295-1-nyh@scylladb.com>
* seastar a62a80ba1d...043ecec732 (8):
> semaphore: make_expiry_handler: explicitly use this lambda capture
> configure: add --{enable,disable}-debug-shared-ptr option
> cmake: add SEASTAR_DEBUG_SHARED_PTR also in dev mode
> tls_test: Update the certificates to use sha256
> logger: allow applying a rate-limit to log messages
> Merge "Handle CPUs not attached to any NUMA nodes" from Pavel E
> memory: fix malloc_usable_size() during early initialization
> Merge "make semaphore related functions noexcept" from Benny
Makes it easier to understand, in preparation for separating the WHERE
expression into filtering and storage-reading parts.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
To rewrite need_filtering() in a more readable way, we need to store
info on found indexes in statement_restrictions data members.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
The functions can be simplified as they are all now being called
from a seastar thread.
Make them sequential, returning void, and yielding if necessary.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Some of the callers of get_address_ranges are interested in the ranges
of a specific endpoint.
Rather than building a map for all endpoints and then traversing
it looking for this specific endpoint, build a multimap of token ranges
relating only to the specified endpoint.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently the sorted tokens are copied needlessly by on this path
by `clone_only_token_map` and then recalculated after calling
remove_endpoint for each leaving endpoint.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Call the futurized clone_only_token_map and
remove the _leaving_endpoints from the cloned token_metadata_impl.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Does part of clone_async() using continuations to prevent stalls.
Rename synchronous variant to clone_only_token_map_sync
that is going to be deprecated once all its users will be futurized.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Replacing old code using lw_shared_ptr<token_metadata> with the "modern"
mutable_token_metadata_ptr alias.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Clone the input token_metadata asynchronously using
clone_async() before modifying it using update_normal_tokens.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Many code blocks using with_token_metadata_lock
and get_mutable_token_metadata_ptr now need re-indenting.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Only replace_with_repair needs to clone the token_metadata
and update the local copy, so we can safely pass a read-only
snapshot of the token_metadata rather than copying it in all cases.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Perform replication in 2 phases.
First phase just clones the mutable_token_metadata_ptr on all shards.
Second phase applies the cloned copies onto each local_ss._shared_token_metadata.
That phase should never fail.
To add suspenders over the belt, in the impossible case we do get an
exception, it is logged and we abort.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
clone _token_metadata for updating into _updated_token_metadata
and use it to update the local token_metadata on all shard via
do_update_pending_ranges().
Adjust get_token_metadata to get either the update the updated_token_metadata,
if available, or the base token_metadata.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than using `serialized_action`, grab a lock before mutating
_token_metadata and hold it until its replicated to all shards.
A following patch will use a mutable token_metadata_ptr
that is updated out of line under the lock.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently the replication to other shards happens later in `prepare_to_join`
that is called in `init_server`.
We should isolate the changes made by init_server and update them first
to all shards so that we can serialize them easily using a lock
and a mutable_token_metadata_ptr, otherwise the lock and the mutable_token_metadata_ptr
will have to be handed over (from this call path) to `prepare_to_join`.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
get() the latest token_metadata_ptr from the
shared_token_metadata before each use.
expose get_token_metadata_ptr() rather than get_token_metadata()
so that caller can keep it across continuations.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To facilitate that, keep a const shared_token_metadata& in class database
rather than a const token_metadata&
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
In preparation to chaging network_topology_strategy to
accept a const shared_token_metadata& rather than token_metadata&.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than accessing abstract_replication_strategy::_token_metedata directly.
In preparation to changing it to a shared_token_metadata.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And use it to get a token_metadata& compatible
with current usage, until the services are converted to
use token_metadata_ptr.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that `replicate_tm_only` doesn't throw, we handle all errors
in `replicate_tm_only().handle_exception`.
We can't just proceed with business as usual if we failed to replicate
token_metadata on all shards and continue working with inconsistent
copies.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And with that mark also do_replicate_to_all_cores as noexcept.
The motivation to do so is to catch all errors in replicate_tm_only
and calling on_internal_error in the `handle_exception` continuation
in do_replicate_to_all_cores.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than calling invalidate_cached_rings and update_topology
on all shards do that only on shard 0 and then replicate
to all other shards using replicate_to_all_cores as we do
in all other places that modify token_metadata.
Do this in preparation to using a token_metadata_ptr
with which updating of token_metadata is done on a cloned
copy (serialized under a lock) that becomes visible only when
applied with replicate_to_all_cores.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Move the functionality from gossiping_property_file_snitch::reload_configuration
to the storage_service class.
With that we can make get_mutable_token_metadata private.
TODO: update token_metadata on shard 0 and then
replicate_to_all_cores rather than updating on all shards
in parallel.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
keyspace_changed just calls update_pending_ranges (and ignoring any
errors returned from it), so invoke it on shard 0, and with
that update_pending_ranges() is always called on shard 0
and it doesn't need to use `invoke_on` shard 0 by itself.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We need to assert in only 2 places:
do_update_pending_ranges, that updates token metadata,
and replicate_tm_only, that copies the token metadata
to all other shards.
Currently we throw errors if this is violated
but it should never happen and it's not really recoverable.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently update_pending_ranges involves 2 serialized actions:
do_update_pending_ranges, and then replicate_to_all_cores.
These can be combind by calling do_replicate_to_all_cores
directly from do_update_pending_ranges.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It was introduced in 74b4035611
As part of the fix for #3203.
However, the reactor stalls have nothing to do with gossip
waiting for update_pending_ranges - they are related to it being
synchronous and quadratic in the number of tokens
(e.g. get_address_ranges calling calculate_natural_endpoints
for every token then simple_strategy::calculate_natural_endpoints
calling get_endpoint for every token)
There is nothing special in handle_state_leaving that requires
moving update_pending_ranges to the background, we call
update_pending_ranges in many other places and wait for it
so if gossip loop waiting on it was a real problem, then it'd
be evident in many other places.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently _update_pending_ranges_action is called only on shard 0
and only later update_pending_ranges() updates shard 0 again and replicates
the result to all shards.
There is no need to wait between the two, and call _update_pending_ranges_action
again, so just call update_pending_ranges() in the first place.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
so that the updated host_id (on shard 0) will get replicated to all shards
via update_pending_ranges.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It's already done by each handle_state_* function
either by directly calling replicate_to_all_cores or indirectly, via
update_pending_renages.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently the updates to token_metadata are immediately visible
on shard 0, but not to other shards until replicate_to_all_cores
syncs them.
To prepare for converting to using shared token_metadata.
In the new world the updated token_metadata is not visible
until committed to the shared_token_metadata, so
commit it here and replicate to all other shards.
It is not clear this isn't needed presently too.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
If the consumer happens to check the EOS flag before it hits the
exception injected by the abort (by calling fill_buffer()), they can
think the stream ended normally and expect it to be valid. However this
is not guaranteed when the reader is aborted. To avoid consumers falsely
thinking the stream ended normally, don't set the EOS flag on abort at
all.
Additionally make sure the producer is aborted too on abort. In theory
this is not needed as they are the one initiating the abort, but better
to be safe then sorry.
Fixes: #7411
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20201102100732.35132-1-bdenes@scylladb.com>
"
Today, if scylla crashes mid-way in sstable::idempotent-move-sstable
or sstable::create_links we may end up in an inconsistent state
where it refuses to restart due to the presence of the moved-
sstable component files in both the staging directory and
main directory.
This series hardens scylla against this scenario by:
1. Improving sstable::create_links to identify the replay condition
and support it.
2. Modifying the algorithm for moving sstables between directories
to never be in a state where we have two valid sstable with the
same generation, in both the source and destination directories.
Instead, it uses the temporary TOC file as a marker for rolling
backwards or forewards, and renames it atomically from the
destination directory back to the source directory as a commit
point. Before which it is preparing the sstable in the destination
dir, and after which it starts the process of deleting the sstable
in the source dir.
Fixes#7429
Refs #5714
"
* tag 'idempotent-move-sstable-v3' of github.com:bhalevy/scylla:
sstable: create_links: support for move
sstable_directory: support sstables with both TemporaryTOC and TOC
sstable: create_links: move automatic sstring variables
sstable: create_links: use captured comps
sstable: create_links: capture dir by reference
sstable: create_links: fix indentation
sstable: create_links: no need to roll-back on failure anymore
sstable: create_links: support idempotent replay
sstable: create_links: cleanup style
sstable: create_links: add debug/trace logging
sstable: move_to_new_dir: rm TOC last
sstable: move_to_new_dir: io check remove calls
test: add sstable_move_test
Previously, test/cql-pytest/run was a Python script, while
test/cql-pytest/run-cassandra (to run the tests against Cassandra)
was still a shell script - modeled after test/alternator/run.
This patch makes rewrites run-cassandra in Python.
A lot of the same code is needed for both run and run-cassandra
tools. test/cql-pytest/run was already written in a way that this
common code was separate functions. For example, functions to start a
server in a temporary directory, to check when it finishes booting,
and to clean up at the end. This patch moves this common code to
a new file, "run.py" - and the tools "run" and "cassandra-run" are
very short programs which mostly use functions from run.py (run-cassandra
also has some unique code to run Cassandra, that no other test runner
will need).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201110215210.741753-1-nyh@scylladb.com>
We currently set PATH for relocatable CLI tools in scylla_util.run() and
scylla_util.out(), but it doesn't work for perftune.py, since it's not part of
Scylla, does not use scylla_util module.
We can set PATH in python thunk instead, it can set PATH for all python scripts.
Fixes#7350
After full cluster shutdown, the node which is being replaced will not have its
STATUS set to NORMAL (bug #6088), so listeners will not update _token_metadata.
The bootstrap procedure of replacing node has a workaround for this
and calls update_normal_tokens() on token metadata on behalf of the
replaced node based on just its TOKENS state obtained in the shadow
round.
It does this only for the replacing_a_node_with_same_ip case, but not
for replacing_a_node_with_diff_ip. As a result, replacing the node
with the same ip after full cluster shutdown fails.
We can always call update_normal_tokens(). If the cluster didn't
crash, token_metadata would get the tokens.
Fixes#4325
Message-Id: <1604675972-9398-1-git-send-email-tgrabiec@scylladb.com>
This patch introduces a new way to do functional testing on Scylla,
similar to Alternator's test/alternator but for the CQL API:
The new tests, in test/cql-pytest, are written in Python (using the pytest
framework), and use the standard Python CQL driver to connect to any CQL
implementation - be it Scylla, Cassandra, Amazon Keyspaces, or whatever.
The use of standard CQL allows the test developer to easily run the same
test against both Scylla and Cassandra, to confirm that the behaviour that
our test expects from Scylla is really the "correct" (meaning Cassandra-
compatible) behavior.
A developer can run Scylla or Cassandra manually, and run "pytest"
to connect to them (see README.md for more instructions). But even more
usefully, this patch also provides two scripts: test/cql-pytest/run and
test/cql-pytest/run-cassandra. These scripts automate the task of running
Scylla or Cassandra (respectively) in a random IP address and temporary
directory, and running the tests against it.
The script test/cql-pytest/run is inspired by the existing test run
scripts of Alternator and Redis, but rewritten in Python in a way that
will make it easy to rewrite - in a future patch - all these other run
scripts to use the same common code to safely run a test server in a
temporary directory.
"run" is extremely quick, taking around two seconds to boot Scylla.
"run-cassandra" is slower, taking 13 seconds to boot Cassandra (maybe
this can be improved in the future, I still don't know how).
The tests themselves take milliseconds.
Although the 'run' script runs a single Scylla node, the developer
can also bring up any size of Scylla or Cassandra cluster manually
and run the tests (with "pytest") against this cluster.
This new test framework differs from the existing alternatives in the
following ways:
dtest: dtest focuses on testing correctness of *distributed* behavior,
involving clusters of multiple nodes and often cluster changes
during the test. In contrast, cql-pytest focuses on testing the
*functionality* of a large number of small CQL features - which
can usually be tested on a single-node cluster.
Additionally, dtest is out-of-tree, while cql-pytest is in-tree,
making it much easier to add or change tests together with code
patches.
Finally, dtest tests are notoriously slow. Hundreds of tests in
the new framework can finish faster than a single dtest.
Slow and out-of-tree tests are difficult to write, and I believe
this explains why no developer loves writing dtests and maintainers
do not insist on having them. I hope cql-pytest can change that.
test/cql: The defining difference between the existing test/cql suite
and the new test/cql-pytest is the new framework is programmatic,
Python code, not a text file with desired output. Tests written with
` code allow things like looping, repeating the same test with different
parameters. Also, when a test fails, it makes it easier to understand
why it failed beyond just the fact that the output changed.
Moreover, in some cases, the output changes benignly and cql-pytest
may check just the desired features of the output.
Beyond this, the current version of test/cql cannot run against
Cassandra. test/cql-pytest can.
The primary motivation for this new framework was
https://github.com/scylladb/scylla/issues/7443 - where we had an
esoteric feature (sort order of *partitions* when an index is addded),
which can be shown in Cqlsh to have what we think is incorrect behavior,
and yet: 1. We didn't catch this bug because we never wrote a test for it,
possibly because it too difficult to contribute tests, and 2. We *thought*
that we knew what Cassandra does in this case, but nobody actually tested
it. Yes, we can test it manually with cqlsh, but wouldn't everything be
better if we could just run the same test that we wrote for Scylla against
Cassandra?
So one of the tests we add in this patch confirms issue #7443 in Scylla,
and that our hunch was correct and Cassandra indeed does not have this
problem. I also add a few trivial tests for keyspace create and drop,
as additional simple examples.
Refs #7443.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201110110301.672148-1-nyh@scylladb.com>
I noticed that we require filtering for continuous clustering key, which is not necessary. I dropped the requirement and made sure the correct data is read from the storage proxy.
The corresponding dtest PR: https://github.com/scylladb/scylla-dtest/pull/1727
Tests: unit (dev,debug), dtest (next-gating, cql*py)
Closes#7460
* github.com:scylladb/scylla:
cql3: Delete some newlines
cql3: Drop superfluous ALLOW FILTERING
cql3: Drop unneeded filtering for continuous CK
When there are too many in-flight hints, writes start returning
overloaded exceptions. We're missing metrics for that, and these could
be useful when judging if the system is in overloaded state.
This commit changes the build file generation and the package
creation scripts to be product aware. This will change the
relocatable package archives to be named after the product,
this commit deals with two main things:
1. Creating the actual Scylla server relocatable with a product
prefixed name - which is independent of any other change
2. Expect all other packages to create product prefixed archive -
which is dependant uppon the actual submodules creating
product prefixed archives.
If the support is not introduced in the submodules first this
will break the package build.
Tests: Scylla full build with the original product and a
different product name.
Closes#7581
Currently debian_files_gen.py mistakenly renames scylla-server.service to
"scylla-server." on non-standard product name environment such as
scylla-enterprise, it should be fix to correct filename.
Fixes#7423
This patch introduces many changes to the Scylla `CMakeLists.txt`
to enable building Scylla without resorting to pre-building
with a previous configure.py build, i.e. cmake script can now
be used as a standalone solution to build and execute scylla.
Submodules, such as Seastar and Abseil, are also dealt with
by importing their CMake scripts directly via `add_subdirectory`
calls. Other submodules, such as `libdeflate` now have a
custom command to build the library at runtime.
There are still a lot of things that are incomplete, though:
* Missing auxiliary packaging targets
* Unit-tests are not built (First priority to address in the
following patches)
* Compile and link flags are mostly hardcoded to the values
appropriate for the most recent Fedora 33 installation.
System libraries should be found via built-in `Find*` scripts,
compiler and linker flags should be observed and tested by
executing feature tests.
* The current build is aimed to be built by GCC, need to support
Clang since we are moving to it.
* Utility cmake functions should be moved to a separate "cmake"
directory.
The script is updated to use the most recent CMake version available
in Fedora 33, which is 3.18.
Right now this is more of a PoC rather that a full-fledged solution
but as far as it's not used widely, we are free to evolve it in
a relaxed manner, improving it step by step to achieve feature
parity with `configure.py` solution.
The value in this patch is that now we are able to use any
C++ IDE capable of dealing with CMake solutions and take
advantage of their built-in capabilities, such as:
* Building a code model to efficiently navigate code.
* Find references to symbols.
* Use pretty-printers, beautifiers and other tools conveniently.
* Run scylla and debug it right from the IDE.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20201103221619.612294-1-pa.solodovnikov@scylladb.com>
DescribeTable should return a UUID "TableId" in its reponse.
We alread had it for CreateTable, and now this patch adds it to
DescribeTable.
The test for this feature is no longer xfail. Moreover, I improved
the test to not only check that the TableId field is present - it
should also match the documented regular expression (the standard
representation of a UUID).
Refs #5026
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201104114234.363046-1-nyh@scylladb.com>
When moving a sstable between directories, we would like to
be able to crash at any point during the algorithm with a
clear way to either roll the operation forwards or backwards.
To achieve that, define sstable::create_links_common that accepts
a `mark_for_removal` flag, implementing the following algorithm:
1. link src.toc to dst.temp_toc.
until removed, the destination sstable is marked for removal.
2. link all src components to dst.
crashing here will leave dst with both temp_toc and toc.
3.
a. if mark_for_removal is unset then just remove dst.temp_toc.
this is commit the destination sstable and complete create_links.
b. if mark_for_removal is set then move dst.temp_toc to src.temp_toc.
this will atomically toggle recovery after crash from roll-back
to roll-forward.
here too, crashing at this point will leave src with both
temp_toc and toc.
Adjust the unit test for the revised algorithm.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Keep descriptors in a map so it could be searched easily by generation.
and possibly delete the descriptor, if found, in the presence of
a temporary toc component.
A following patch will add support to create_links for moving
sstables between directories. It is based on keeping a TemporaryTOC
file in the destination directory while linking all source components.
If scylla crashes here, the destination sstable will have both
its TemporaryTOC and TOC components and it needs to be removed
to roll the move backwards.
Then, create_links will atomically move the TemporaryTOC from
the destination back to the source directory, to toggle rolling
back to rolling forward by marking the source sstable for removal.
If scylla crashes here, the source sstable will have both
its TemporaryTOC and TOC components and it needs to be removed
to roll the move forward.
Add unit test for this case.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that we use `idempotent_link_file` it'll no longer
fail with EEXIST in a replay scenario.
It may fail on ENOENT, and return an exceptional future.
This will be propagated up the stack. Since it may indicate
parallel invokation of move_to_new_dir, that deletes the source
sstable while this thread links it to the same destination,
rolling back by removing the destination links would
be dangerous.
For an other error, the node is going to be isolated
and stop operation.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Handle the case where create_link is replayed after crashing in the middle.
In particular, if we restart when moving sstables from staging to the base dir,
right after create_links completes, and right before deleting the source links,
we end up with seemingly 2 valid sstables, one still in staging and the other
already in the base table directory, both are hard linked to the same inodes.
Make create_links idempotent so it can replay the operation safely if crashed and
restarted at any point of its operation.
Add unit tests for replay after partial create_links that is expected to succeed,
and a test for replay when an sstable exist in the destination that is not
hard-linked to the source sstable; create_links is expected to fail in this case.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To facilitate cleanup on crash, first rename the TOC file to TOC.tmp,
and keep until all other files are removed, finally remove TOC.tmp.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This use-after move was apprently exposed after switching to clang
in commit eb861e68e9.
The directory_entry is required for std::stoi(de.name.c_str())
and later in the catch{} clause.
This shows in the node logs as a "Ignore invalid directory" debug
log message with an empty name, and caused the hintedhandoff_rebalance_test
to fail when hints files aren't rebalanced.
Test: unit(dev)
DTest: hintedhandoff_additional_test.py:TestHintedHandoff.hintedhandoff_rebalance_test (dev, debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201106172017.823577-1-bhalevy@scylladb.com>
On older distribution such as CentOS7, it does not support systemd user mode.
On such distribution nonroot mode does not work, show warning message and
skip running systemctl --user.
Fixes#7071
... to config descriptions
We allow setting the transitional auth as one of the options
in scylla.yaml, but don't mention it at all in the field's
description. Let's change that.
Closes#7565
The function is used by raft and fails with ubsan and clang.
The ub is harmless. Lets wait for it to be fixed in boost.
Message-Id: <20201109090353.GZ3722852@scylladb.com>
Retry mechanism didn't work when URLError happend. For example:
urllib.error.URLError: <urlopen error [Errno 101] Network is unreachable>
Let's catch URLError instead of HTTP since URLError is a base exception
for all exceptions in the urllib module.
Fixes: #7569Closes#7567
If _offset falls beyond compound_type->types().size()
ignore the extra components instead of accessing out of the types
vector range.
FIXME: we should validate the thrift key against the schema
and reject it in the thrift handler layer.
Refs #7568
Test: unit(dev)
DTest: cql_tests.py:MiscellaneousCQLTester.cql3_insert_thrift_test (dev, debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201108175738.1006817-1-bhalevy@scylladb.com>
Users can change `durable_writes` anytime with ALTER KEYSPACE.
Cassandra reads the value of `durable_writes` every time when applying
a mutation, so changes to that setting take effect immediately. That is,
mutations are added to the commitlog only when `durable_writes` is `true`
at the moment of their application.
Scylla reads the value of `durable_writes` only at `keyspace` construction time,
so changes to that setting take effect only after Scylla is restarted.
This patch fixes the inconsistency.
Fixes#3034Closes#7533
This series provides assorted fixes which are a
pre-requisite for the joint consensus implementation
series which follows.
* scylla-dev/raft-misc:
raft: fix raft_fsm_test flakiness
raft: drop a waiter of snapshoted entry
raft: use correct type for node info in add_server()
raft: overload operator<< for debugging
An index that is waited can be included in an installed snapshot in
which case there is no way to know if the entry was committed or not.
Abort such waiters with an appropriate error.
Overload operator<< for ostream and print relevant state for server, fsm, log,
and typed_uint64 types.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
The query processor is present in the global namespace and is
widely accessed with global get(_local)?_query_processor().
There's a long-term task to get rid of this globality and make
services and componenets reference each-other and, for and
due-to this, start and stop in specific order. This set makes
this for the query processor.
The remaining users of it are -- alternator, controllers for
client services, schema_tables and sys_dist_ks. All of them
except for the schema_tables are fixed just by passing the
reference on query processor with small patches. The schema
tables accessing qp sit deep inside the paxos code, but can
be "fixed" with the qctx thing until the qctx itself is
de-globalized.
* https://github.com/xemul/scylla/tree/br-rip-global-query-processor:
code: RIP global query processor instance
cql test env: Keep query processor reference on board
system distributed keyspace: Start sharded service erarlier
schema_tables: Use qctx to make internal requests
transport: Keep sharded query processor reference on controller
thrift: Keep sharded query processor reference on controller
alternator: Use local query processor reference to get keys
alternator: Keep local query processor reference in server
The only purpose of this change is to compile (git-bisect
safety) and thus prove that the next patch is correct.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When the view builder cannot read view building progress from an
internal CQL table it produces an error message, but that only confuses
the user and the test suite -- this situation is entirely recoverable,
because the builder simply assumes that there is no progress and the
view building should start from scratch.
Fixes#7527Closes#7558
repair: Use single writer for all followers
Currently, repair master create one writer for each follower to write
rows from follower to sstables. That are RF - 1 writers in total. Each
writer creates 1 sstable for the range repaired, usually a vnode range.
Those sstables for a given vnode range are disjoint.
To reduce the compaction work, we can create one writer for all the
followers. This reduces the number of sstables generated by repair
significantly to one per vnode range from RF - 1 per vnode range.
Fixes#7525Closes#7528
* github.com:scylladb/scylla:
repair: No more vector for _writer_done and friends
repair: Use single writer for all followers
The default of DBUILD_TOOL=docker requires passwordless access to docker
by the user of dbuild. This is insecure, as any user with unconstrained
access to docker is root equivalent. Therefore, users might prefer to
run docker as root (e.g. by setting DBUILD_TOOL="sudo docker").
However, `$tool -e HOME` exports HOME as seen by $tool.
This breaks dbuild when `$tool` runs docker as a another user.
`$tool -e HOME="$HOME"` exports HOME as seen by dbuild, which is
the intended behaviour.
Closes#7555
Instead of invoking `$tool`, as is done everywhere else in dbuild,
kill_it() invoked `docker` explicitly. This was slightly breaking the
script for DBUILD_TOOL other than `docker`.
Closes#7554
Cleanup compaction is using consume_pausable_in_thread() to skip over
disowned partitions, which uses flat_mutation_reader::next_partition().
The implementation of next_partition() for the sstable reader has a
bug which may cause the following assertion failure:
scylla: sstables/mp_row_consumer.hh:422: row_consumer::proceed sstables::mp_row_consumer_k_l::flush(): Assertion `!_ready' failed.
This happens when the sstable reader's buffer gets full when we reach
the partition end. The last fragment of the partition won't be pushed
into the buffer but will stay in the _ready variable. When
next_partition() is called in this state, _ready will not be cleared
and the fragment will be carried over to the next partition. This will
cause assertion failure when the reader attempts to emit the first
fragment of the next partition.
The fix is to clear _ready when entering a partition, just like we
clear _range_tombstones there.
Fixes#7553.
Message-Id: <1604534702-12777-1-git-send-email-tgrabiec@scylladb.com>
Fixes returned rows ordering to proper signed token ordering. Before this change, rows were sorted by token, but using unsigned comparison, meaning that negative tokens appeared after positive tokens.
Rename `token_column_computation` to `legacy_token_column_computation` and add some comments describing this computation.
Added (new) `token_column_computation` which returns token as `long_type`, which is sorted using signed comparison - the correct ordering of tokens.
Add new `correct_idx_token_in_secondary_index` feature, which flags that the whole cluster is able to use new `token_column_computation`.
Switch token computation in secondary indexes to (new) `token_column_computation`, which fixes the ordering. This column computation type is only set if cluster supports `correct_idx_token_in_secondary_index` feature to make sure that all nodes
will be able to compute new `token_column_computation`. Also old indexes will need to be rebuilt to take advantage of this fix, as new token column computation type is only set for new indexes.
Fix tests according to new token ordering and add one new test to validate this aspect explicitly.
Fixes#7443
Tested manually a scenario when someone created an index on old version of Scylla and then migrated to new Scylla. Old index continued to work properly (but returning in wrong order). Upon dropping and re-creating the index, it still returned the same data, but now in correct order.
Closes#7534
* github.com:scylladb/scylla:
tests: add token ordering test of indexed selects
tests: fix tests according to new token ordering
secondary_index: use new token_column_computation
feature: add correct_idx_token_in_secondary_index
column_computation: add token_column_computation
token_column_computation: rename as legacy
The shared_from_this lw_shared_ptr must not be accessed
across shards. Capturing it in the lambda passed to
mutation_writer::distribute_reader_and_consume_on_shards
causes exactly that since the captured lw_shared_ptr
is copied on other shards, and ends up in memory corruption
as seen in #7535 (probably due to lw_shared_ptr._count
going out-of-sync when incremented/decremented in parallel
on other shards with no synchronization.
This was introduced in 289a08072a.
The writer is not needed in the body of this lambda anyways
so it doesn't need to capture it. It is already held
by the continuations until the end of the chain.
Fixes#7535
Test: repair_additional_test:RepairAdditionalTest.repair_disjoint_row_3nodes_diff_shard_count_test (dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201104142216.125249-1-bhalevy@scylladb.com>
"
Since we are switching to clang due to raft make it actually compile
with clang.
"
tgrabiec: Dropped the patch "raft: compile raft by default" because
the replication_test still fails in debug mode:
/usr/include/boost/container/deque.hpp:1802:63: runtime error: applying non-zero offset 8 to null pointer
* 'raft-clang-v2' of github.com:scylladb/scylla-dev:
raft: Use different type to create type dependent statement for static assertion
raft: drop use of <ranges> for clang
raft: make test compile with clang
raft: drop -fcoroutines support from configure.py
Now that both repair followers and repair master use a single writer. We
can get rid of the vector associated with _writer_done and friends.
Fixes#7525
Currently, repair master create one writer for each follower to write
rows from follower to sstables. That are RF - 1 writers in total. Each
writer creates 1 sstable for the range repaired, usually a vnode range.
Those sstables for a given vnode range are disjoint.
To reduce the compaction work, we can create one writer for all the
followers. This reduces the number of sstables generated by repair
significantly to one per vnode range from RF - 1 per vnode range.
Fixes#7525
A gcc bug [1] caused objects built by different versions of gcc
not to interoperate. Gcc helpfully warns when it encounters code that
could be affected.
Since we build everything with one version, and as that versions is far
newer than the last version generating incorrect code, we can silence
that warning without issue.
[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77728Closes#7495
Do not run tests which are not built.
For that, pass the test list from configure.py to test.py
via ninja unit_test_list target.
Minor cleanups.
* scylla-dev.git/test.py-list:
test: enable raft tests
test.py: do not run tests which are not built
configure.py: add a ninja command to print unit test list
test.py: handle ninja mode_list failure
configure.py: don't pass modes_list unless it's used
Add new test validating that rows returned from both non-indexed selects
and indexed selects return rows sorted in token order (making sure
that both positive and negative tokens are present to test if signed
comparison order is maintained).
Switches token column computation to (new) token_column_computation,
which fixes#7443, because new token column will be compared using
signed comparisons, not the previous unsigned comparison of CQL bytes
type.
This column computation type is only set if cluster supports
correct_idx_token_in_secondary_index feature to make sure that all nodes
will be able to compute (new) token_column_computation. Also old
indexes will need to be rebuilt to take advantage of this fix, as new
token column computation type is only set for new indexes.
Add new correct_idx_token_in_secondary_index feature, which will be used
to determine if all nodes in the cluster support new
token_column_computation. This column computation will replace
legacy_token_column_computation in secondary indexes, which was
incorrect as this column computation produced values that when compared
with unsigned comparison (CQL type bytes comparison) resulted in
different ordering than token signed comparison. See issue:
https://github.com/scylladb/scylla/issues/7443
Introduce new token_column_computation class which is intended to
replace legacy_token_column_computation. The new column computation
returns token as long_type, which means that it will be ordered
according to signed comparison (not unsigned comparison of bytes), which
is the correct ordering of tokens.
Raname token_column_computation to legacy_token_column_computation, as
it will be replaced with new column_computation. The reason is that this
computation returns bytes, but all tokens in Scylla can now be
represented by int64_t. Moreover, returning bytes causes invalid token
ordering as bytes comparison is done in unsigned way (not signed as
int64_t). See issue:
https://github.com/scylladb/scylla/issues/7443
meaningful
When computing moving average rates too early after startup, the
rate can be infinite, this is simply because the sample interval
since the system started is too small to generate meaningful results.
Here we check for this situation and keep the rate at 0 if it happens
to signal that there are still no meaningful results.
This incident is unlikely to happen since it can happen only during a
very small time window after restart, so we add a hint to the compiler
to optimize for that in order to have a minimum impact on the normal
usecase.
Fixes#4469
The memory configuration for the database object was left at zero.
This can cause the following chain of failures:
- the test is a little slow due to the machine being overloaded,
and debug mode
- this causes the memtable flush_controller timer to fire before
the test completes
- the backlog computation callback is called
- this calculates the backlog as dirty_memory / total_memory; this
is 0.0/0.0, which resolves to NaN
- eventually this gets converted to an integer
- UBSAN dooesn't like the convertion from NaN to integer, and complains
Fix by initializing dbcfg.available_memory.
Test: gossip_test(debug), 1000 repetitions with concurrency 6
Closes#7544
Fixes#7325
When building with clang on fedora32, calling the string_view constructor
of bignum generates broken ID:s (i.e. parsing borks). Creating a temp
std::string fixes it.
Closes#7542
Since 11a8912093, get_gossip_status
returns a std::string_view rather than a sstring.
As seen in dtest we may print garbage to the log
if we print the string_view after preemption (calling
_gossiper.reset_endpoint_state_map().get())
Test: update_cluster_layout_tests:TestUpdateClusterLayout.simple_add_two_nodes_in_parallel_test (dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201103132720.559168-1-bhalevy@scylladb.com>
"
Introduce a gentle (yielding) implementation of reserve for chunked
vector and use it when reserving the backing storage vector for large
bitset. Large bitset is used by bloom filters, which can be quite large
and have been observed to cause stalls when allocating memory for the
storage.
Fixes: #6974
Tests: unit(dev)
"
* 'gentle-reserve/v1' of https://github.com/denesb/scylla:
utils/large_bitset: use reserve_partial() to reserve _storage
utils/chunked_vector: add reserve_partial()
There's a perf_bptree test that compares B+ tree collection with
std::set and std::map ones. There will come more, also the "patterns"
to compare are not just "fill with keys" and "drain to empty", so
here's the perf_collection test, that measures timings of
- fill with keys
- drain key by key
- empty with .clear() call
- full scan with iterator
- insert-and-remove of a single element
for currently used collections
- std::set
- std::map
- intrusive_set_external_comparator
- bplus::tree
* https://github.com/xemul/scylla/tree/br-perf-collection-test:
test: Generalize perf_bptree into perf_collection
perf_collection: Clear collection between itartions
perf_collection: Add intrusive_set_external_comparator
perf_collection: Add test for single element insertion
perf_collection: Add test for destruction with .clear()
perf_collection: Add test for full scan time
To avoid stalls when reserving memory for a large bloom filter. The
filter creation already has a yielding loop for initialization, this
patch extends it to reservation of memory too.
A variant of reserve() which allows gentle reserving of memory. This
variant will allocate just one chunk at a time. To drive it to
completion, one should call it repeatedly with the return value of the
previous call, until it returns 0.
This variant will be used in the next patch by the large bitset creation
code, to avoid stalls when allocating large bloom filters (which are
backed by large bitset).
Although the code for it existed already, the validation function
hasn't been invoked properly. This change fixes that, adding
a validating check when converting from text to specific value
type and throwing a marshal exception if some characters
are not ASCII.
Fixes#5421Closes#7532
When unit tests fail the test.py dump their output on the screen. This is impossible
to read this output from the terminal, all the more so the logs are anyway saved in
the testlog/ directory. At the same time the names of the failed tests are all left
_before_ these logs, and if the terminal history is not large enough, it becomes
quite annoying to find the names out.
The proposal is not to spoil the terminal with raw logs -- just names and summaries.
Logs themselves are at testlog/$mode/$name_of_the_test.log
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20201031154518.22257-1-xemul@scylladb.com>
Otherwise all readers will be created with the default forwarding::yes.
This inhibits some optimizations (e.g. results in more sstable read-ahead).
It will also be problematic when we introduce mutation sources which don't support
forwarding::yes in the future.
Message-Id: <1604065206-3034-1-git-send-email-tgrabiec@scylladb.com>
Clang brings us working support for coroutines, which are
needed for Raft and for code simplification.
perf_simple_query as well as full system tests show no
significant performance regression.
Test: unit(dev, release, debug)
Closes#7531
Fixes#7496
Since cdc log now has an end-of-batch/record marker that tells
us explicitly that we've read the last row of a change, we
can use this instead of timestamp checks + limit extra to
ensure we have complete records.
Note that this does not try to fulfill user query limit
exact. To do this we would need to add a loop and potentially
re-query if quried rows are not enough. But that is a
separate exercise, and superbly suited for coroutines!
Closes#7498
* github.com:scylladb/scylla:
alternator::streams: Reduce the query limit depending on cdc opts
alternator::streams: Use end-of-record info in get_records
This series cleans up the gossiper endpoint_state interface
marking methods const and const noexcept where possible.
To achieve that, endpoint_state::get_status was changed to
return a string_view rather than a sstring so it won't
need to allocate memory.
Also, the get_cluster_name and get_partitioner_name were
changes to return a const sstring& rather than sstring
so they won't need to allocate memory.
The motivation for the series stems from #7339
where an exception in get_host_id within a storage_service
notification handler, called from seastar::defer crashed
the server.
With this series, get_host_id may still throw exceptions on
logical error, but not from calling get_application_state_ptr.
Refs #7339
Test: unit(dev)
* tag 'gossiper-endpoint-noexcept-v2':
gossiper: mark trivial methods noexcept
gossiper: get_cluster_name, get_partitioner_name: make noexcept
gossiper: get_gossip_status: return string_view and make noexcept
gms/endpoint_state: mark methods using get_status noexcept
gms/endpoint_state: get_status: return string_view and make noexcept
gms/endpoint_state: mark get_application_state_ptr and is_cql_ready noexcept
gms/endpoint_state: mark trivial methods noexcept
gms/heart_beat_state: mark methods noexcept
gms/versioned_value: mark trivial methods noexcept
gms/version_generator: mark get_next_version noexcept
fb_utilities.hh: mark methods noexcept
messaging: msg_addr: mark methods noexcept
gms/inet_address: mark methods noexcept
when logging in to the GCE instance that is created from the GCE image it takes 10 seconds to understand that we are not running on AWS. Also, some unnecessary debug logging messages are printed:
```
bentsi@bentsi-G3-3590:~/devel/scylladb$ ssh -i ~/.ssh/scylla-qa-ec2 bentsi@35.196.8.86
Warning: Permanently added '35.196.8.86' (ECDSA) to the list of known hosts.
Last login: Sun Nov 1 22:14:57 2020 from 108.128.125.4
_____ _ _ _____ ____
/ ____| | | | | __ \| _ \
| (___ ___ _ _| | | __ _| | | | |_) |
\___ \ / __| | | | | |/ _` | | | | _ <
____) | (__| |_| | | | (_| | |__| | |_) |
|_____/ \___|\__, |_|_|\__,_|_____/|____/
__/ |
|___/
Version:
666.development-0.20201101.6be9f4938
Nodetool:
nodetool help
CQL Shell:
cqlsh
More documentation available at:
http://www.scylladb.com/doc/
By default, Scylla sends certain information about this node to a data collection server. For information, see http://www.scylladb.com/privacy/
WARNING:root:Failed to grab http://169.254.169.254/latest/...
WARNING:root:Failed to grab http://169.254.169.254/latest/...
Initial image configuration failed!
To see status, run
'systemctl status scylla-image-setup'
[bentsi@artifacts-gce-image-jenkins-db-node-aa57409d-0-1 ~]$
```
this PR fixes this
Closes#7523
* github.com:scylladb/scylla:
scylla_util.py: remove unnecessary logging
scylla_util.py: make is_aws_instance faster
scylla_util.py: added ability to control sleep time between retries in curl()
Old secondary index schemas did not have their idx_token column
marked as computed, and there already exists code which updates
them. Unfortunately, the fix itself contains an error and doesn't
fire if computed columns are not yet supported by the whole cluster,
which is a very common situation during upgrades.
Fixes#7515Closes#7516
Fixes#7496
Since cdc log now has an end-of-batch/record marker that tells
us explicitly that we've read the last row of a change, we
can use this instead of timestamp checks + limit extra to
ensure we have complete records.
Note that this does not try to fulfill user query limit
exact. To do this we would need to add a loop and potentially
re-query if quried rows are not enough. But that is a
separate exercise, and superbly suited for coroutines!
The instructions are updated for multiarch images (images that
can be used on x86 and ARM machines).
Additionally,
- docker is replaced with podman, since that is now used by
developers. Docker is still supported for developers, but
the image creation instructions are only tested with podman.
- added instructions about updating submodules
- `--format docker` is removed. It is not necessary with
more recent versions of docker.
Closes#7521
connection_notifier.hh defines a number of template-specialized
variables in a header. This is illegal since you're allowed to
define something multiple times if it's a template, but not if it's
fully specialized. gcc doesn't care but clang notices and complains.
Fix by defining the variiables as inline variables, which are
allowed to have definitions in multiple translation units.
Closes#7519
Some ARM cores are slow, and trip our current timeout of 3000
seconds in debug mode. Quadrupling the timeout is enough to make
debug-mode tests pass on those machines.
Since the timeout's role is to catch rare infinite loops in unsupervised
testing, increasing the timeout has no ill effect (other than to
delay the report of the failure).
Closes#7518
The main goal of this this series is to fix issue #6951 - a Query (or Scan) with
a combination of filtering and projection parameters produced wrong results if
the filter needs some attributes which weren't projected.
This series also adds new tests for various corner cases of this issue. These
new tests also pass after this fix, or still fail because some other missing
feature (namely, nested attributes). These additional tests will be important if
we ever want to refactor or optimize this code, because they exercise some rare
corner code paths at the intersection of filtering and projection.
This series also fixes some additional problems related to this issue, like
combining old and new filtering/projection syntaxes (should be forbidden), and
even one fix to a wrong comment.
Closes#7328
* github.com:scylladb/scylla:
alternator test: tests for nested attributes in FilterExpression
alternator test: fix comment
alternator tests: additional tests for filter+projection combination
alternator: forbid combining old and new-style parameters
alternator: fix query with both projection and filtering
when calling curl and exception is raised we can see unnecessary log messages that we can't control.
For example when used in scylla_login we can see following messages:
WARNING:root:Failed to grab http://169.254.169.254/latest/...
WARNING:root:Failed to grab http://169.254.169.254/latest/...
Initial image configuration failed!
To see status, run
'systemctl status scylla-image-setup'
These methods can return a const sstring& rather than
allocating a sstring. And with that they can be marked noexcept.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Change get_gossip_status to return string_view,
and with that it can be noexcept now that it doesn't
allocate memory via sstring.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that get_status returns string_view, just compare it with a const char*
rather than making a sstring out of it, and consequently, can be marked noexcept.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
get_status doesn't need to allocate a sstring, it can just
return a std::string_view to the status string, if found.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Although std::map::find is not guaranteed to be noexcept
it depends on the comperator used and in this case comparing application_state
is noexcept. Therefore, we can safely mark get_application_state_ptr noexcept.
is_cql_ready depends on get_application_state_ptr and otherwise
handles an exceptions boost::lexical_cast so it can be marked
noexcept as well.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that get_next_version() is noexcept,
update_heart_beat can be noexcept too.
All others are trivially noexcept.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Based on gms::inet_address.
With that, gossiper::get_msg_addr can be marked noexcept (and const while at it).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Clang does not implement P1814R0 (class template argument deduction
for alias templates), so it can't deduce the template arguments
for range_bound, but it can for interval_bound, so switch to that.
Using the modern name rather than the compatibility alias is preferred
anyway.
Closes#7422
In commit de38091827 the two IO priority classes streaming_read
and streaming_write into just one. The document docs/isolation.md
leaves a lot to be desired (hint, hint, to anyone reading this and
can write content!) but let's at least not have incorrect information
there.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201101102220.2943159-1-nyh@scylladb.com>
This series adds more context to debugging information in case a view gets out of sync with its base table.
A test was conducted manually, by:
1. creating a table with a secondary index
2. manually deleting computed column information from system_schema.computed_columns
3. restarting the target node
4. trying to write to the index
Here's what's logged right after the index metadata is loaded from disk:
```
ERROR 2020-10-30 12:30:42,806 [shard 0] view - Column idx_token in view ks.t_c_idx_index was not found in the base table ks.t
ERROR 2020-10-30 12:30:42,806 [shard 0] view - Missing idx_token column is caused by an incorrect upgrade of a secondary index. Please recreate index ks.t_c_idx_index to avoid future issues.
```
And here's what's logged during the actual failure - when Scylla notices that there exists
a column which is not computed, but it's also not found in the base table:
```
ERROR 2020-10-30 12:31:25,709 [shard 0] storage_proxy - exception during mutation write to 127.0.0.1: seastar::internal::backtraced<std::runtime_error> (base_schema(): operation unsupported when initialized only for view reads. Missing column in the base table: idx_token Backtrace: 0x1d14513
0x1d1468b
0x1d1492b
0x109bbad
0x109bc97
0x109bcf4
0x1bc4370
0x1381cd3
0x1389c38
0xaf89bf
0xaf9b20
0xaf1654
0xaf1afe
0xb10525
0xb10ad8
0xb10c3a
0xaaefac
0xabf525
0xabf262
0xac107f
0x1ba8ede
0x1bdf749
0x1be338c
0x1bfe984
0x1ba73fa
0x1ba77a4
0x9ea2c8
/lib64/libc.so.6+0x27041
0x9d11cd
--------
seastar::lambda_task<seastar::execution_stage::flush()::{lambda()#1}>
```
Hopefully, this information will make it much easier to solve future problems with out-of-sync views.
Tests: unit(dev)
Fixes#7512Closes#7513
* github.com:scylladb/scylla:
view: add printing missing base column on errors
view: simplify creating base-dependent info for reads only
view: fix typo: s/dependant/dependent
view: add error logs if a view is out of sync with its base
In certain CQL statements it's possible to provide a custom timestamp via the USING TIMESTAMP clause. Those values are accepted in microseconds, however, there's no limit on the timestamp (apart from type size constraint) and providing a timestamp in a different unit like nanoseconds can lead to creating an entry with a timestamp way ahead in the future, thus compromising the table.
To avoid this, this change introduces a sanity check for modification and batch statements that raises an error when a timestamp of more than 3 days into the future is provided.
Fixes#5619Closes#7475
The constructors just set up the references, real start happens in .start()
so it is safe to do this early. This helps not carrying migration manager
and query processor down the storage service cluster joining code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The query processor global instance is going away. The schema_tables usage
of it requires a huge rework to push the qp reference to the needed places.
However, those places talk to system keyspace and are thus the users of the
"qctx" thing -- the query context for local internal requests.
To make cql tests not crash on null qctx pointer, its initialization should
come earlier (conforming to the main start sequence).
The qctx itself is a global pointer, which waits for its fix too, of course.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When an out-of-sync view is attempted to be used in a write operation,
the whole operation needs to be aborted with an error. After this patch,
the error contains more context - namely, the missing column.
The code which created base-dependent info for materialized views
can be expressed with fewer branches. Also, the constructor
which takes a single parameter is made explicit.
When Scylla finds out that a materialized view contains columns
which are not present in the base table (and they are not computed),
it now presents comprehensible errors in the log.
* seastar 6973080cd1...57b758c2f9 (11):
> http: handle 'match all' rule correctly
> http: add missing HTTP methods
> memory: remove unused lambda capture in on_allocation_failure()
> Support seastar allocator when seastar::alien is used
> Merge "make timer related functions noexcept" from Benny
> script: update dependecy packages for centos7/8
> tutorial: add linebreak between sections
> doc: add nav for the second last chap
> doc: add nav bar at the bottom also
> doc: rename add_prologue() to add_nav_to_body()
> Wrong name used in an example in mini tutorial.
An upcoming change in Seastar only initializes the Seastar allocator in
reactor threads. This causes imr_test and double_decker_test to fail:
1. Those tests rely on LSA working
2. LSA requires the Seastar allocator
3. Seastar is not initialized, so the Seastar allocator is not initialized.
Fix by switching to the Seastar test framework, which initializes Seastar.
Closes#7486
test.py estimates the amount of memory needed per test
in order not to overload the machine, but it underestimates
badly and so machines with many cores but not a lot of memory
fail the tests (in debug mode principally) due to running out
of memory.
Increase the estimate from 2GB per test to 6GB.
Closes#7499
gcc collects all the initialization code for thread-local storage
and puts it in one giant function. In combination with debug mode,
this creates a very large stack frame that overflows the stack
on aarch64.
Work around the problem by placing each initializer expression in
its own function, thus reusing the stack.
Closes#7509
Add additional comments to select_statement_utils, fix formatting, add
missing #pragma once and introduce set_internal_paging_size_guard to
set internal_paging in RAII fashion.
Closes#7507
Gossip currently runs inside the default (main) scheduling group. It is
fine to run inside default scheduling group. From time to time, we see
many tasks in main scheduling group and we suspect gossip. It is best
we can move gossip to a dedicated scheduling group, so that we can catch
bugs that leak tasks to main group more easily.
After this patch, we can check:
scylla_scheduler_time_spent_on_task_quota_violations_ms{group="gossip",shard="0"}
Fixes: #7154
Tests: unit(dev)
We require a kernel that is at least 3.10.0-514, because older
kernel have an XFS related bug that causes data corruption. However
this Requires: clause pulls in a kernel even in Docker installation,
where it (and especially the associated firmware) occupies a lot of
space.
Change to a Conflicts: instead. This prevents installation when
the really old kernel is present, but doesn't pull it in for the
Docker image.
Closes#7502
Overview
Fixes#7355.
Before this changes, there were a few invalid results of aggregates/GROUP BY on tables with secondary indexes (see below).
Unfortunately, it still does NOT fix the problem in issue #7043. Although this PR moves forward fixing of that issue, there is still a bug with `TOKEN(...)` in `WHERE` clauses of indexed selects that is not addressed in this PR. It will be fixed in my next PR.
It does NOT fix the problems in issues #7432, #7431 as those are out-of-scope of this PR and do not affect the correctness of results (only return a too large page).
GROUP BY (first commit)
Before the change, `GROUP BY` `SELECT`s with some `WHERE` restrictions on an indexed column would return invalid results (same grouped column values appearing multiple times):
```
CREATE TABLE ks.t(pk int, ck int, v int, PRIMARY KEY(pk, ck));
CREATE INDEX ks_t on ks.t(v);
INSERT INTO ks.t(pk, ck, v) VALUES (1, 2, 3);
INSERT INTO ks.t(pk, ck, v) VALUES (1, 4, 3);
SELECT pk FROM ks.t WHERE v=3 GROUP BY pk;
pk
----
1
1
```
This is fixed by correctly passing `_group_by_cell_indices` to `result_set_builder`. Fixes the third failing example from issue #7355.
Paging (second commit)
Fixes two issues related to improper paging on indexed `SELECT`s. As those two issues are closely related (fixing one without fixing the other causes invalid results of queries), they are in a single commit (second commit).
The first issue is that when using `slice.set_range`, the existing `_row_ranges` (which specify clustering key prefixes) are not taken into account. This caused the wrong rows to be included in the result, as the clustering key bound was set to a half-open range:
```
CREATE TABLE ks.t(a int, b int, c int, PRIMARY KEY ((a, b), c));
CREATE INDEX kst_index ON ks.t(c);
INSERT INTO ks.t(a, b, c) VALUES (1, 2, 3);
INSERT INTO ks.t(a, b, c) VALUES (1, 2, 4);
INSERT INTO ks.t(a, b, c) VALUES (1, 2, 5);
SELECT COUNT(*) FROM ks.t WHERE c = 3;
count
-------
2
```
The second commit fixes this issue by properly trimming `row_ranges`.
The second fixed problem is related to setting the `paging_state` to `internal_options`. It was improperly set to the value just after reading from index, making the base query start from invalid `paging_state`.
The second commit fixes this issue by setting the `paging_state` after both index and base table queries are done. Moreover, the `paging_state` is now set based on `paging_state` of index query and the results of base table query (as base query can return more rows than index query).
The second commit fixes the first two failing examples from issue #7355.
Tests (fourth commit)
Extensively tests queries on tables with secondary indices with aggregates and `GROUP BY`s.
Tests three cases that are implemented in `indexed_table_select_statement::do_execute` - `partition_slices`,
`whole_partitions` and (non-`partition_slices` and non-`whole_partitions`). As some of the issues found were related to paging, the tests check scenarios where the inserted data is smaller than a page, larger than a page and larger than two pages (and some in-between page boundaries scenarios).
I found all those parameters (case of `do_execute`, number of inserted rows) to have an impact of those fixed bugs, therefore the tests validate a large number of those scenarios.
Configurable internal_paging_size (third commit)
Before this change, internal `page_size` when doing aggregate, `GROUP BY` or nonpaged filtering queries was hard-coded to `DEFAULT_COUNT_PAGE_SIZE` (10,000). This change adds new internal_paging_size variable, which is configurable by `set_internal_paging_size` and `reset_internal_paging_size` free functions. This functionality is only meant for testing purposes.
Closes#7497
* github.com:scylladb/scylla:
tests: Add secondary index aggregates tests
select_statement: Introduce internal_paging_size
select_statement: Fix paging on indexed selects
select_statement: Fix GROUP BY on indexed select
Update the toolchain to Fedora 33 with clang 11 (note the
build still uses gcc).
The image now creates a /root/.m2/repository directory; without
this the tools/jmx build fails on aarch64.
Add java-1.8.0-openjdk-devel since that is where javac lives now.
Add a JAVA8_HOME environment variable; wihtout this ant is not
able to find javac.
The toolchain is enabled for x86_64 and aarch64.
Extensively tests queries on tables with secondary indices with
aggregates and GROUP BYs. Tests three cases that are implemented
in indexed_table_select_statement::do_execute - partition_slices,
whole_partitions and (non-partition_slices and non-whole_partitions).
As some of the issues found were related to paging, the tests check
scenarios where the inserted data is smaller than a page, larger than
a page and larger than two pages (and some boundary scenarios).
Before this change, internal page_size when doing aggregate, GROUP BY
or nonpaged filtering queries was hard-coded to DEFAULT_COUNT_PAGE_SIZE.
This made testing hard (timeouts in debug build), because the tests had
to be large to test cases when there are multiple internal pages.
This change adds new internal_paging_size variable, which is
configurable by set_internal_paging_size and reset_internal_paging_size
free functions. This functionality is only meant for testing purposes.
Fixes two issues related to improper paging on indexed SELECTs. As those
two issues are closely related (fixing one without fixing the other
causes invalid results of queries), they are in a single commit.
The first issue is that when using slice.set_range, the existing
_row_ranges (which specify clustering key prefixes) are not taken into
account. This caused the wrong rows to be included in the result, as the
clustering key bound was set to a half-open range:
CREATE TABLE ks.t(a int, b int, c int, PRIMARY KEY ((a, b), c));
CREATE INDEX kst_index ON ks.t(c);
INSERT INTO ks.t(a, b, c) VALUES (1, 2, 3);
INSERT INTO ks.t(a, b, c) VALUES (1, 2, 4);
INSERT INTO ks.t(a, b, c) VALUES (1, 2, 5);
SELECT COUNT(*) FROM ks.t WHERE c = 3;
count
-------
2
This change fixes this issue by properly trimming row_ranges.
The second fixed problem is related to setting the paging_state
to internal_options. It was improperly set just after reading from
index, making the base query start from invalid paging_state.
This change fixes this issue by setting the paging_state after both
index and base table queries are done. Moreover, the paging_state is
now set based on paging_state of index query and the results of base
table query (as base query can return more rows than index query).
Fixes the first two failing examples from issue #7355.
Before the change, GROUP BY SELECTs with some WHERE restrictions on an
indexed column would return invalid results (same grouped column values
appearing multiple times):
CREATE TABLE ks.t(pk int, ck int, v int, PRIMARY KEY(pk, ck));
CREATE INDEX ks_t on ks.t(v);
INSERT INTO ks.t(pk, ck, v) VALUES (1, 2, 3);
INSERT INTO ks.t(pk, ck, v) VALUES (1, 4, 3);
SELECT pk FROM ks.t WHERE v=3 GROUP BY pk;
pk
----
1
1
This is fixed by correctly passing _group_by_cell_indices to
result_set_builder. Fixes the third failing example from issue #7355.
This is the continuation of 30722b8c8e, so let me re-cite Rafael:
The constructors of these global variables can allocate memory. Since
the variables are thread_local, they are initialized at first use.
There is nothing we can do if these allocations fail, so use
disable_failure_guard.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20201028140553.21709-1-xemul@scylladb.com>
The future of the fiber that writes data into sstables inside
the repair_writer is stored in _writer_done like below:
class repair_writer {
_writer_done[node_idx] =
mutation_writer::distribute_reader_and_consume_on_shards().then([this] {
...
}).handle_exception([this] {
...
});
}
The fiber access repair_writer object in the error handling path. We
wait for the _writer_done to finish before we destroy repair_meta
object which contains the repair_writer object to avoid the fiber
accessing already freed repair_writer object.
To be safer, we can make repair_writer a shared pointer and take a
reference in the distribute_reader_and_consume_on_shards code path.
Fixes#7406Closes#7430
LD_PRELOAD libraries usually have dependencies in the host system,
which they will not have access to in a relocatable environment
since we use a different libc. Detect that LD_PRELOAD is in use and if
so, abort with an error.
Fixes#7493.
Closes#7494
Clang complains if it sees linker-only flags when called for compilation,
so move the compile-time flags from cxx_ld_flags to cxxflags, and remove
cxx_ld_flags from the compiler command line.
The linker flags are also passed to Seastar so that the build-id and
interpreter hacks still apply to iotune.
Closes#7466
python3 has its own relocatable package, no need to include it
in scylla-package.tar.gz.
Python has its own relocatable package, so packaging it in scylla-package.ta
Closes#7467
We already have a test for the behavior of a closed shard and how
iterators previously created for it are still valid. In this patch
we add to this also checking that the shard id itself, not just the
iterator, is still valid.
Additionally, although the aforementioned test used a disabled stream
to create a closed shard, it was not a complete test for the behavior
of a disabled stream, and this patch adds such a test. We check that
although the stream is disabled, it is still fully usable (for 24 hours) -
its original ARN is still listed on ListStreams, the ARN is still usable,
its shards can be listed, all are marked as closed but still fully readable.
Both tests pass on DynamoDB, and xfail on Alternator because of
issue #7239 - CDC drops the CDC log table as soon as CDC is disabled,
so the stream data is lost immediately instead of being retained for
24 hours.
Refs #7239
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201006183915.434055-1-nyh@scylladb.com>
This patch fills the following columns in `system.clients` table:
* `connection_stage`
* `driver_name`
* `driver_version`
* `protocol_version`
It also improves:
* `client_type` - distinguishes cql from thrift just in case
* `username` - now it displays correct username iff `PasswordAuthenticator` is configured.
What is still missing:
* SSL params (I'll happily get some advice here)
* `hostname` - I didn't find it in tested drivers
Refs #6946Closes#7349
* github.com:scylladb/scylla:
transport: Update `connection_stage` in `system.clients`
transport: Retrieve driver's name and version from STARTUP message
transport: Notify `system.clients` about "protocol_version"
transport: On successful authentication add `username` to system.clients
This patch change the code that iterates over the metrics to use a copy
of the metrics names to make it safe to remove the metrics from the
metrics object.
Fixes#7488
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Allows QA to bypass the normal hardcoded 24h ttl of data and still
get "proper" behaviour w.r.t. available stream set/generations.
I.e. can manually change cdc ttl option for alternator table after
streams enabled. Should not be exposed, but perhaps useful for
testing.
Closes#7483
Refs #7364
The number of tombstones can be large. As a stopgap measure to
just returning a source range (with keepalive), we can at least
alleviate the problem by using a chunked vector.
Closes#7433
Fixes#7435
Adds an "eor" (end-of-record) column to cdc log. This is non-null only on
last-in-timestamp group rows, i.e. end of a singular source "event".
A client can use this as a shortcut to knowing whether or not he has a
full cdc "record" for a given source mutation (single row change).
Closes#7436
In bd6855e, we reverted to Boost ranges and commented out the concept
check. But Boost has its own concept check, which this patch enables.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes#7471
Currently, we linearize large UTF8 cells in order to validate them.
This can cause large latency spikes if the cell is large.
This series changes UTF8 validation to work on fragmented buffers.
This is somewhat tricky since the validation routines are optimized
for single-instruction-multiple-data (SIMD) architectures.
The unit tests are expanded to cover the new functionality.
Fixes#7448.
Closes#7449
* github.com:scylladb/scylla:
types: don't linearize utf8 for validation
test: utf8: add fragmented buffer validation tests
utils: utf8: add function to validate fragmented buffers
utils: utf8: expose validate_partial() in a header
utils: utf8: introduce validate_partial()
utils: utf8: extract a function to evaluate a single codepoint
This PR purpose is to handle schema integrity issues that can arise in races involving materialized views.
The possibility of such integrity issues was found in #7420 , where a view schema was used for reading without
it's _base_info member initialized resulting in a segfault.
We handle this doing 3 things:
1. First guard against using an uninitialized base info - this will be considered as an internal error as it will indicate that
there is a path in our code that creates a view schema to be used for reads or writes but is not initializing the base info.
2. We allow the base info to be initialized also from partially matching base (most likely a newer one that this used to create the view).
3. We fix the suspected path that create such a view schema to initialize it. (in migration manager)
It is worth mentioning that this PR is a workaround to a probable design flaw in our materialized views which requires the base
table's information to be retrieved in the first place instead of just being self contained.
Refs #7420Closes#7469
* github.com:scylladb/scylla:
materialized views: add a base table reference if missing
view info: support partial match between base and view for only reading from view.
view info: guard against null dereference of the base info
schema pointers can be obtained from two distinct entities,
one is the database, those schema are obtained from the table
objects and the other is from the schema registry.
When a schema or a new schema is attached to a table object that
represents a base table for views, all of the corresponding attached
view schemas are guarantied to have their base info in sync.
However if an older schema is inserted into the registry by the
migratrion manager i.e loaded from other node, it will be
missing this info.
This becomes a problem when this schema is published through the
schema registry as it can be obtained for an obsolete read command
for example and then eventually cause a segmentation fault by null
dereferencing the _base_info ptr.
Refs #7420
only reading from view.
The current implementation of materialized views does
no keep the version to which a specific version of materialized
view schema corresponds to. This complicate things especially on
old views versions that the schema doesn't support anymore. However,
the views, being also an independent table should allow reading from
them as long as they exist even if the base table changed since then.
For the reading purpose, we don't need to know the exact composition
of view primary key columns that are not part of the base primary
key, we only need to know that there are any, and this is a much
looser constrain on the schema.
We can rely on a table invariants such as the fact that pk columns are
not going to disappear on newer version of the table.
This means that if we don't find a view column in the base table, it is
not a part of the base table primary key.
This information is enough for us to perform read on the view.
This commit adds support for being able to rely on such partial
information along with a validation that it is not going to be used for
writes. If it is, we simply abort since this means that our schema
integrity is compromised.
The change's purpose is to guard against segfault that is the
result of dereferencing the _base_info member when it is
uninitialized. We already know this can happen (#7420).
The only purpose of this change is to treat this condition as
an internal error, the reason is that it indicates a schema integrity
problem.
Besides this change, other measures should be taken to ensure that
the _base_table member is initialized before calling methods that
rely on it.
We call the internal_error as a last resort.
Use the new non-linearizing validator, avoiding linearization.
Linearization can cause large contiguous memory allocations, which
in turn causes latency spikes.
Fixes#7448.
Since there are a huge number of variations, we use random
testing. Each test case is composed of a random number of valid
code points, with a possible invalid code point somehwere. The
test case is broken up into a random number of fragments. We
test both validation success and error position indicator.
Add a function to validate fragmented buffers. We validate
each buffer with SIMD-optimized validate_partial(), then
collect the codepoint that spans buffer boundaries (if any)
in a temporary buffer, validate that too, and continue.
The current validators expect the buffer to contain a full
UTF-8 string. This won't be the case for fragmented buffers,
since a codepoint can straddle two (or more) buffers.
To prepare for that, convert the existing validators to
validate_partial(), which returns either an error, or
success with an indication of the size of the tail that
was not validated and now many bytes it is missing.
This is natural since the SIMD validators already
cannot process a tail in SIMD mode if it's smaller than
the vector size, so only minor rearrangements are needed.
In addition, we now have validate_partial() for non-SIMD
architectures, since we'll need it for fragmented buffer
validation.
Our SIMD optimized validators cannot process a codepoint that
spans multiple buffers, and adapting them to be able to will slow
them down. So our strategy is to special-case any codepoint that
spans two buffers.
To do that, extract an evaluate_codepoint() function from the
current validate_naive() function. It returns three values:
- if a codepoint was successfully decoded from the buffer,
how many bytes were consumed
- if not enough bytes were in the buffer, how many more
are needed
- otherwise, an error happened, so return an indication
The new function uses a table to calculate a codepoint's
size from its first byte, similar to the SIMD variants.
validate_naive() is now implemented in terms of
evaluate_codepoint().
This is a regression caused by aebd965f0.
After the sstable_directory changes, resharding now waits for all sstables
to be exhausted before releasing reference to them, which prevents their
resources like disk space and fd from being released. Let's restore the
old behavior of incrementally releasing resources, reducing the space
requirement significantly.
Fixes#7463.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20201020140939.118787-1-raphaelsc@scylladb.com>
... type deduction and parenthesized aggregate construction' from Avi Kivity
Clang does not implement P0960R3 and P1816R0, so constructions
of aggregates (structs with no constructors) have to used braced initialization
and cannot use class template argument deduction. This series makes the
adjustments.
Closes#7456
* github.com:scylladb/scylla:
reader_concurrency_semaphore: adjust permit_summary construction for clang
schema_tables: adjust altered_schema construction for clang
types: adjust validation_visitor construction for clang
Makes files shorter while still keeping the lines under 120 columns.
Separate from other commits to make review easier.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Don't require filtering when a continuous slice of the clustering key
is requested, even if partition is unrestricted. The read command we
generate will fetch just the selected data; filtering is unnecessary.
Some tests needed to update the expected results now that we're not
fetching the extra data needed for filtering. (Because tests don't do
the final trim to match selectors and assert instead on all the data
read.)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Clang does not implement P0960R3, parenthesized initialization of
aggregates, so we have to use brace initialization in
permit_summary.
As the parenthesized constructor call is done by emplace_back(),
we have to do the braced call ourselves.
Clang does not implement P0960R3, parenthesized initialization of
aggregates, so we have to use brace initialization in
altered_schema.
As the parenthesized constructor call is done by emplace_back(),
we have to do the braced call ourselves.
Clang does not implement P0960R3, parenthesized initialization of
aggregates, so we have to use brace initialization in
validation_visitor. It also does not implement class template
argument deduction for aggregates (P1816r0), so we have to
specify the template parameters explicity.
Clang has trouble compiling libstdc++'s `<ranges>`. It is not known whether
the problem is in clang or in libstdc++; I filed bugs for both [1] [2].
Meanwhile, we wish to use clang to gain working coroutine support,
so drop the failing uses of `<ranges>`. Luckily the changes are simple.
[1] https://bugs.llvm.org/show_bug.cgi?id=47509
[2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97120Closes#7450
* github.com:scylladb/scylla:
test: view_build_test: drop <ranges>
test: mutation_reader_test: drop <ranges>
cql3: expression: drop <ranges>
sstables: leveled_compaction_strategy: drop use of <ranges>
utils: to_range(): relax constraint
The check was added to support migration from schema tables format v2
to v3. It was needed to handle the rolling upgrade from 2.x to 3.x
scylla version. Old nodes wouldn't recognize new schema mutations, so
the pull handler in v3 was changed to ignore requests from v2 nodes based
on their advertised SCHEMA_TABLES_VERSION gossip state.
This started to cause problems after 3b1ff90 (get rid of the seed
concept). The bootstrapping node sometimes would hang during boot
unable to reach schema agreement.
It's relevant that gossip exchanges about new nodes are unidirectional
(refs #2862).
It's also relevant that pulls are edge-triggered only (refs #7426).
If the bootstrapping node (A) is listed as a seed in one of the
existing node's (B) configuration then node A can be contacted before
it contacts node B. Node A may then send schema pull request to node B
before it learns about node A, and node B will assume it's an old node
and give an empty response. As a result, node A will end up with an
old schema.
The fix is to drop the check so that pull handler always responds with
the schema. We don't support upgrades from nodes using v2 schema
tables format anymore so this should be safe.
Fixes#7396
Tests:
- manual (ccm)
- unit (dev)
Message-Id: <1602612578-21258-1-git-send-email-tgrabiec@scylladb.com>
The input range to utils::to_range() should be indeed a range,
but clang has trouble compiling <ranges> which causes it to fail.
Relax the constraint until this is fixed.
Clang does not yet implement p1091r3, which allows lambdas
to capture structured bindings. To accomodate it, don't
use structured bindings for variables that are later
captured.
Hopefully, most of these lambda captures will be replaces with
coroutines.
Closes#7445
* github.com:scylladb/scylla:
test: mutation_reader_test: don't capture structured bindings in lambdas
api: column_family: don't capture structured bindings in lambdas
thrift: don't capture structured bindings in lambdas
test: partition_data_test: don't capture structured bindings in lambdas
test: querier_cache_test: don't capture structured bindings in lambdas
test: mutation_test: don't capture structured bindings in lambdas
storage_proxy: don't capture structured bindings in lambdas
db: hints/manager: don't capture structured bindings in lambdas
db: commitlog_replayer: don't capture structured bindings in lambdas
cql3: select_statement: don't capture structured bindings in lambdas
cql3: statement_restrictions: don't capture structured bindings in lambdas
cdc: log: don't capture structured bindings in lambdas
Clang does not yet implement p1091r3, which allows lambdas
to capture structured bindings. To accomodate it, don't
use structured bindings for variables that are later
captured.
Clang does not yet implement p1091r3, which allows lambdas
to capture structured bindings. To accomodate it, don't
use structured bindings for variables that are later
captured.
Clang does not yet implement p1091r3, which allows lambdas
to capture structured bindings. To accomodate it, don't
use structured bindings for variables that are later
captured.
Clang does not yet implement p1091r3, which allows lambdas
to capture structured bindings. To accomodate it, don't
use structured bindings for variables that are later
captured.
Clang does not yet implement p1091r3, which allows lambdas
to capture structured bindings. To accomodate it, don't
use structured bindings for variables that are later
captured.
Clang does not yet implement p1091r3, which allows lambdas
to capture structured bindings. To accomodate it, don't
use structured bindings for variables that are later
captured.
Clang does not yet implement p1091r3, which allows lambdas
to capture structured bindings. To accomodate it, don't
use structured bindings for variables that are later
captured.
Clang does not yet implement p1091r3, which allows lambdas
to capture structured bindings. To accomodate it, don't
use structured bindings for variables that are later
captured.
Clang does not yet implement p1091r3, which allows lambdas
to capture structured bindings. To accomodate it, don't
use structured bindings for variables that are later
captured.
Clang does not yet implement p1091r3, which allows lambdas
to capture structured bindings. To accomodate it, don't
use structured bindings for variables that are later
captured.
Clang does not yet implement p1091r3, which allows lambdas
to capture structured bindings. To accomodate it, don't
use structured bindings for variables that are later
captured.
Clang does not yet implement p1091r3, which allows lambdas
to capture structured bindings. To accomodate it, don't
use structured bindings for variables that are later
captured.
"
Focusing on different aspects of debugging Scylla. Also expand some of
the existing segments and fix some small issues around the document.
"
* 'debugging.md-advanced-guides/v1' of https://github.com/denesb/scylla:
docs/debugging.md: add thematic debugging guides
docs/debugging.md: tips and tricks: add section about optimized-out variables
docs/debugging.md: TLS variables: add missing $ to terminal command
docs/debugging.md: TUI: describe how to switch between windows
docs/debugging.md: troubleshooting: expand on crash on backtrace
Before this change, invalid query exception on selects with both normal
and token restrictions was only thrown when token restriction was after
normal restriction.
This change adds proper validation when token restriction is before normal restriction.
**Before the change - does not return error in last query; returns wrong results:**
```
cqlsh> CREATE TABLE ks.t(pk int, PRIMARY KEY(pk));
cqlsh> INSERT INTO ks.t(pk) VALUES (1);
cqlsh> INSERT INTO ks.t(pk) VALUES (2);
cqlsh> INSERT INTO ks.t(pk) VALUES (3);
cqlsh> INSERT INTO ks.t(pk) VALUES (4);
cqlsh> SELECT pk, token(pk) FROM ks.t WHERE pk = 2 AND token(pk) > 0;
InvalidRequest: Error from server: code=2200 [Invalid query] message="Columns "ColumnDefinition{name=pk, type=org.apache.cassandra.db.marshal.Int32Type, kind=PARTITION_KEY, componentIndex=0, droppedAt=-9223372036854775808}" cannot be restricted by both a normal relation and a token relation"
cqlsh> SELECT pk, token(pk) FROM ks.t WHERE token(pk) > 0 AND pk = 2;
pk | system.token(pk)
----+---------------------
3 | 9010454139840013625
(1 rows)
```
Closes#7441
* github.com:scylladb/scylla:
tests: Add token and non-token conjunction tests
token_restriction: Add non-token merge exception
This patch introduces a new system table: `system.scylla_table_schema_history`,
which is used to keep track of column mappings for obsolete table
schema versions (i.e. schema becomes obsolete when it's being changed
by means of `CREATE TABLE` or `ALTER TABLE` DDL operations).
It is populated automatically when a new schema version is being
pulled from a remote in get_schema_definition() at migration_manager.cc
and also when schema change is being propagated to system schema tables
in do_merge_schema() at schema_tables.cc.
The data referring to the most recent table schema version is always
present. Other entries are garbage-collected when the corresponding
table schema version is obsoleted (they will be updated with a TTL equal
to `DEFAULT_GC_GRACE_SECONDS` on `ALTER TABLE`).
In case we failed to persist column mapping after a schema change,
missing entries will be recreated on node boot.
Later, the information from this table is used in `paxos_state::learn`
callback in case we have a mismatch between the most recent schema
version and the one that is stored inside the `frozen_mutation`
for the accepted proposal.
Such situation may arise under following circumstances:
1. The previous LWT operation crashed on the "accept" stage,
leaving behind a stale accepted proposal, which waits to be
repaired.
2. The table affected by LWT operation is being altered, so that
schema version is now different. Stored proposal now references
obsolete schema.
3. LWT query is retried, so that Scylla tries to repair the
unfinished Paxos round and apply the mutation in the learn stage.
When such mismatch happens, prior to that patch the stored
`frozen_mutation` is able to be applied only if we are lucky enough
and column_mapping in the mutation is "compatible" with the new
table schema.
It wouldn't work if, for example, the columns are reordered, or
some columns, which are referenced by an LWT query, are dropped.
With this patch we try to look up the column mapping for
the obsolete schema version, then upgrade the stored mutation
using obtained column mapping and apply an upgraded mutation instead.
* git@github.com:ManManson/scylla.git feature/table_schema_history_v7:
lwt: add column_mapping history persistence tests
schema: add equality operator for `column_mapping` class
lwt: store column_mapping's for each table schema version upon a DDL change
schema_tables: extract `fill_column_info` helper
frozen_mutation: introduce `unfreeze_upgrading` method
There are two basic tests, which:
* Test that column mappings are serialized and deserialized
properly on both CREATE TABLE and ALTER TABLE
* Column mappings for obsoleted schema versions are updated
with a TTL value on schema change
Tests: unit(dev)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Add a comparator for column mappings that will be used later
in unit-tests to check whether two column mappings match or not.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
This patch introduces a new system table: `system.scylla_table_schema_history`,
which is used to keep track of column mappings for obsolete table
schema versions (i.e. schema becomes obsolete when it's being changed
by means of `CREATE TABLE` or `ALTER TABLE` DDL operations).
It is populated automatically when a new schema version is being
pulled from a remote in get_schema_definition() at migration_manager.cc
and also when schema change is being propagated to system schema tables
in do_merge_schema() at schema_tables.cc.
The data referring to the most recent table schema version is always
present. Other entries are garbage-collected when the corresponding
table schema version is obsoleted (they will be updated with a TTL equal
to `DEFAULT_GC_GRACE_SECONDS` on `ALTER TABLE`).
In case we failed to persist column mapping after a schema change,
missing entries will be recreated on node boot.
Later, the information from this table is used in `paxos_state::learn`
callback in case we have a mismatch between the most recent schema
version and the one that is stored inside the `frozen_mutation`
for the accepted proposal.
Such situation may arise under following circumstances:
1. The previous LWT operation crashed on the "accept" stage,
leaving behind a stale accepted proposal, which waits to be
repaired.
2. The table affected by LWT operation is being altered, so that
schema version is now different. Stored proposal now references
obsolete schema.
3. LWT query is retried, so that Scylla tries to repair the
unfinished Paxos round and apply the mutation in the learn stage.
When such mismatch happens, prior to that patch the stored
`frozen_mutation` is able to be applied only if we are lucky enough
and column_mapping in the mutation is "compatible" with the new
table schema.
It wouldn't work if, for example, the columns are reordered, or
some columns, which are referenced by an LWT query, are dropped.
With this patch we try to look up the column mapping for
the obsolete schema version, then upgrade the stored mutation
using obtained column mapping and apply an upgraded mutation instead.
In case we don't find a column_mapping we just return an error
from the learn stage.
Tests: unit(dev, debug), dtests(paxos_tests.py:TestPaxos.schema_mismatch_*_test)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Now that scylla-ccm and scylla-dtest conform to PEP-440
version comparison (See https://www.python.org/dev/peps/pep-0440/)
we can safely change scylla version on master to be the development
branch for the next release.
The version order logic is:
4.3.dev is followed by
4.3.rc[i] followed by
4.3.[n]
Note that also according to
https://blog.jasonantman.com/2014/07/how-yum-and-rpm-compare-versions/
4.3.dev < 4.3.rc[i] < 4.3.[n]
as "dev" < "rc" by alphabetical order
and both "dev" and "rc*" < any number, based on the general
rule that alphabetical strings compare as less than numbers.
Test: unit
Dtest: gating
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201015151153.726637-1-bhalevy@scylladb.com>
Commit bc65659a46 adjusted the inlining parameters for gcc. Here
we do the same for clang. With this adjustement, clang lags gcc
by 3% in throughput (perf_simple_query --smp 1) compared to 20%
without it.
The value 2500 was derived by binary search. At 5000 compilation
of storage_proxy never completes, at 1250 throughput is down by
10%.
Closes#7418
Support snapshotting for raft. The patch series only concerns itself
with raft logic, not how a specific state machine implements
take_snapshot() callback.
* scylla-dev/raft-snapshots-v2:
raft: test: add tests for snapshot functionality
raft: preserve trailing raft log entries during snapshotting
raft: implement periodic snapshotting of a state machine
raft: add snapshot transfer logic
Checks for invalid_request_exception in case of trying to run a query
with both normal and token relations. Tests both orderings of those
relations (normal or token relation first).
Add exception that is thrown when merging of token and non-token
restrictions is attempted. Before this change only merging non-token
and token restriction was validated (WHERE pk = 0 AND token(pk) > 0)
and not the other way (WHERE token(pk) > 0 AND pk = 0).
This patch allows to leave snapshot_trailing amount of entries
when a state machine is snapshotted and raft log entries are dropped.
Those entries can be used to catch up nodes that are slow without
requiring snapshot transfer. The value is part of the configuration
and can be changed.
The patch implements periodic taking of a snapshot and trimming of
the raft log.
In raft the only way the log of already committed entries can be shorten
is by taking a snapshot of the state machine and dropping log entries
included in the snapshot from the raft log. To not let log to grow too
large the patch takes the snapshot periodically after applying N number
of entries where N can be configured by setting snapshot_threshold
value in raft's configuration.
This patch adds the logic that detects that a follower misses data from
a snapshot and initiate snapshot transfer in that case. Upon receiving
the snapshot the follower stores it locally and applies it to its state
machine. The code assumes that the snapshot is already exists on a
leader.
"
This series cleans up the legacy and common ssatble writer code.
metadata_collector::_ancestors were moved to class sstable
so that the former can be moved out of sstable into file_writer_impl.
Moved setting of replay position and sstable level via
sstable_writer_config so that compaction won't need to access
the metadata_collector via the sstable.
With that, metadata_collector could be moved from class sstable
to sstable_writer::writer_impl along with the column_stats.
That allowed moved "generic" file_writer methods that were actually
k/l format specific into sstable_writer_k_l.
Eventually `file_writer` code is moved into sstables/writer.cc
and sstable_writer_k_l into sstables/kl/writer.{hh,cc}
A bonus cleanup is the ability to get rid of
sstable::_correctly_serialize_non_compound_range_tombstones as
it's now available to the writers via the writer configuration
and not required to be stored in the sstable object.
Fixes#3012
Test: unit(dev)
"
* tag 'cleanup-sstable-writer-v2' of github.com:bhalevy/scylla:
sstables: move writer code away to writer.cc
sstables: move sstable_writer_k_l away to kl/writer
sstables: get rid of sstable::_correctly_serialize_non_compound_range_tombstones
sstables: move writer methods to sstable_writer_k_l
sstables: move compaction ancestors to sstable
sstables: sstable_writer: optionally set sstable level via config
sstables: sstable_writer: optionally set replay position via config
sstables: compaction: make_sstable_writer_config
sstables: open code update_stats_on_end_of_stream in sstable_writer::consume_end_of_stream
sstables: fold components_writer into sstable_writer_k_l
sstables: move sstable_writer_k_l definition upwards
sstables: components_writer: turn _index into unique_ptr
Now it's available to the writers via the writer configuration
and not required to be stored in the sstable object.
Refs #3012
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
They are called solely from the sstable_writer_k_l path.
With that, moce the metadata collector and column stats
to writer_impl. They are now only used by the sstable
writers.
Refs #3012
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Compaction needs access to the sstable's ancestors so we need to
keep the ancestors for the sstable separately from the metadata collector
as the latter is about to be moved to the sstable writer.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And use compaction::make_sstable_writer_config to pass
the compaction's `_sstable_level` to the writer
via sstable_writer_config, instead of via the sstable
metadata_collector, that is going to move from the sstable
to the write_impl.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And use compaction::make_sstable_writer_config to pass
the compaction's replay_position (`_rp`) to the writer
via sstable_writer_config, instead of via the sstable
metadata_collector, that is going to move from the sstable
to the write_impl.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When Alternator is enabled over HTTPS - by setting the
"alternator_https_port" option - it needs to know some SSL-related options,
most importantly where to pick up the certificate and key.
Before this patch, we used the "server_encryption_options" option for that.
However, this was a mistake: Although it sounds like these are the "server's
options", in fact prior to Alternator this option was only used when
communicating with other servers - i.e., connections between Scylla nodes.
For CQL connections with the client, we used a different option -
"client_encryption_options".
This patch introduces a third option "alternator_encryption_options", which
controls only Alternator's HTTPS server. Making it separate from the
existing CQL "client_encryption_options" allows both Alternator and CQL to
be active at the same time but with different certificates (if the user
so wishes).
For backward compatibility, we temporarily continue to allow
server_encryption_options to control the Alternator HTTPS server if
alternator_encryption_options is not specified. However, this generates
a warning in the log, urging the user to switch. This temporary workaround
should be removed in a future version.
This patch also:
1. fixes the test run code (which has an "--https" option to test over
https) to use the new name of the option.
2. Adds documentation of the new option in alternator.md and protocols.md -
previously the information on how to control the location of the
certificate was missing from these documents.
Fixes#7204.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200930123027.213587-1-nyh@scylladb.com>
The wording on how CQL with SSL is configured was ambigous. Clarify the
text to explain that by default, it is *disabled*. We recommend to enable
it on port 9142 - but it's not a "default".
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201014144938.653311-1-nyh@scylladb.com>
Consolidate the code to make the sstable_writer_config
for sstable writers into a helper method.
Folowing patches will add the ability to set the
replay position and sstable level via that config
structure.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To facilitate consolidation of components_writer and
some sstable methods into sstable_writer_k_l.
Refs #3012.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The patch adds a generic pretty-printer for `nonwrapping_interval<>`
(and `nonwrapping_range<>`) and a specific one for `dht::ring_position`.
Adding support to clustering and partition ranges is just one more step.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20201014054013.606141-1-bdenes@scylladb.com>
Fixes#7424
AWS sdk (kinesis) assumes SequenceNumbers are monotonically
growing bigints. Since we sort on and use timeuuids are these
a "raw" bit representation of this will _not_ fulfill the
requirement. However, we can "unwrap" the timestamp of uuid
msb and give the value as timestamp<<64|lsb, which will
ensure sort order == bigint order.
Fixes#7409
AWS kinesis Java sdk requires/expects shards to be reported in
lexical order, and even worse, ignores lastevalshard. Thus not
upholding said order will break their stream intropection badly.
Added asserts to unit tests.
v2:
* Added more comments
* use unsigned_cmp
* unconditional check in streams_test
* seastar 35c255dcd...6973080cd (2):
> Merge "memory: improve memory diagnostics dumped on allocation failures" from Botond
> map_reduce: use get0 rather than get
Tests that the exceptional future returned by the serialized action
is propagated to trigger, reproducing #7352.
The test fails without the previoud patch:
"serialized_action: trigger: include also semaphore status to promise"
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, the serialized_action error is set to a shared_promise,
but is not returned to the caller, unless there is an
already outstanding action.
Note that setting the exception to the promise when noone
collected it via the shared_future caused 'Exceptional future ignored'
warning to be issued, as seen in #7352.
Fixes#7352
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, if `with_semaphore` returns exceptional future, it is not
propagated to the promise, and other waiters that got a shared
future will not see that.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
"
The reader concurrency semaphore timing out or its queue being overflown
are fairly common events both in production and in testing. At the same
time it is a hard to diagnose problem that often has a benign cause
(especially during testing), but it is equally possible that it points
to something serious. So when this error starts to appear in logs,
usually we want to investigate and the investigation is lengthy...
either involves looking at metrics or coredumps or both.
This patch intends to jumpstart this process by dumping a diagnostics on
semaphore timeout or queue overflow. The diagnostics is printed to the
log with debug level to avoid excessive spamming. It contains a
histogram of all the permits associated with the problematic semaphore
organized by table, operation and state.
Example:
DEBUG 2020-10-08 17:05:26,115 [shard 0] reader_concurrency_semaphore -
Semaphore _read_concurrency_sem: timed out, dumping permit diagnostics:
Permits with state admitted, sorted by memory
memory count name
3499M 27 ks.test:data-query
3499M 27 total
Permits with state waiting, sorted by count
count memory name
1 0B ks.test:drain
7650 0B ks.test:data-query
7651 0B total
Permits with state registered, sorted by count
count memory name
0 0B total
Total: permits: 7678, memory: 3499M
This allows determining several things at glance:
* What are the tables involved
* What are the operations involved
* Where is the memory
This can speed up a follow-up investigation greatly, or it can even be
enough on its own to determine that the issue is benign.
Tests: unit(dev, debug)
"
* 'dump-diagnostics-on-semaphore-timeout/v2' of https://github.com/denesb/scylla:
reader_concurrency_semaphore: dump permit diagnostics on timeout or queue overflow
utils: add to_hr_size()
reader_concurrency_semaphore: link permits into an intrusive list
reader_concurrency_semaphore: move expiry_handler::operator()() out-of-line
reader_concurrency_semaphore: move constructors out-of-line
reader_concurrency_semaphore: add state to permits
reader_concurrency_semaphore: name permits
querier_cache_test: test_immediate_evict_on_insert: use two permits
multishard_combining_reader: reader_lifecycle_policy: add permit param to create_reader()
multishard_combining_reader: add permit parameter
multishard_combining_reader: shard_reader: use multishard reader's permit
The reader concurrency semaphore timing out or its queue being overflown
are fairly common events both in production and in testing. At the same
time it is a hard to diagnose problem that often has a benign cause
(especially during testing), but it is equally possible that it points
to something serious. So when this error starts to appear in logs,
usually we want to investigate and the investigation is lengthy...
either involves looking at metrics or coredumps or both.
This patch intends to jumpstart this process by dumping a diagnostics on
semaphore timeout or queue overflow. The diagnostics is printed to the
log with debug level to avoid excessive spamming. It contains a
histogram of all the permits associated with the problematic semaphore
organized by table, operation and state.
Example:
DEBUG 2020-10-08 17:05:26,115 [shard 0] reader_concurrency_semaphore -
Semaphore _read_concurrency_sem: timed out, dumping permit diagnostics:
Permits with state admitted, sorted by memory
memory count name
3499M 27 ks.test:data-query
3499M 27 total
Permits with state waiting, sorted by count
count memory name
1 0B ks.test:drain
7650 0B ks.test:data-query
7651 0B total
Permits with state registered, sorted by count
count memory name
0 0B total
Total: permits: 7678, memory: 3499M
This allows determining several things at glance:
* What are the tables involved
* What are the operations involved
* Where is the memory
This can speed up a follow-up investigation greatly, or it can even be
enough on its own to determine that the issue is benign.
This utility function converts a potentially large number to a compact
representation, composed of at most 4 digits and a letter appropriate to
the power of two the number has to multiplied with to arrive to the
original number (with some loss of precision).
The different powers of two are the conventional 2 ** (N * 10) variants:
* N=0: (B)ytes
* N=1: (K)bytes
* N=2: (M)bytes
* N=3: (G)bytes
* N=4: (T)bytes
Examples:
* 87665 will be converted to 87K
* 1024 will be converted to 1K
Instead of a simple boolean, designating whether the permit was already
admitted or not, add a proper state field with a value for all the
different states the permit can be in. Currently there are three such
states:
* registered - the permit was created and started accounting resource
consumption.
* waiting - the permit was queued to wait for admission.
* admitted - the permit was successfully admitted.
The state will be used for debugging purposes, both during coredump
debugging as well as for dumping diagnostics data about permits.
Require a schema and an operation name to be given to each permit when
created. The schema is of the table the read is executed against, and
the operation name, which is some name identifying the operation the
permit is part of. Ideally this should be different for each site the
permit is created at, to be able to discern not only different kind of
reads, but different code paths the read took.
As not all read can be associated with one schema, the schema is allowed
to be null.
The name will be used for debugging purposes, both for coredump
debugging and runtime logging of permit-related diagnostics.
scylla-python3 causes segfault when non-default locale specified.
As workaround for this, we need to set LC_ALL=en_US.UTF_8 on python3 thunk.
Fixes#7408Closes#7414
Make the warning message clearer:
* Include the number of partitions affected by the batch.
* Be clear that the warning is about the batch size in bytes.
Fixes#7367
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Closes#7417
Without this, clang complains that we violate argument dependent
lookup rules:
note: 'operator<<' should be declared prior to the call site or in namespace 'sstables'
std::ostream& operator<<(std::ostream&, const sstables::component_type&);
we can't enforce the #include order, but we can easily move it it
to namespace sstables (where it belongs anyway), so let's do that.
gcc is happy either way.
Closes#7413
bound_kind_m's formatter violates argument dependent lookup rules
according to clang, so fix that. Along the way improve the formatter a
little.
Closes#7412
* git://github.com/avikivity/scylla.git avikivity-bound_kind_m-formatter:
sstables: move bound_kind_m formatter to namespace sstables
sstables: move bound_kind_m formatter to its natural place
sstables: deinline bound_kind_m formatter
Without this, clang complains that we violate argument dependent
lookup rules:
note: 'operator<<' should be declared prior to the call site or in namespace 'sstables'
std::ostream& operator<<(std::ostream&, const sstables::bound_kind_m&);
we can't enforce the #include order, but we can easily move it it
to namespace sstables (where it belongs anyway), so let's do that.
gcc is happy either way.
Move bound_kind_m's formatter to the same header file where
is is defined. This prevents cases where the compiler decays
the type (an enum) to the underlying integral type because it
does not see the formatter declaration, resulting in the wrong
output.
When there are hint files to be sent and the target endpoint is DOWN,
end_point_hints_manager works in the following loop:
- It reads the first hint file in the queue,
- For each hint in the file it decides that it won't be sent because the
target endpoint is DOWN,
- After realizing that there are some unsent hints, it decides to retry
this operation after sleeping 1 second.
This causes the first segment to be wholly read over and over again,
with 1 second pauses, until the target endpoint becomes UP or leaves the
cluster. This causes unnecessary I/O load in the streaming scheduling
group.
This patch adds a check which prevents end_point_hints_manager from
reading the first hint file at all when it is not allowed to send hints.
First observed in #6964
Tests:
- unit(dev)
- hinted handoff dtests
Closes#7407
The test currently uses a single permit shared between two simulated
reads (to wait admission twice). This is not a supported way of using a
permit and will stop working soon as we make the states the permit is in
more pronounced.
Allow the evictable reader managing the underlying reader to pass its
own permit to it when creating it, making sure they share the same
permit. Note that the two parts can still end up using different
permits, when the underlying reader is kept alive between two pages of a
paged read and thus keeps using the permit received on the previous
page.
Also adjust the `reader_context` in multishard_mutation_query.cc to use
the passed-in permit instead of creating a new one when creating a new
reader.
Don't create an own permit, take one as a parameter, like all other
readers do, so the permit can be provided by the higher layer, making
sure all parts of the logical read use the same permit.
Don't create a new permit per shard reader, pass down the multishard
reader's one to be used by each shard reader. They all belong to the
same read, they should use the same permit. Note that despite its name
the shard readers are the local representation of a reader living on a
remote shard and as such they live on the same shard the multishard
combining reader lives on.
Clang parses templates more eagerly than gcc, so it fails on
some forward-declared templates. In this case, value_writer
was forward-declared and then used in data::cell. As it also
uses some definitions local to data::cell, it cannot be
defined before it as well as after it.
To solve the problem, we define it as a nested class so
it can use other local definitions, yet be defined before it
is used. No code changes.
Closes#7401
A few source files (like those generated by antlr) don't build with seastar,
and so don't inherit all of its flags. They then use the compiler default dialect,
not C++20. With gcc that's just fine, since gcc supports concepts in earlier dialects,
but clang requires C++20.
Fix by forcing --std=gnu++20 for all files (same as what Seastar chooses).
Closes#7392
The remains of the defunct #7246.
Fixes#7344Fixes#7345Fixes#7346Fixes#7347
Shard ID length is now within limits.
Shard end sequence number should be set when appropriate.
Shard parent is selected a bit more carefully (sorting)
Shards are filtered by time to exclude cdc generations we cannot get data from (too old)
Shard paging improved
Closes#7348
* github.com:scylladb/scylla:
test_streams: Add some more sanity asserts
alternator::streams: Set dynamodb data TTL explicitly in cdc options
alternator::streams: Improve paging and fix parent-child calculation
alternator::streams: Remove table from shard_id
alternator::streams: Filter our cdc streams older than data/table
alternator::error: Add a few dynamo exception types
"
max_concurrent_for_each was added to seastar for replacing
sstable_directory::parallel_for_each_restricted by using
more efficient concurrency control that doesn't create
unlimited number of continuations.
The series replaces the use of sstable_directory::parallel_for_each_restricted
with max_concurrent_for_each and exposes the sstable_directory::do_for_each_sstable
via a static method.
This method is used here by table::snapshot to limit concurrency
do snapshot operations that suffer from the same unbound
concurrency problem sstable_directory solved.
In addition sstable_directory::_load_semaphore that was used
across calls to do_for_each_sstable was replaced by a static per-shard
semaphore that caps concurrency across all calls to `do_for_each_sstable`
on that shard. This makes sense since the disk is a shared resource.
In the future, we may want to have a load semaphore per device rather than
a single global one. We should experiment with that.
Test: unit(dev)
"
* tag 'max_concurrent_for_each-v5' of github.com:bhalevy/scylla:
table: snapshot: use max_concurrent_for_each
sstable_directory: use a external load_semaphore
test: sstable_directory_test: extract sstable_directory creation into with_sstable_directory
distributed_loader: process_upload_dir: use initial_sstable_loading_concurrency
sstables: sstable_directory: use max_concurrent_for_each
The build with clang fails with
ld.lld: error: undefined symbol: icu_65::Collator::createInstance(icu_65::Locale const&, UErrorCode&)
>>> referenced by like_matcher.cc
>>> build/dev/utils/like_matcher.o:(boost::re_detail_106900::icu_regex_traits_implementation::icu_regex_traits_implementation(icu_65::Locale const&))
>>> referenced by like_matcher.cc
>>> build/dev/utils/like_matcher.o:(boost::re_detail_106900::icu_regex_traits_implementation::icu_regex_traits_implementation(icu_65::Locale const&))
That symbol lives in libicui18n. It's not clear why clang fails to resolve it and gcc succeeds (after all,
both use lld as the linker) but it is easier to add the library than to attempt to figure out the
discrepancy.
Closes#7391
Clang has some difficulty with the boost::cpp_int constructor from string_view.
In fact it is a mess of enable_if<>s so a human would have trouble too.
Work around it by converting to std::string. This is bad for performance, but
this constructor is not going to be fast in any case.
Hopefully a fix will arrive in clang or boost.
Closes#7389
Following ad48d8b43c, fix a similar problem which popped up
with higher inlining thresholds in query-result-writer.hh. Since
idl/query depends on idl/keys, it must follow in definition order.
Closes#7384
We add std::distance(...) + 1 to a vector iterator, but
the vector can be empty, so we're adding a non-zero value
to nullptr, which is undefined behavior.
Rearrange to perform the limit (std::min()) before adding
to the pointer.
Found by clang's ubsan.
Closes#7377
database_test contains several instances of calling do_with_cql_test_env()
with a function that expects to be called in a thread. This mostly works
because there is an internal thread in do_with_cql_test_env(), but is not
guaranteed to.
Fix by switching to the more appropriate do_with_cql_test_env_thread().
Closes#7333
The variable 'observer' (an std::optional) may be left uninitialized
if 'incremental_enabled' is false. However, it is used afterwards
with a call to disconnect, accessing garbage.
Fix by accessing it via the optional wrapper. A call to optional::reset()
destroys the observable, which in turn calls disconnect().
Closes#7380
get_nt_build_id() constructs a pointer by adding a base and an
offset, but if the base happens to be zero, that is undefined
under C++ rules (altough legal ELF).
Fix by performing the addition on integers, and only then
casting to a pointer.
Closes#7379
libstdc++'s std::regex uses recursion[1], with a depth controlled by the
input. Together with clang's debug mode, this overflows the stack.
Use boost::regex instead, which is immune to the problem.
[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86164Closes#7378
b1e78313fe added a check for ubsan to squelch a false positive,
but that check doesn't work with clang. Relax it to check for debug
mode, so clang doesn't hit the same false positive as gcc did.
Define a SANITIZE macro so we have a reliable way to detect if
we're running with a sanitizer.
Closes#7372
d2t() scales a fraction in the range [0, 1] to the range of
a biased token (same as unsigned long). But x86 doesn't support
conversion to unsigned, only signed, so this is a truncating
conversion. Clang's ubsan correctly warns about it.
Fix by reducing the range before converting, and expanding it
afterwards.
Closes#7376
Some test (run_based_compaction_test at least) use _max_sstable_size = 0
in order to force one partition per sstable. That triggers an overflow
when calculating the expected bloom filter size. The overflow doesn't
matter for normal operation, because the result later appears on a
divisor, but does trigger a ubsan error.
Squelch the error by bot dividing by zero here.
I tried using _max_sstable_size = 1, but the test failed for other
reasons.
Closes#7375
When converting a value to its Lua representation, we choose
an integer type if it fits. If it doesn't, we fall back to a
more expensive type. So we explicitly try to trigger an overflow.
However, clang's ubsan doesn't like the overflow, and kills the
test. Tell it that the overflow is expected here.
Closes#7374
Our AVX2 implementation cannot load a partial vector,
or mask unused elements (that can be done with AVX-512/SVE2),
so it has some restrictions. Document them.
Closes#7385
Offsetting a null pointer is undefined, and clang's ubsan complains.
Rearrange the arithmetic so we never offset a null pointer. A function
is introduced for the remaining contiguous bytes so it can cast the result
to size_t, avoiding a compare-of-different-signedness warning from gcc.
Closes#7373
When the number of nanosecond digits is greater than 9, the std::pow()
expression that corrects the nanosecond value becomes infinite. This
is because sstring::length() is unsigned, and so negative values
underflow and become large.
Following Cassandra, fix by forbidding more than 9 digits of
nanosecond precision.
Found by clang's ubsan.
Closes#7371
Following 5d249a8e27, apply the same
fix for user_type_impl.
This works around https://bugs.llvm.org/show_bug.cgi?id=47747
Depending on this might be unstable, as the bug bug can show up at any
corner, but this is sufficient right now to get
test_user_function_disabled to pass.
Closes#7370
Our cql parser uses large amounts of stack, and can overflow it
in debug mode with clang. To prevent this stack overflow,
temporarily use a larger (1MB) stack.
Closes#7369
* github.com:scylladb/scylla:
cql3: use larger stack for do_with_cql_parser() in debug mode
cql3: deinline do_with_cql_parser()
Clang has a bug processing inline ifuncs with intrinsics[1].
Since ifuncs can't be inlined anyway (they are always dispatched
via a function pointer that is determined based on the CPU
features present), nothing is gained by inlining them. Deinlining
therefore reduces compile time and works around the clang bug.
[1] https://bugs.llvm.org/show_bug.cgi?id=47691Closes#7358
Our cql parser uses large amounts of stack, and can overflow it
in debug mode with clang. To prevent this stack overflow,
temporarily use a larger (1MB) stack.
We can't use seastar::thread(), since do_with_cql_parser() does
not yield. We can't use std::thread(), since lw_shared_ptr()'s
debug mode will scream murder at an lw_shared_ptr used across
threads (even though it's perfectly safe in this case). We
can't use boost::context2 since that requires the library to
be compiled with address sanitizer support, which it isn't on
Fedora. So we use a fiber switch using the getcontext() function
familty. This requires extra annotations for debu mode, which are
added.
The cql parser causes trouble with the santizers and clang,
since it consumes a large amount of stack space (it does so
with gcc too, but does not overflow our 128k stacks). In
preparation for working around the problem, deinline it
so the hacks need not spread to the entire code base
via #include.
There is no performance impact from the virtual function,
as cql parsing will dominate the call.
Raft tests with declarative structure instead of procedural.
* https://github.com/alecco/scylla/tree/raft-ale-tests-03d:
raft: log failed test case name
raft: test add hasher
raft: declarative tests
raft: test make app return proper exit int value
raft: test add support for disconnected server
raft: tests use custom server ids for easier debugging
raft: make election_elapsed public for testing
raft: test remove unnecessary header
raft: fix typo snaphot snapshot
Values seen by nodes were so far added but this does not provide a
guarantee the order of these values was respected.
Use a digest to check output, implicitly checking order.
On the other hand, sum or a simple positional checksum like Fletcher's
is easier to debug as rolling sum is evident.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
For convenience making Raft tests, use declarative structures.
Servers are set up and initialized and then updates are processed.
For now, updates are just adding entries to leader and change of leader.
Updates and leader changes can be specified to run after initial test setup.
An example test for 3 nodes, node 0 starting as leader having two entries
0 and 1 for term 1, and with current term 2, then adding 12 entries,
changing leader to node 1, and adding 12 more entries. The test will
automatically add more entries to the last leader until the test limit
of total_values (default 100).
{.name = "test_name", .nodes = 3, .initial_term = 2,
.initial_states = {{.le = {{1,0},{1,1}}},
.updates = {entries{12},new_leader{1},entries{12}},},
Leader is isolated before change via is_leader returning false.
Initial leader (default server 0) will be set with this method, too.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
The get_range_to_endpoint_map method, takes a keyspace and returns a map
between the token ranges and the endpoint.
It is used by some external tools for repair.
Token ranges are codes as size-2 array, if start or end are empty, they will be
added as an empty string.
The implementation uses get_range_to_address_map and re-pack it
accordingly.
The use of stream_range_as_array it to reduce the risk of large
allocations and stalls.
Relates to scylladb/scylla-jmx#36
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Closes#7329
Tables may have thousands of sstables and a number of component files
for each sstables. Using parallel_for_each on all sstables (and
parallel_for_each in sstables::create_links for each file)
needlessly overloads the system with unbounded number of continuations.
Use max_concurrent_for_each and acquire the db sst_dir_semaphore to
limit parallelism.
Note that although snapshot is called while scylla already
loaded the sstable we use the configured initial_sstable_loading_concurrency().
As a future follow-up we may want to define yet another config
variable for on-going operations on sstable directories
if we see that it warrants a diffrent setting than the initial
loading concurrency.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Although each sstable_directory limits concurrency using
max_concurrent_for_each, there could be a large number
of calls to do_for_each_sstable running in parallel
(e.g per keyspace X per table in the distributed_loader).
To cap parallelism across sstable_directory instances and
concurrent calls to do_for_each_sstable, start a sharded<semaphore>
and pass a shared semaphore& to the sstable_directory:s.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Use common code to create, start, and stop the sharded<sstable_directory>
for each test.
This will be used in the next patch for creating a sharded semaphore
and passing it to the sstable_directory.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
shutil.rmtree(ignore_errors=True) was for ignores error when directory not exist,
but it also ignores permission error, so we shouldn't use that.
Run os.path.exists() before shutil.rmtree() instead.
Fixes#7337Closes#7338
This series prepares the build system for clang support. It deals with
the different sets of warnings accepted by clang and gcc, and with
detecting clang 10 as a supported compiler.
It's still not possible to build with clang after this, but we're
another step closer.
Closes#7269
* github.com:scylladb/scylla:
build: detect and allow clang 10 as a compiler
build: detect availablity of -Wstack-usage=
build: disable many clang-specific warnings
rapidjson has a harmless (but true) ubsan violation. It was fixed
in 16872af889.
Since rapidjson has't released since 2016, we're unlikely to see
the fix, so suppress it to prevent the tests failing. In any case
the violation is harmless.
gcc's ubsan doesn't object to the addition.
Closes#7357
Send more that one entry in single append_entry message but
limit one packets size according to append_request_threshold parameter.
Message-Id: <20201007142602.GA2496906@scylladb.com>
Updated for the ability to add group names to SMP service groups
(https://github.com/scylladb/seastar/pull/809).
* seastar 8c8fd3ed...c62c4a3d (3):
> smp service group: add optional group name
> dpdk: mark link_ready() function override
> Merge "sharded: make start, stop, and invoke_on methods noexcept" from Benny
Fix a race condition in on_compaction_completion() that can prevent shutdown,
as well as an exception handling error. See individual patches for details.
Fixes#7331.
Closes#7334
* github.com:scylladb/scylla:
table: fix mishandled _sstable_deleted_gate exception in on_compaction_completion
table: fix on_compaction_completion corrupting _sstables_compacted_but_not_deleted during self-race
Although process_upload_dir is not called when initially loading
the tables, but rather from from storage_service::load_new_sstables,
it can use the same sstable_loading_concurrency, rather than constant `4`.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Use max_concurrent_for_each instead of parallel_for_each in
sstable_directory::parallel_for_each_restricted to avoid
creating potentially thousands of continuations,
one for each sstable.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Fixes#7345Fixes#7346
Do a more efficient collection skip when doing paging, instead of
iterating the full sets.
Ensure some semblance of sanity in the parent-child relationship
between shards by ensuring token order sorting and finding the
apparent previous ID coverting the approximate range of new gen.
Fix endsequencenumber generation by looking at whether we are
last gen or not, instead of the (not filled in) 'expired' column.
Fixes#7344
It is not data really needed, as shard_id:s are not required
to be unique across streams, and also because the length limit
on shard_id text representation.
As a side effect, shard iter instead carries the stream arn.
Instead of eagerly linearizing all values as they are passed to
validate(), defer linearization to those validators that actually need
linearized values. Linearizing large values puts pressure on the memory
allocator with large contiguous allocation requests. This is something
we are trying to actively avoid, especially if it is not really neaded.
Turns out the types, whose validators really want linearized values are
a minority, as most validators just look at the size of the value, and
some like bytes don't need validation at all, while usually having large
values.
This is achieved by templating the validator struct on the view and
using the FragmentedRange concept to treat all passed in views
(`bytes_view` and `fragmented_temporary_buffer_view`) uniformly.
This patch makes no attempt at converting existing validators to work
with fragmented buffers, only trivial cases are converted. The major
offenders still left are ascii/utf8 and collections.
Fixes: #7318
Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20201007054524.909420-1-bdenes@scylladb.com>
After adding a new node to the cluster, Scylla sends a NEW_NODE event
to CQL clients. Some clients immediately try to connect to the new node,
however it fails as the node has not yet started listening to CQL
requests.
In contrast, Apache Cassandra waits for the new node to start its CQL
server before sending NEW_NODE event. In practice this means that
NEW_NODE and UP events will be sent "jointly" after new node is UP.
This change is implemented in the same manner as in Apache Cassandra
code.
Fixes#7301.
Closes#7306
Fixes#7347
If cdc stream id:s are older than either table creation or now - 24h
we can skip them in describe_stream, to minimize the amount of
shards being returned.
The username becomes known in the course of resolving challenges
from `PasswordAuthenticator`. That's why username is being set on
successful authentication; until then all users are "anonymous".
Meanwhile, `AllowAllAuthenticator` (the default) does not request
username, so users logged with it will remain as "anonymous" in
`system.clients`.
Shuffling of code was necessary to unify existing infrastructure
for INSERTing entries into `system.clients` with later UPDATEs.
"
There are few places left that call for global query processor instance,
the tracing is one of them.
The query pressor is used mainly in table_helper, so this set mostly
shuffles its methods' arguments to deliver the needed reference. At the
end the main.cc code is patched to provide the query processor, which
is still global and not stopped, and is thus safe to be used anywhere.
tests: unit(dev), dtest(cql_tracing:dev)
"
* 'br-tracing-vs-query-processor' of https://github.com/xemul/scylla:
tracing: Keep qp anchor on backend
tracing: Push query processor through init methods
main: Start tracing in main
table_helper: Require local query processor in calls
table_helper: Use local qp as setup_table argument
table_helper: Use local db variable
"
The querier cache has a memory based eviction mechanism, which starts
evicting freshly inserted queriers once their collective memory
consumption goes above the configured limit. For determining the memory
consumption of individual queriers, the querier cache uses
`flat_mutation_reader::buffer_size()`. But we now have a much more
comprehensive accounting of the memory used by queriers: the reader
permit, which also happens to be available in each querier. So use this
to determine the querier's memory consumption instead.
Tests: unit(dev)
"
* 'querier-cache-use-permit-for-memory-accounting/v1' of https://github.com/denesb/scylla:
flat_mutation_reader: de-virtualize buffer_size()
querier_cache: use the reader permit for memory accounting
querier_cache_test: use local semaphore not the test global one
reader_permit: add consumed_resources() accessor
The query processor is required in table_helper's used by tracing. Now
everything is ready to push the query processor reference from main down
to the table helpers.
Because of the current initialization sequence it's only possible to have
the started query processor at the .start_tracing() time. Earlier, when
the sharded<tracing> is started the query processor is not yet started,
so tracing keeps a pointer on local query processor.
When tracing is stopped, the pointer is null-ed. This is safe (but an
assert is put when dereferencing it), because on stop trace writes' gate
is closed and the query processor is only used in them.
Also there's still a chance that tracing remains started in case of start
abort, but this is on-par with the current code -- sharded query processor
is not stopped, so the memory is not freed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The goal is to make tracing keyspace helper reference query processor, so this
patch adds the needed arguments through the initialization stack.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Move the tracing::start_tracing() out of the storage_service::join_cluster. It
anyway happens at the end of the join, so the logic is not changed, but it
becomes possible to patch tracing further.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Keeping the query processor reference on the table_helper in raii manner
seems waistful, the only user of it -- the trace_keyspace_helper -- has
a bunch of helpers on board, each would then keep its own copy for no
gain.
At the same time the trace_keyspace_helper already gets the query processor
for its needs, so it can share one with table_helper-s.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The goal is to make table_helper API require the query_processor
reference and use it where needed. The .setup_table() is private
method, and still grabs the query processor reference itself. Since
its futures do noth reshard, it's safe to carry the query processor
reference through.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The test was broken by recent sstables manager rework. In the middle
the sstables::test_env is destroyed without being closed which leads
to broken _closing assertion inside ~sstables_manager().
Fix is to use the test_env::do_with helper.
tests: perf.memory_footprint
* https://github.com/xemul/scylla/tree/br-memory-footprint-test-fix:
test/perf/memory_footprint: Fix indentation after previous patch
test/perf/memory_footprint: Don't forget to close sstables::test_env after usage
After recent sstables manager rework the sstables::test_env must be .close()d
after usage, otherwise the ~sstables_mananger() hits the _closing assertion.
Do it with the help of .do_with(). The execution context is already seastar::async
in this place, so .get() it explicitly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In some cases a collection is used to keep several elements,
so it's good to know this timing.
For example, a mutation_partition keeps a set of rows, if used
in cache it can grow large, if used in mutation to apply, it's
typically small. Plain replacement of bst into b-tree caused
performance degardation of mutation application because b-tree
is only better at big sizes.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This collection is widely used, any replacement should be
compared against it to better understand pros-n-cons.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
on_compaction_completion tries to handle a gate_closed_exception, but
with_gate() throws rather than creating an exceptional future, so
the extra handling is lost. This is relatively benign since it will
just fail the compaction, requiring that work to be redone later.
Fix by using the safer try_with_gate().
on_compaction_completion() updates _sstables_compacted_but_not_deleted
through a temporary to avoid an exception causing a partial update:
1. copy _sstables_compacted_but_not_deleted to a temporary
2. update temporary
3. do dangerous stuff
4. move temporary to _sstables_compacted_but_not_deleted
This is racy when we have parallel compactions, since step 3 yields.
We can have two invocations running in parallel, taking snapshots
of the same _sstables_compacted_but_not_deleted in step 1, each
modifying it in different ways, and only one of them winning the
race and assigning in step 4. With the right timing we can end
with extra sstables in _sstables_compacted_but_not_deleted.
Before a5369881b3, this was a benign race (only resulting in
deleted file space not being reclaimed until the service is shut
down), but afterwards, extra sstable references result in the service
refusing to shut down. This was observed in database_test in debug
mode, where the race more or less reliably happens for system.truncated.
Fix by using a different method to protect
_sstables_compacted_but_not_deleted. We unconditionally update it,
and also unconditionally fix it up (on success or failure) using
seastar::defer(). The fixup includes a call to rebuild_statistics()
which must happen every time we touch the sstable list.
Fixes#7331.
The main user of this method, the one which required this method to
return the collective buffer size of the entire reader tree, is now
gone. The remaining two users just use it to check the size of the
reader instance they are working with.
So de-virtualize this method and reduce its responsibility to just
returning the buffer size of the current reader instance.
The querier cache has a memory limit it enforces on cached queriers. For
determining how much memory each querier uses, it currently uses
`flat_mutation_reader::buffer_size()`. However, we now have a much more
complete accounting of the memory each read consumes, in the form of the
reader permit, which also happens to be handy in the queriers. So use it
instead of the not very well maintained `buffer_size()`.
In the mutation source, which creates the reader for this test, the
global test semaphore's permit was passed to the created reader
(`tests::make_permit()`). This caused reader resources to be accounted
on the global test semaphore, instead of the local one the test creates.
Just forward the permit passed to the mutation sources to the reader to
fix this.
Merged patch series by Tomasz Grabiec:
UBSAN complains in debug mode when the counter value overflows:
counters.hh:184:16: runtime error: signed integer overflow: 1 + 9223372036854775807 cannot be represented in type 'long int'
Aborting on shard 0.
Overflow is supposed to be supported. Let's silence it by using casts.
Fixes#7330.
Tests:
- build/debug/test/tools/cql_repl --input test/cql/counters_test.cql
Tomasz Grabiec (2):
counters: Avoid signed integer overflow
test: cql: counters: Add tests reproducing signed integer overflow in
debug mode
counters.hh | 2 +-
test/cql/counters_test.cql | 9 ++++++++
test/cql/counters_test.result | 48 +++++++++++++++++++++++++++++++++++++++++++
3 files changed, 58 insertions(+), 1 deletion(-)
UBSAN complains in debug mode when the counter value overflows:
counters.hh:184:16: runtime error: signed integer overflow: 1 + 9223372036854775807 cannot be represented in type 'long int'
Aborting on shard 0.
Overflow is supposed to be supported.
Let's silence it by using casts.
Fixes#7330.
Server address UUID 0 is not a valid server id since there is code that
assumes if server_id is 0 the value is not set (e.g _voted_for).
Prevent users from manually setting this invalid value.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
When AppendEntries/InstallSnapshot term is the same as the current
server's, and the current servers' leader is not set, we should
assign it to avoid starting election if the current leader becomes idle.
Restructure the code accordingly - change candidate state to Follower
upon InstallSnapshot.
It was erroneously replaced by the logic based on time which caused us
to send one probe per tick which is not an intention at all. There can
be one outstanding probe message but the moment it gets a reply next
one should be sent without waiting for a tick.
"
Misc. cleanups and minor optimizations of token_metadata methods
in preparation to futurizing parts of the api around
update_pending_ranges and abstract_replication_strategy::calculate_natural_endpoints,
to prevent reactor stalls on these paths
Test: unit(dev)
"
* 'token_metadata_cleanup' of github.com:bhalevy/scylla:
token_metadata: get rid of unused calculate_pending_ranges_for_* methods
token_metadata: get rid of clone_after_all_settled
token_metadata_impl: remove_endpoint: do not sort tokens
token_metadata_impl: always sort_tokens in place
On some environment such as CentOS8, journalctl --user -xe does not work
since journald is running in volatile mode.
The issue cannnot fix in non-root mode, as a workaround we should logging to
a file instead of journal.
Also added scylla_logrotate to ExecStartPre which rename previous log file,
since StandardOutput=file:/path/to/file will erase existing file when service
restarted.
Fixes#7131Closes#7326
Instead of relying on SCYLLA-VERSION-GEN to be consistently
updated in each submodule, propagate the top-level product-version-release
to all submodules. This reduces the churn required for each release,
and makes the release strings consistent (previously, the git hash in each
was different).
Closes#7268
Alternator does not yet support direct access to nested attributes in
expressions (this is issue #5024). But it's still good to have tests
covering this feature, to make it easier to check the implementation
of this feature when it comes.
Until now we did not have tests for using nested attributes in
*FilterExpression*. This patch adds a test for the straightforward case,
and also adds tests for the more elaborate combination of FilterExpression
and ProjectionExpression. This combination - see issue #6951 - means that
some attributes need to be retrieved despite not being projected (because
they are needed in a filter). When we support nested attributes there will
be special cases when the projected and filtered attributes are parts of
the same top-level attribute, so the code will need to handle those cases
correctly. As I was working on issue #6951 now, it is a good time to write
a test for these special cases, even if nested attributes aren't yet
supported - so we don't forget to handle these special cases later.
Both new tests pass on DynamoDB, and xfail on Alternator.
Refs #5024 (nested attributes)
Refs #6951 (FilterExpression with ProjectionExpression)
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
A comment in test/alternator/test_lsi.py wrongly described the schema
of one of the test tables. Fix that comment.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch provides two more tests for issue #6951. As this issue was
already fixed, the two new tests pass.
The two new test check two special cases for which were handled correctly
but not yet tested - when the projected attribute is a key attribute of
the table or of one of its LSIs. Having these two additional tests will
ensure that any future refactoring or optimizations in the this area of
the code (filtering, projection, and its combination) will not break these
special cases.
Refs #6951.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The DynamoDB API has for the Query and Scan requests two filtering
syntaxes - the old (QueryFilter or ScanFilter) and the new (FilterExpression).
Also for projection, it has an old syntax (AttributesToGet) and a new
one (ProjectionExpression). Combining an old-style and new-style parameter
is forbidden by DynamoDB, and should also be forbidden by Alternator.
This patch fixes, and removes the "xfails" tag, of two tests:
test_query_filter.py::test_query_filter_and_projection_expression
test_filter_expression.py::test_filter_expression_and_attributes_to_get
Refs #6951
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We had a bug when a Query/Scan had both projection (ProjectionExpression
or AttributesToGet) and filtering (FilterExpression or Query/ScanFilter).
The problem was that projection left only the requested attributes, and
the filter might have needed - and not got - additional attributes.
The solution in this patch is to add the generated JSON item also
the extra attributes needed by filtering (if any), run the filter on
that, and only at the end remove the extra filtering attributes from
the item to be returned.
The two tests
test_query_filter.py::test_query_filter_and_attributes_to_get
test_filter_expression.py::test_filter_expression_and_projection_expression
Which failed before this patch now pass so we drop their "xfail" tag.
Fixes#6951.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
* seastar 292ba734bc...8c8fd3ed28 (15):
> semaphore_units: add return_units and return_all
> semaphore_units: release: mark as noexcept
> circular_buffer: support non-default-constructible allocators correctly
> core/shared_ptr: Expose use count through {lw_}enable_shared_from_this
> memory: align allocations to std::max_align_t
> util/log: logger::do_log(): go easier on allocations
> doc: add link to multipage version of tutorial
> doc: fix the output directories of split and tutorial.html
> build: do not copy htmlsplit.py to build dir
> doc: add "--input" and "--output-dir" options to htmlsplit.py
> doc: update split script to use xml.etree.ElementTree
> Merge "shared_future: make functions noexcept" from Benny
> tutorial: add linebreak between sections
> tutorial: format "future<int>" as inline code block
> docs: specify HTML language code for tutorial.html
This patch change the way the request object is passed,
using a reference instead of temporaries.
'exists' test is passing in debug mode, whereas it was
always failing before.
Fixes#7261 by ensuring request object is alive for all commands
during the whole request duration.
Signed-off-by: Etienne Adam <etienne.adam@gmail.com>
Message-Id: <20200924202034.30399-1-etienne.adam@gmail.com>
Lua 5.4 added an extra parameter to lua_resume()[1]. The parameter
denotes the number of arguments yielded, but our coroutines
don't yield any arguments, so we can just ignore it.
Define a macro to allow adding extra stuff with Lua 5.4,
and use it to supply the extra parameter.
[1] https://www.lua.org/manual/5.4/manual.html#8.3Closes#7324
Clang eagerly instantiates templates, apparently with the following
algorithm:
- if both the declaration and definition are seen at the time of
instantiation, instantiate the template
- if only the declaration is see at the time of instantiation, just emit
a reference to the template; even if the definition is later seen,
it is not instantiated
The "reference" in the second case is a relocation entry in the object file
that is satisfied at link time by the linker, but if no other object file
instantiated the needed template, a link error results.
These problems are hard to diagnose but easy to fix. This series fixes all
known such issues in the code base. It was tested on gcc as well.
Closes#7322
* github.com:scylladb/scylla:
query-result-reader: order idl implementations correctly
frozen_schema: order idl implementations correctly
idl-compiler: generate views after serializers
On Ubuntu 16/18 and Debian 9, LimitNOFILE is set to 4096 and not able to override from
user unit.
To run scylla-server in such environment, we need to turn on developer mode and show
warnings.
Fixes#7133Closes#7323
Clang eagerly instantiates templates, so if it needs a template
function for which it has a declaration but not a definition, it
will not instantiate the definition when it sees it. This causes
link errors.
Fix by ordering the idl implementation files so that definitions
come before uses.
Clang eagerly instantiates templates, so if it needs a template
function for which it has a declaration but not a definition, it
will not instantiate the definition when it sees it. This causes
link errors.
Fix by ordering the idl implementation files so that definitions
come before uses.
Clang eagerly instantiates templates, so if it needs a template
function for which it has a declaration but not a definition, it
will not instantiate the definition when it sees it. This causes
link errors.
In this case, the views use the serializer implementations, but are
generated before them.
Fix by generating the view implementations after the serializer
implementations that they use.
Add disable option for test configuration.
Tests in this list will be disabled for all modes.
* alejo/next-disable-raft-tests-01:
Raft: disable boost tests for now
Tests: add disable to configuration
Raft: Remove tests for now
For suite.yaml add an extra configuration option disable.
Tests in this list will disabled for all modes.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Remove raft C++ tests until raft is included in build process.
[tgrabiec]: Fixes test.py failure. Tests are not compiled unless --build-raft is
passed to configure.py and we cannot enable it by default yet.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Message-Id: <20201002102847.1140775-1-alejo.sanchez@scylladb.com>
Unavailable exception means that operation was not started and it can be
retried safely. If lwt fails in the learn stage though it most
certainly means that its effect will be observable already. The patch
returns timeout exception instead which means uncertainty.
Fixes#7258
Message-Id: <20201001130724.GA2283830@scylladb.com>
This is the beginning of raft protocol implementation. It only supports
log replication and voter state machine. The main difference between
this one and the RFC (besides having voter state machine) is that the
approach taken here is to implement raft as a deterministic state
machine and move all the IO processing away from the main logic.
To do that some changes to RPC interface was required: all verbs are now
one way meaning that sending a request does not wait for a reply and
the reply arrives as a separate message (or not at all, it is safe to
drop packets).
* scylla-dev/raft-v4:
raft: add a short readme file
raft: compile raft tests
raft: add raft tests
raft: Implement log replication and leader election
raft: Introduce raft interface header
Compilation is not enabled by default as it requires coroutines support
and may require special compiler (until distributed one fixes all the
bugs related to coroutines). To enable raft tests compilation new
configure.py option is added (--build-raft).
Add test for currently implemented raft features. replication_test
tests replication functionality with various initial log configurations.
raft_fsm_test test voting state machine functionality.
This patch introduces partial RAFT implementation. It has only log
replication and leader election support. Snapshotting and configuration
change along with other, smaller features are not yet implemented.
The approach taken by this implementation is to have a deterministic
state machine coded in raft::fsm. What makes the FSM deterministic is
that it does not do any IO by itself. It only takes an input (which may
be a networking message, time tick or new append message), changes its
state and produce an output. The output contains the state that has
to be persisted, messages that need to be sent and entries that may
be applied (in that order). The input and output of the FSM is handled
by raft::server class. It uses raft::rpc interface to send and receive
messages and raft::storage interface to implement persistence.
This commit introduce public raft interfaces. raft::server represents
single raft server instance. raft::state_machine represents a user
defined state machine. raft::rpc, raft::rpc_client and raft::storage are
used to allow implementing custom networking and storage layers.
A shared failure detector interface defines keep-alive semantics,
required for efficient implementation of thousands of raft groups.
Recently, the cql_server_config::max_concurrent_requests field was
changed to be an updateable_value, so that it is updated when the
corresponding option in Scylla's configuration is live-reloaded.
Unfortunately, due to how cql_server is constructed, this caused
cql_server instances on all shards to store an updateable_value which
pointed to an updateable_value_source on shard 0. Unsynchronized
cross-shard memory operations ensue.
The fix changes the cql_server_config so that it holds a function which
creates an updateable_value appropriate for the given shard. This
pattern is similar to another, already existing option in the config:
get_service_memory_limiter_semaphore.
This fix can be reverted if updateable_value becomes safe to use across
shards.
Tests: unit(dev)
Fixes: #7310
The 'redis_database_count' was already existing, but
was not used when initializing the keyspaces. This
patch merely uses it. I think it's better that way, it
seems cleaner not to create 15 x 5 tables when we
use only one redis database.
Also change a test to test with a higher max number
of database.
Signed-off-by: Etienne Adam <etienne.adam@gmail.com>
Message-Id: <20200930210256.4439-1-etienne.adam@gmail.com>
In order to improve observability, add a username field to the the
system_traces.sessions table. The system table should be change
while upgrading by running the fix_system_distributed_tables.py
script. Until the table is updated, the old behaviour is preserved.
Fixes#6737.
This miniseries enhances the code from #7279 by:
* adding metrics for shed requests, which will allow to pinpoint the problem if the max concurrent requests threshold is too low
* making the error message more comprehensive by pointing at the variable used to set max concurrent requests threshold
Example of an ehanced error message:
```
ConnectionException('Failed to initialize new connection to 127.0.0.1: Error from server: code=1001 [Coordinator node overloaded] message="too many in-flight requests (configured via max_concurrent_requests_per_shard): 18"',)})
```
Closes#7299
* github.com:scylladb/scylla:
transport: make _requests_serving param uint32_t
transport: make overloaded error message more descriptive
transport: add requests_shed metrics
Call sort_tokens at the caller as all call sites from
within token_metadata_impl call remove_endpoint for multiple
endpoints so the tokens can be re-sorted only once, when done
removing all tokens.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It's not realistic for a shard to have over 4 billion concurrent
requests, so this value can be safely represented in 32 bits.
Also, since the current concurrency limit is represented in uint32_t,
it makes sense for these two to have matching types.
"
The last major untracked area of the reader pipeline is the reader
buffers. These scale with the number of readers as well as with the size
and shape of data, so their memory consumption is unpredictable varies
wildly. For example many small rows will trigger larger buffers
allocated within the `circular_buffer<mutation_fragment>`, while few
larger rows will consume a lot of external memory.
This series covers this area by tracking the memory consumption of both
the buffer and its content. This is achieved by passing a tracking
allocator to `circular_buffer<mutation_fragment>` so that each
allocation it makes is tracked. Additionally, we now track the memory
consumption of each and every mutation fragment through its whole
lifetime. Initially I contemplated just tracking the `_buffer_size` of
`flat_mutation_reader::impl`, but concluded that as our reader trees are
typically quite deep, this would result in a lot of unnecessary
`signal()`/`consume()` calls, that scales with the number of mutation
fragments and hence adds to the already considerable per mutation
fragment overhead. The solution chosen in this series is to instead
track the memory consumption of the individual mutation fragments, with
the observation that these are typically always moved and very rarely
copied, so the number of `signal()`/`consume()` calls will be minimal.
This additional tracking introduces an interesting dilemma however:
readers will now have significant memory on their account even before
being admitted. So it may happen that they can prevent their own
admission via this memory consumption. To prevent this, memory
consumption is only forwarded to the semaphore upon admission. This
might be solved when the semaphore is moved to the front -- before the
cache.
Another consequence of this additional, more complete tracking is that
evictable readers now consume memory even when the underlying reader is
evicted. So it may happen that even though no reader is currently
admitted, all memory is consumed from the semaphore. To prevent any such
deadlocks, the semaphore now admits a reader unconditionally if no
reader is admitted -- that is if all count resources all available.
Refs: #4176
Tests: unit(dev, debug, release)
"
* 'track-reader-buffers/v2' of https://github.com/denesb/scylla: (37 commits)
test/manual/sstable_scan_footprint_test: run test body in statement sched group
test/manual/sstable_scan_footprint_test: move test main code into separate function
test/manual/sstable_scan_footprint_test: sprinkle some thread::maybe_yield():s
test/manual/sstable_scan_footprint_test: make clustering row size configurable
test/manual/sstable_scan_footprint_test: document sstable related command line arguments
mutation_fragment_test: add exception safety test for mutation_fragment::mutate_as_*()
test: simple_schema: add make_static_row()
reader_permit: reader_resources: add operator==
mutation_fragment: memory_usage(): remove unused schema parameter
mutation_fragment: track memory usage through the reader_permit
reader_permit: resource_units: add permit() and resources() accessors
mutation_fragment: add schema and permit
partition_snapshot_row_cursor: row(): return clustering_row instead of mutation_fragment
mutation_fragment: remove as_mutable_end_of_partition()
mutation_fragment: s/as_mutable_partition_start/mutate_as_partition_start/
mutation_fragment: s/as_mutable_range_tombstone/mutate_as_range_tombstone/
mutation_fragment: s/as_mutable_clustering_row/mutate_as_clustering_row/
mutation_fragment: s/as_mutable_static_row/mutation_as_static_row/
flat_mutation_reader: make _buffer a tracked buffer
mutation_reader: extract the two fill_buffer_result into a single one
...
Today when ever we are building scylla in a singel mode we still
building jmx, tools and python3 for all dev,release and debug.
Let's make sure we build only in relevant build mode
Also adding unified-tar to ninja build
Closes#7260
`trace_keyspace_helper::make_slow_query_mutation_data` expected a
"query" key in its parameters, which does not appear in case of
e.g. batches of prepared statements. This is example of failing
`record.parameters`:
```
...{"query[0]" : "INSERT INTO ks.tbl (pk, i) values (?, ?);"},
{"query[1]" : "INSERT INTO ks.tbl (pk, i) values (?, ?);"}...
```
In such case Scylla recorded no trace and said:
```
ERROR 2020-09-28 10:09:36,696 [shard 3] trace_keyspace_helper - No
"query" parameter set for a session requesting a slow_query_log record
```
Fix here is to leave query empty if not found. The users can still
retrieve the query contents from existing info.
Fixes#5843Closes#7293
The blacklog of current and max in VIEW_BACKLOG is not update but the
nodes are updating VIEW_BACKLOG all the time. For example:
```
INFO 2020-03-06 17:13:46,761 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.3, app_state=VIEW_BACKLOG, versioned_value=Value(0:18446744073709551615:1583486026590,718)
INFO 2020-03-06 17:13:46,821 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.2, app_state=VIEW_BACKLOG, versioned_value=Value(0:18446744073709551615:1583486026531,742)
INFO 2020-03-06 17:13:47,765 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.3, app_state=VIEW_BACKLOG, versioned_value=Value(0:18446744073709551615:1583486027590,721)
INFO 2020-03-06 17:13:47,825 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.2, app_state=VIEW_BACKLOG, versioned_value=Value(0:18446744073709551615:1583486027531,745)
INFO 2020-03-06 17:13:48,772 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.3, app_state=VIEW_BACKLOG, versioned_value=Value(0:18446744073709551615:1583486028590,726)
INFO 2020-03-06 17:13:48,833 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.2, app_state=VIEW_BACKLOG, versioned_value=Value(0:18446744073709551615:1583486028531,750)
INFO 2020-03-06 17:13:49,772 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.3, app_state=VIEW_BACKLOG, versioned_value=Value(0:18446744073709551615:1583486029590,729)
INFO 2020-03-06 17:13:49,832 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.2, app_state=VIEW_BACKLOG, versioned_value=Value(0:18446744073709551615:1583486029531,753)
```
The downside of such updates:
- Introduces more gossip exchange traffic
- Updates system.peers all the time
The extra unnecessary gossip traffic is fine to a cluster in a good
shape but when some of the nodes or shards are loaded, such messages and
the handling of such messages can make the system even busy.
With this patch, VIEW_BACKLOG is updated only when the backlog is really
updated.
Btw, we can even make the update only when the change of the backlog is
great than a threshold, e.g., 5%, which can reduce the traffic even
further.
Fixes#5970
We used to calculate the number of endpoints for quorum and local_quorum
unconditionally as ((rf / 2) + 1). This formula doesn't take into
account the corner case where RF = 0, in this situation quorum should
also be 0.
This commit adds the missing corner case.
Tests: Unit Tests (dev)
Fixes#6905Closes#7296
Detect if the compiler supports -Wstack-usage= and only enable it
if supported. Because each mode uses a different threshold, have each
mode store the threshold value and merge it into the flags only after
we decided we support it.
The switch to make it only a warning is made conditional on compiler
support.
On some environment, systemctl enable <service> fails when we use symlink.
So just directly copy systemd units to ~/.config/systemd/user, instead of
creating symlink.
Fixes#7288Closes#7290
After cleaning up old cluster features (253a7640e3)
the code for special handling of 1.7.4 counter order was effectively
only used in its own tests, so it can be safely removed.
Closes#7289
This series approaches issue #7072 and provides a very simple mechanism for limiting the number of concurrent CQL requests being served on a shard. Once the limit is hit, new requests will be instantly refused and OverloadedException will be returned to the client.
This mechanism has many improvement opportunities:
* shedding requests gradually instead of having one hard limit,
* having more than one limit per different types of queries (reads, writes, schema changes, ...),
* not using a preconfigured value at all, and instead figuring out the limit dynamically,
* etc.
... and none of these are taken into account in this series, which only adds a very basic configuration variable. The variable can be updated live without a restart - it can be done by updating the .yaml file and triggering a configuration re-read via sending the SIGHUP signal to Scylla.
The default value for this parameter is a very large number, which translates to effectively not shedding any requests at all.
Refs #7072Closes#7279
* github.com:scylladb/scylla:
transport: make max_concurrent_requests_per_shard reloadable
transport: return exceptional future instead of throwing
transport,config: add a param for max request concurrency
exceptions: make a single-param constructor explicit
exceptions: add a constructor based on custom message
This reverts commit 8366d2231d because it
causes the following "scylla_setup" failure on Ubuntu 16.04:
Command: 'sudo /usr/lib/scylla/scylla_setup --nic ens5 --disks /dev/nvme0n1 --swap-directory / '
Exit code: 1
Stdout:
Setting up libtomcrypt0:amd64 (1.17-7ubuntu0.1) ...
Setting up chrony (2.1.1-1ubuntu0.1) ...
Creating '_chrony' system user/group for the chronyd daemon…
Creating config file /etc/chrony/chrony.conf with new version
Processing triggers for libc-bin (2.23-0ubuntu11.2) ...
Processing triggers for ureadahead (0.100.0-19.1) ...
Processing triggers for systemd (229-4ubuntu21.29) ...
501 Not authorised
NTP setup failed.
Stderr:
chrony.service is not a native service, redirecting to systemd-sysv-install
Executing /lib/systemd/systemd-sysv-install enable chrony
Traceback (most recent call last):
File "/opt/scylladb/scripts/libexec/scylla_ntp_setup", line 63, in <module>
run('chronyc makestep')
File "/opt/scylladb/scripts/scylla_util.py", line 504, in run
return subprocess.run(cmd, stdout=stdout, stderr=stderr, shell=shell, check=exception, env=scylla_env).returncode
File "/opt/scylladb/python3/lib64/python3.8/subprocess.py", line 512, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['chronyc', 'makestep']' returned non-zero exit status 1.
This configuration entry is expected to be used as a quick fix
for an overloaded node, so it should be possible to reload this value
without having to restart the server.
The newly introduced parameter - max_concurrent_requests_per_shard
- can be used to limit the number of in-flight requests a single
coordinator shard can handle. Each surplus request will be
immediately refused by returning OverloadedException error to the client.
The default value for this parameter is large enough to never
actually shed any requests.
Currently, the limit is only applied to CQL requests - other frontends
like alternator and redis are not throttled yet.
The memory usage is now maintained and updated on each change to the
mutation fragment, so it needs not be recalculated on a call to
`memory_usage()`, hence the schema parameter is unused and can be
removed.
The memory usage of mutation fragments is now tracked through its
lifetime through a reader permit. This was the last major (to my current
knowledge) untracked piece of the reader pipeline.
We want to start tracking the memory consumption of mutation fragments.
For this we need schema and permit during construction, and on each
modification, so the memory consumption can be recalculated and pass to
the permit.
In this patch we just add the new parameters and go through the insane
churn of updating all call sites. They will be used in the next patch.
We will soon want to update the memory consumption of mutation fragment
after each modification done to it, to do that safely we have to forbid
direct access to the underlying data and instead have callers pass a
lambda doing their modifications.
Uses where this method was just used to move the fragment away are
converted to use `as_mutation_start() &&`.
We will soon want to update the memory consumption of mutation fragment
after each modification done to it, to do that safely we have to forbid
direct access to the underlying data and instead have callers pass a
lambda doing their modifications.
Uses where this method was just used to move the fragment away are
converted to use `as_range_tombstone() &&`.
We will soon want to update the memory consumption of mutation fragment
after each modification done to it, to do that safely we have to forbid
direct access to the underlying data and instead have callers pass a
lambda doing their modifications.
Uses where this method was just used to move the fragment away are
converted to use `as_clustering_row() &&`.
We will soon want to update the memory consumption of mutation fragment
after each modification done to it, to do that safely we have to forbid
direct access to the underlying data and instead have callers pass a
lambda doing their modifications.
Uses where this method was just used to move the fragment away are
converted to use `as_static_row() &&`.
Via a tracked_allocator. Although the memory allocations made by the
_buffer shouldn't dominate the memory consumption of the read itself,
they can still be a significant portion that scales with the number of
readers in the read.
Not used yet, this patch does all the churn of propagating a permit
to each impl.
In the next patch we will use it to track to track the memory
consumption of `_buffer`.
OverloadedException was historically only used when the number
of in-flight hints got too high. The other constructor will be useful
for using OverloadedException in other scenarios.
This can be used with standard containers and other containers that use
the std::allocator interface to track the allocations made by them via a
reader_permit.
In the next patches we plan to start tracking the memory consumption of
the actual allocations made by the circular_buffer<mutation_fragment>,
as well as the memory consumed by the mutation fragments.
This means that readers will start consuming memory off the permit right
after being constructed. Ironically this can prevent the reader from
being admitted, due to its own pre-admission memory consumption. To
prevent this hold on forwarding the memory consumption to the semaphore,
until the permit is actually admitted.
Track all resources consumed through the permit inside the permit. This
allows querying how much memory each read is consuming (as there should
be one read per permit). Although this might be interesting, especially
when debugging OOM cores, the real reason we are doing this is to be
able forward resource consumption to the semaphore only post-admission.
More on this in the patch introducing this.
Another advantage of tracking resources consumed through the permit is
that now we can detect resource leaks in the permit destructor and
report them. Even if it is just a case of the holder of the resources
wanting to release the resources later, with the permit destroyed it
will cause use-after-free.
In the next patches the reader permit will gain members that are shared
across all instances of the same permit. To facilitate this move all
internals into an impl class, of which the permit stores a shared
pointer. We use a shared_ptr to avoid defining `impl` in the header.
This is how the reader permit started in the beginning. We've done a
full circle. :)
And do all consuming and signalling through these methods. These
operations will soon be more involved than the simple forwarding they do
today, so we want to centralize them to a single method pair.
In the next patches we want to introduce per-permit resource tracking --
that is, have each permit track the amount of resource consumed through
it. For this, we need all consumption to happen through a permit, and
not directly with the semaphore.
To ensure progress at all times. This is due to evictable readers, who
still hold on to a buffer even when their underlying reader is evicted.
As we are introducing buffer and mutation fragment tracking in the next
patches, these readers will hold on to memory even in this state, so it
may theoretically happen that even though no readers are admitted (all
count resources all available) no reader can be admitted due to lack of
memory. To prevent such deadlocks we now always admit one reader if all
count resource are available.
Current code uses a single counter to produce multiple buffer worth of
data. This uses carry-on from on buffer to the other, which happens to
work with the current memory accounting but is very fragile. Account
each buffer separately, resetting the counter between them.
The test consumes all resources off the semaphore, leaving just enough
to admit a single reader. However this amount is calculated based on the
base cost of readers, but as we are going to track reader buffers as
well, the amount of memory consumed will be much less predictable.
So to make sure background readers can finish during shutdown, release
all the consumed resources before leaving scope.
No point in continuing processing the entire buffer once a failure was
found. Especially that an early failure might introduce conditions that
are not handled in the normal flow-path. We could handle these but there
is no point in this added complexity, at this point the test is failed
anyway.
Some tests rely on `consume*()` calls on the permit to take effect
immediately. Soon this will only be true once the permit has been
admitted, so make sure the permit is admitted in these tests.
Currently per-shard reader contexts are cleaned up as soon as the reader
itself is destroyed. This causes two problems:
* Continuations attached to the reader destroy future might rely on
stuff in the context being kept alive -- like the semaphore.
* Shard 0's semaphore is special as it will be used to account buffers
allocated by the multishard reader itself, so it has to be alive until
after all readers are destroyed.
This patch changes this so that contexts are destroyed only when the
lifecycle policy itself is destroyed.
* seastar e215023c7...292ba734b (4):
> future: Fix move of futures of reference type
> doc: fix hyper link to tutorial.html
> tutorial: fix formatting of code block
> README.md: fix the formatting of table
The reader recreation mechanism is a very delicate and error-prone one,
as proven by the countless bugs it had. Most of these bugs were related
to the recreated reader not continuing the read from the expected
position, inserting out-of-order fragments into the stream.
This patch adds a defense mechanism against such bugs by validating the
start position of the recreated reader.
The intent is to prevent corrupt data from getting into the system as
well as to help catch these bugs as close to the source as possible.
Fixes: #7208
Tests: unit(dev), mutation_reader_test:debug (v4)
* botond/evictable-reader-validate-buffer/v5:
mutation_reader_test: add unit test for evictable reader self-validation
evictable_reader: validate buffer after recreation the underlying
evictable_reader: update_next_position(): only use peek'd position on partition boundary
mutation_reader_test: add unit test for evictable reader range tombstone trimming
evictable_reader: trim range tombstones to the read clustering range
position_in_partition_view: add position_in_partition_view before_key() overload
flat_mutation_reader: add buffer() accessor
The reader recreation mechanism is a very delicate and error-prone one,
as proven by the countless bugs it had. Most of these bugs were related
to the recreated reader not continuing the read from the expected
position, inserting out-of-order fragments into the stream.
This patch adds a defense mechanism against such bugs by validating the
start position of the recreated reader. Several things are checked:
* The partition is the expected one -- the one we were in the middle of
or the next if we stopped at partition boundaries.
* The partition is in the read range.
* The first fragment in the partition is the expected one -- has a
an equal or larger position than the next expected fragment.
* The fragment is in the clustering range as defined by the slice.
As these validations are only done on the slow-path of recreating an
evicted reader, no performance impact is expected.
`evictable_reader::update_next_position()` is used to record the position the
reader will continue from, in the next buffer fill. This position is used to
create the partition slice when the underlying reader is evicted and has
to be recreated. There is an optimization in this method -- if the
underlying's buffer is not empty we peek at the first fragment in it and
use it as the next position. This is however problematic for buffer
validation on reader recreation (introduced in the next patch), because
using the next row's position as the next pos will allow for range
tombstones to be emitted with before_key(next_pos.key()), which will
trigger the validation. Instead of working around this, just drop this
optimization for mid-partition positions, it is inconsequential anyway.
We keep it for where it is important, when we detect that we are at a
partition boundary. In this case we can avoid reading the current
partition altogether when recreating the reader.
Currently mutation sources are allowed to emit range tombstones that are
out-of the clustering read range if they are relevant to it. For example
a read of a clustering range [ck100, +inf), might start with:
range_tombstone{start={ck1, -1}, end={ck200, 1}},
clustering_row{ck100}
The range tombstone is relevant to the range and the first row of the
range so it is emitted as first, but its position (start) is outside the
read range. This is normally fine, but it poses a problem for evictable
reader. When the underlying reader is evicted and has to be recreated
from a certain clustering position, this results in out-of-order
mutation fragments being inserted into the middle of the stream. This is
not fine anymore as the monotonicity guarantee of the stream is
violated. The real solution would be to require all mutation sources to
trim range tombstones to their read range, but this is a lot of work.
Until that is done, as a workaround we do this trimming in the evictable
reader itself.
The "mode" variable name is used everywhere, usually in a loop.
Therefore, rename the global "mode" to "checkheaders_mode" so that if
your code block happens to be outside of a loop, you don't accidentally
use the globally visible "mode" and spend hours debugging why it's
always "dev".
Spotted by Yaron Kaikov.
Message-Id: <20200924112237.315817-1-penberg@scylladb.com>
The script pull_github_pr.sh uses git merge's "--log" option to put in
the merge commit the list of titles of the individual patches being
merged in. This list is useful when later searching the log for the merge
which introduced a specific feature.
Unfortunately, "--log" defaults to cutting off the list of commit titles
at 20 lines. For most merges involving fewer than 20 commits, this makes
no difference. But some merges include more than 20 commits, and get
a truncated list, for no good reason. If someone worked hard to create a
patch set with 40 patches, the last thing we should be worried about is
that the merge commit message will be 20 lines longer.
Unfortunately, there appears to be no way to tell "--log" to not limit
the length at all. So I chose an arbitrary limit of 1000. I don't think
we ever had a patch set in Scylla which exceeded that limit. Yet :-)
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200924114403.817893-1-nyh@scylladb.com>
This patch fixes a race between two methods in hints manager: drain_for
and store_hint.
The first method is called when a node leaves the cluster, and it
'drains' end point hints manager for that node (sends out all hints for
that node). If this method is called when the local node is being
decomissioned or removed, it instead drains hints managers for all
endpoints.
In the case of decomission/remove, drain_for first calls
parallel_for_each on all current ep managers and tells them to drain
their hints. Then, after all of them complete, _ep_managers.clear() is
called.
End point hints managers are created lazily and inserted into
_ep_managers map the first time a hint is stored for that node. If
this happens between parallel_for_each and _ep_managers.clear()
described above, the clear operation will destroy the new ep manager
without draining it first. This is a bug and will trigger an assert in
ep manager's destructor.
To solve this, a new flag for the hints manager is added which is set
when it drains all ep managers on removenode/decommission, and prevents
further hints from being written.
Fixes#7257Closes#7278
Currently, sstable_manager is used to create sstables, but it loses track
of them immediately afterwards. This series makes an sstable's life fully
contained within its sstable_manager.
The first practical impact (implemented in this series) is that file removal
stops being a background job; instead it is tracked by the sstable_manager,
so when the sstable_manager is stopped, you know that all of its sstable
activity is complete.
Later, we can make use of this to track the data size on disk, but this is not
implemented here.
Closes#7253
* github.com:scylladb/scylla:
sstables: remove background_jobs(), await_background_jobs()
sstables: make sstables_manager take charge of closing sstables
test: test_env: hold sstables_manager with a unique_ptr
test: drop test_sstable_manager
test: sstables::test_env: take ownership of manager
test: broken_sstable_test: prepare for asynchronously closed sstables_manager
test: sstable_utils: close test_env after use
test: sstable_test: dont leak shared_sstable outside its test_env's lifetime
test: sstables::test_env: close self in do_with helpers
test: perf/perf_sstable.hh: prepare for asynchronously closed sstables_manager
test: view_build_test: prepare for asynchronously closed sstables_manager
test: sstable_resharding_test: prepare for asynchronously closed sstables_manager
test: sstable_mutation_test: prepare for asynchronously closed sstables_manager
test: sstable_directory_test: prepare for asynchronously closed sstables_manager
test: sstable_datafile_test: prepare for asynchronously closed sstables_manager
test: sstable_conforms_to_mutation_source_test: remove references to test_sstables_manager
test: sstable_3_x_test: remove test_sstables_manager references
test: schema_changes_test: drop use of test_sstables_manager
mutation_test: adjust for column_family_test_config accepting an sstables_manager
test: lib: sstable_utils: stop using test_sstables_manager
test: sstables test_env: introduce manager() accessor
test: sstables test_env: introduce do_with_async_sharded()
test: sstables test_env: introduce do_with_async_returning()
test: lib: sstable test_env: prepare for life as a sharded<> service
test: schema_changes_test: properly close sstables::test_env
test: sstable_mutation_test: avoid constructing temporary sstables::test_env
test: mutation_reader_test: avoid constructing temporary sstables::test_env
test: sstable_3_x_test: avoid constructing temporary sstables::test_env
test: lib: test_services: pass sstables_manager to column_family_test_config
test: lib: sstables test_env: implement tests_env::manager()
test: sstable_test: detemplate write_and_validate_sst()
test: sstable_test_env: detemplate do_with_async()
test: sstable_datafile_test: drop bad 'return'
table: clear sstable set when stopping
table: prevent table::stop() race with table::query()
database: close sstable_manager:s
sstables_manager: introduce a stub close()
sstable_directory_test: fix threading confusion in make_sstable_directory_for*() functions
test: sstable_datafile_test: reorder table stop in compaction_manager_test
test: view_build_test: test_view_update_generator_register_semaphore_unit_leak: do not discard future in timer
test: view_build_test: fix threading in test_view_update_generator_register_semaphore_unit_leak
view: view_update_generator: drop references to sstables when stopping
Echo message does not need to access gossip internal states, we can run
it on all shards and avoid forwarding to shard zero.
This makes gossip marking node up more robust when shard zero is loaded.
There is an argument that we should make echo message return only when
all shards have responded so that all shards are live and responding.
However, in a heavily loaded cluster, one shard might be overloaded on
multiple nodes in the cluster at the same time. If we require echo
response on all shards, we have a chance local node will mark all peer
nodes as down. As a result, the whole cluster is down. This is much
worse than not excluding a node with a slow shard from a cluster.
Refs: #7197
Gossip echo message is used to confirm a node is up. In a heavily loaded
slow cluster, a node might take a long time to receive a heart beat
update, then the node uses the echo message to confirm the peer node is
really up.
If the echo message timeout too early, the peer node will not be marked
as up. This is bad because a live node is marked as down and this could
happen on multiple nodes in the cluster which causes cluster wide
unavailability issue. In order to prevent multiple nodes to marked as
down, it is better to be conservative and less restrictive on echo
message timeout.
Note, echo message is not used to detect a node down. Increasing the
echo timeout does not have any impact on marking a node down in a timely
manner.
Refs: #7197
Currently, closing sstables happens from the sstable destructor.
This is problematic since a destructor cannot wait for I/O, so
we launch the file close process in the background. We therefore
lose track of when the closing actually takes place.
This patch makes sstables_manager take charge of the close process.
Every sstable is linked into one of two intrusive lists in its
manager: _active or _undergoing_close. When the reference count
of the sstable drops to zero, we move it from _active to
_undergoing_close and begin closing the files. sstables_manager
remembers all closes and when sstables_manager::close() is called,
it waits for all of them to complete. Therefore,
sstables_manager::close() allows us to know that all files it
manages are closed (and deleted if necessary).
The sstables_manager also gains a destructor, which disables
move construction.
sstables_manager should not be movable (since sstables hold a reference
to it). A following patch will enforce it.
Prepare by using unique_ptr to hold test_env::_manager. Right now, we'll
invoke sstables_manager move construction when creating a test_env with
do_with().
We could have chosen to update sstables when their sstables_manager is
moved, but we get nothing for the complexity.
With no users left (apart from some variants of column_family_test_config
which are removed in this patch) there are no more users, so remove it.
test_sstable_manager is obstructs sstables_manager from taking charge
of sstables ownership, since it a thread-local object. We can't close it,
since it will be used in the next test to run.
Instead of using test_sstables_manager, which we plan to drop,
carry our own sstables_manager in test_env, and close it when
test_env::stop() is called.
do_write_sst() creates a test_env, creates a shared_sstable using that test_env,
and destroys the test_env, and returns the sstable. This works now but will
stop working once sstable_manager becomes responsible for sstable lifetime.
Fortunately, do_write_sst() has one caller that isn't interested in the
return value at all, so fold it into that caller.
The test_env::do_with() are convenient for creating a scope
containing a test_env. Prepare them for asynchronously closed
sstables_manager by closing the test_env after use (which will,
in the future, close the embedded sstables_manager).
sstables_manager will soon be closed asynhronously, with a future-returning
close() function. To prepare for that, make the following changes
- replace test_sstables_manager with an sstables_manager obtained from test_env
- drop unneeded calls to await_background_jobs()
These changes allow lifetime management of the sstables_manager used
in the tests to be centralized in test_env.
sstables_manager will soon be closed asynhronously, with a future-returning
close() function. To prepare for that, make the following changes
- acquire a test_env with test_env::do_with() (or the sharded variant)
- change the sstable_from_existing_file function to be a functor that
works with either cql_test_env or test_env (as this is what individual
tests want); drop use of test_sstables_manager
- change new_sstable() to accept a test_env instead of using test_sstables_manager
- replace test_sstables_manager with an sstables_manager obtained from test_env
These changes allow lifetime management of the sstables_manager used
in the tests to be centralized in test_env.
sstables_manager will soon be closed asynhronously, with a future-returning
close() function. To prepare for that, make the following changes
- replace on-stack test_env with test_env::do_with()
- use the variant of column_family_for_tests that accepts an sstables_manager
- replace test_sstables_manager with an sstables_manager obtained from test_env
These changes allow lifetime management of the sstables_manager used
in the tests to be centralized in test_env.
Since test_env now calls await_background_jobs on termination, those
calls are dropped.
Use the sstables_manager from test_env. Use do_with_async() to create the test_env,
to allow for proper closing.
Since do_with_async() also takes care of await_background_jobs(), remove that too.
test_sstables_manager is going away, so replace it by test_env::manager().
column_family_test_config() has an implicit reference to test_sstables_manager,
so pass test_env::manager() as a parameter.
Calls to await_background_jobs() are removed, since test_env::stop() performs
the same task.
The large rows tests are special, since they use a custom sstables_manager,
so instead of using a test_env, they just close their local sstables_manager.
Acquire a test_env and extract an sstables_manager from that, passing it
to column_familty_test_config, in preparation for losing the default
constructor of column_familty_test_config.
Some tests need a sharded sstables_manager, prepare for that by
adding a stop() method and helpers for creating a sharded service.
Since test_env doesn't yet contain its own sstable_manager, this
can't be used in real life yet.
sstables::test_env needs to be properly closed (and will soon need it
even more). Use test_env::do_with_async() to do that. Removed
await_background_jobs(), which is now done by test_env::close().
A test_env contains an sstables_manager, which will soon have a close() method.
As such, it can no longer be a temporary. Switch to using test_env::do_with_async().
As a bonus, test_env::do_with_async() performs await_background_jobs() for us, so
we can drop it from the call sites.
A test_env contains an sstables_manager, which will soon have a close() method.
As such, it can no longer be a temporary. Switch to using test_env::do_with_async().
A test_env contains an sstables_manager, which will soon have a close() method.
As such, it can no longer be a temporary. Switch to using test_env::do_with_async().
As a bonus, test_env::do_with_async() performs await_background_jobs() for us, so
we can drop it from the call sites.
Since we're dropping test_sstables_manager, we'll require callers to pass it
to column_family_test_config, so provide overloads that accept it.
The original overloads (that don't accept an sstables_manager) remain for
the transition period.
Some tests are now referencing the global test_sstables_manager,
which we plan to remove. Add test_env::manager() as a way to
reference the sstables_manager that the test_env contains.
The pattern
return function_returning_a_future().get();
is legal, but confusing. It returns an unexpected std::tuple<>. Here,
it doesn't do any harm, but if we try to coerce the surrounding code
into a signature (void ()), then that will fail.
Remove the unneeded and unexpected return.
Drop references to a table's sstables when stopping it, so that
the sstable_manager can start deleting it. This includes staging
sstables.
Although the table is no longer in use at this time, maintain
cache synchronity by calling row_cache::invalidate() (this also
has the benefit of avoiding a stall in row_cache's destructor).
We also refresh the cache's view of the sstable set to drop
the cache's references.
Take the gate in table::query() so that stop() waits for queries. The gate
is already waited for in table::stop().
This allows us to know we are no longer using the table's sstables in table::stop().
sstables_manager is going to take charge of its sstables lifetimes,
so it will need a close() to wait until sstables are deleted.
This patch adds sstables_manager::close() so that the surrounding
infrastructure can be wired to call it. Once that's done, we can
make it do the waiting.
The make_sstable_directory_for*() functions run in a thread, and
call functions that run in a thread, but return a future. This
more or less works but is a dangerous construct that can fail.
Fix by returning a regular value.
Stopping a table will soon close its sstables; so the next check will fail
as the number of sstables for the table will be zero.
Reorder the stop() call to make it safe.
We don't need the stop() for the check, since the previous loop made sure
compactions completed.
test_view_update_generator_register_semaphore_unit_leak creates a continuation chain
inside a timer, but does not wait for it. This can result in part of the chain
being executed after its captures have been destroyed.
This is unlikely to happen since the timer fires only if the test fails, and
tests never fail (at least in the way that one expects).
Fix by waiting for that future to complete before exiting the thread.
test_view_update_generator_register_semaphore_unit_leak uses a thread function
in do_with_cql_env(), even though the latter doesn't promise a thread and
accepts a regular function-returning-a-future. It happens to work because the
function happens to be called in a thread, but this isn't guaranteed.
Switch to do_with_cql_env, which guarantees a thread context.
sstable_manager will soon wait for all sstables under its
control to be deleted (if so marked), but that can't happen
if someone is holding on to references to those sstables.
To allow sstables_manager::stop() to work, drop remaining
queued work when terminating.
The script scripts/pull_github_pr.sh begins by fetching some information
from github, which can cause a noticable wait that the user doesn't
understand - so in this patch we add a couple of messages on what is
happening in the beginning of the script.
Moreover, if an invalid pull-request number is given, the script used
to give mysterious errors when incorrect commands ran using the name
"null" - in this patch we recognize this case and print a clear "Not Found"
error message.
Finally, the PR_REPO variable was never used, so this patch removes it.
Message-Id: <20200923151905.674565-1-nyh@scylladb.com>
Currently, scripts/pull_pr pollutes the local branch namespace
by creating a branch and never deleting it. This can be avoided
by using FETCH_HEAD, a temporary name automatically assigned by
git to fetches with no destination.
"
Currently all logical read operations are counted in a single pair of
metrics (successful/failed) located in the `database::db_stats`. This
prevents observing the number of reads executed against the
user/system
read semaphores. This distinction is especially interesting since
0c6bbc84c which selects the semaphore for each read based on the
scheduling group it is running under.
This mini series moves these counters into the semaphore and updates
the
exported metrics accordingly, the `total_reads` and
`total_reads_failed`
now has a user/system lable, just like the other semaphore dependent
metrics.
Tests: manual(checked that new metric works)
"
* 'per-semaphore-read-metrics/v2' of https://github.com/denesb/scylla:
database: move total_reads* metrics to the concurrency semaphore
database: setup_metrics(): split the registering database metrics in two
reader_concurrency_semaphore: add non-const stats accessor
reader_concurrency_semaphore: s/inactive_read_stats/stats/
Currently all "database" metrics are registered in a single call to
`metric_groups::add_group()`. As all the metrics to-be-registered are
passed in a single initializer list, this blows up the stack size, to
the point that adding a single new metric causes it to exceed the
currently configured max-stack-size of 13696 bytes. To reduce stack
usage, split the single call in two, roughly in the middle. While we
could try to come up with some logical grouping of metrics and do much
arranging and code-movement I think we might as well just split into two
arbitrary groups, containing roughly the same amount of metrics.
Commit 8e1f7d4fc7 ("dist/common/scripts: drop makedirs(), use
os.makedirs()") dropped the "mkdirs()" function, but forgot to convert
the caller in scylla_kernel_check to os.mkdirs().
Message-Id: <20200923104510.230244-1-penberg@scylladb.com>
In preparations of non-inactive read stats being added to the semaphore,
rename its existing stats struct and member to a more generic name.
Fields, whose name only made sense in the context of the old name are
adjusted accordingly.
As the test test_streams_closed_read confirmed, when a stream shard is
closed, GetRecords should not return a NextShardIterator at all.
Before this patch we wrongly returned an empty string for it.
Before this patch, several Alternator Stream tests (in test_streams.py)
failed when running against a multi-node Scylla cluster. The reason is as
follows: As a multi-node cluster boots and more and more nodes enter the
cluster, the cluster changes its mind about the token ownership, and
therefore the list of stream shards changes. By the time we have the full
cluster, a bunch of shards were created and closed without any data yet.
All the tests will see these closed shards, and need to understand them.
The fetch_more() utility function correctly assumed that a closed shard
does not return a NextShardIterator, and got confused by the empty string
we used to return.
Now that closed shards can return responses without NextShardIterator,
we also needed to fix in this patch a couple of tests which wrongly assumed
this can't happen. These tests did not fail on DynamoDB because unlike in
Scylla, DynamoDB does not have any closed shards in normal tests which
do not specifically cause them (only test_streams_closed_read).
We also need to fix test_streams_closed_read to get rid of an unnecessary
assumption: It currently assumes that when we read the very last item in
a closed shard is read, the end-of-shard is immediately signaled (i.e.,
NextShardIterator is not returned). Although DynamoDB does in fact do this,
it is also perfectly legal for Alternator's implementation to return the
last item with a new NextShardIterator - and only when the client reads
from that iterator, we finally return the signal the end of the shard.
Fixes#7237.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200922082529.511199-1-nyh@scylladb.com>
While many of these warnings are important, they can be addressed
at at a lower priority. In any case it will be easier to enforce
the warnings if/when we switch to clang.
The classes touche private data of each other for no real
reason. Putting the interaction behind API makes it easier
to track the usage.
* xemul/br-unfriends-in-row-cache-2:
row cache: Unfriend classes from each other
rows_entry: Move container/hooks types declarations
rows_entry: Simplify LRU unlink
mutation_partition: Define .replace_with method for rows_entry
mutation_partition: Use rows_entry::apply_monotonically
Merged pull request https://github.com/scylladb/scylla/pull/7264 by
Avi Kivity (29 commits):
This series fixes issues found by clang. Most are real issues that gcc just
doesn't find, a few are due to clang lagging behind on some C++ updates.
See individual explanations in patches.
The series is not sufficient to build with clang; it just addresses the
simple problems. Two larger problems remain: clang isn't able to compile
std::ranges (not clear yet whether this is a libstdc++ problem or a clang
problem) and clang can't capture structured binding variables (due to
lagging behind on the standard).
The motivation for building with clang is gaining access to a working
implementation of coroutines and modules.
This series compiles with gcc and the unit tests pass.
Currently in all cases we first deduct the to-be-consumed resources,
then construct the `reader_resources` class to protect it (release it on
destruction). This is error prone as it relies on no exception being
thrown while constructing the `reader_resources`. Albeit the
`reader_resources` constructor is `noexcept` right now this might change
in the future and as the call sites relying on this are disconnected
from the declaration, the one modifying them might not notice.
To make this safe going forward, make the `reader_resources` a true RAII
class, consuming the units in its constructor and releasing them in its
destructor.
Fixes: #7256
Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200922150625.1253798-1-bdenes@scylladb.com>
We have templates for multiprecision_int for both sides of the operator,
for example:
template <typename T>
bool operator==(const T& x) const
and
template <typename T>
friend bool operator==(const T& x, const multiprecision_int& y)
Clang considers them equally satisfying when both operands are
multiprecision_int, so provide a disambiguating overload.
Clang dislikes forward-declared functions returning auto, so declare the
type up front. Functions returning auto are a readability problem
anyway.
To solve a circular dependency problem (get_local_injector() ->
error_injection<> -> get_local_injector()), which is further compounded
by problems in using template specializations before they are defined
(which is forbidden), the storage for get_local_injector() was moved
to error_injection<>, and get_local_injector() is just an accessor.
After this, error_injection<> does not depend on get_local_injector().
new_sstable is defined as a template, and later used in a context
that requires an object. Somehow gcc uses an instantiation with
an empty template parameter list, but I don't think it's right,
and clang refuses.
Since the template is gratuitous anyway, just make it a regular
function.
This patch adds a test, test_streams_closed_read, which reproduces
two issues in Alternator Streams, regarding the behavior of *closed*
stream shards:
Refs #7239: After streaming is disabled, the stream should still be readable,
it's just that all its shards are now "closed".
Refs #7237: When reaching the end of a closed shard, NextShardIterator should
be missing. Not set to an empty string as we do today.
The test passes on DynamoDB, and xfails on Alterator, and should continue to
do so until both issues are fixed.
This patch changes the implementation of the disable_stream() function.
This function was never actually used by the existing code, and now that
I wanted to use it, I discovered it didn't work as expected and had to fix it.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200915134643.236273-1-nyh@scylladb.com>
The gossiper verbs are registered in two places -- start_gossiping
and do_shadow_round(). And unregistered in one -- stop_gossiping
iff the start took place. Respectively, there's a chance that after
a shadow round scylla exits without starting gossiping thus leaving
verbs armed.
Fix by unregistering verbs on stop if they are still registered.
fixes: #7262
tests: manual(start, abort start after shadow round), unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200921140357.24495-1-xemul@scylladb.com>
Before updating the _last_[cp]key (for subsequent .fetch_page())
the pager checks is 'if the pager is not exhausted OR the result
has data'.
The check seems broken: if the pager is not exhausted, but the
result is empty the call for keys will unconditionally try to
reference the last element from empty vector. The not exhausted
condition for empty result can happen if the short_read is set,
which, in turn, unconditionally happens upon meeting partition
end when visiting the partition with result builder.
The correct check should be 'if the pager is not exhausted AND
the result has data': the _last_[pc]key-s should be taken for
continuation (not exhausted), but can be taken if the result is
not empty (has data).
fixes: #7263
tests: unit(dev), but tests don't trigger this corner case
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200921124329.21209-1-xemul@scylladb.com>
The .get_last_partition_and_clustering_key() method gets
the last partition from the on-board vector of partitions.
The vector in question is assumed not to be empty, but if
this assumption breaks, the result will look like memory
corruption (docs say that accessing empty's vector back()
results in undefined behavior).
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200921122948.20585-1-xemul@scylladb.com>
Clang has a hard time dealing with single-element initializer lists. In this
case, adding an explicit conversion allows it to match the
initializer_list<data_value> parameter.
std::log() is not constexpr, so it cannot be assigned to a constexpr object.
Make it non-constexpr and automatic. The optimizer still figures out that it's
constant and optimizes it.
Found by clang. Apparently gcc only checks the expression is constant, not
constexpr.
Casting from the maximum int64_t to double loses precision, because
int64_t has 64 bits of precision while double has only 53. Clang
warns about it. Since it's not a real problem here, add an explicit
cast to silence the warning.
index_reader::index_bound must be constructible by non-friend classes
since it's used in std::optional (which isn't anyone's friend). This
now works in gcc because gcc's inter-template access checking is broken,
but clang correctly rejects it.
promoted_index_blocks_reader has a data member called "state", and a type member
called "state". Somehow gcc manages to disambiguate the two when used, but
clang doesn't. I believe clang is correct here, one member should subsume the other.
Change the type member to have a different name to disambiguate the two.
std::out-of-range does not have a default constructor, yet gcc somehow
accepts a no-argument construction. Clang (correctly) doesn't, so add
a parameter.
We use class template argument deduction (CTAD) in a few places, but it appears
not to work for alias templates in clang. While it looks like a clang bug, using
the class name is an improvement, so let's do that.
bool_class only has explicit conversion to bool, so an assignment such as
bool x = bool_class<foo>(true);
ought to fail. Somehow gcc allows it, but I believe clang is correct in
disallowing it.
Fix by using 'auto' to avoid the conversion.
Due to a bug, clang does not decay a type to a reference, failing
the concept evaluation on correct input. Add parentheses to force
it to decay the type.
atomic_cell_or_collection is also declared as a friend class further
down, and gcc appears to inject this friend declration into the
global namespace. Clang appears not to, and so complains when
atomic_cell_or_collection is mentioned in the declaration of
merge_column().
Add a forward declaration in the global namespace to satisfy clang.
We call ostringstream::view(), but that member doesn't exist. It
works because it is guarded by an #ifdef and the guard isn't satisified,
but if it is (as with clang) it doesn't compile. Remove it.
base64_chars() calls strlen() from a static_assert, but strlen() isn't
(and can't be) constexpr. gcc somehow allows it, but clang rightfully
complains.
Fix by using a character array and sizeof, instead of a pointer
and strlen().
The constraint on on cryptopp_hasher<>::impl is illegal, since it's not
on the base template. Convert it to a static_assert.
We could have moved it to the base template, but that would have undone
the work to push all the implementation details in .cc and reduce #include
load.
Found by clang.
While we need CryptoPP Update() functions not to throw, they aren't
marked noexcept.
Since there is no reason for them to throw, we'll just hope they
don't, and relax the requirement.
Found by clang. Apparently gcc didn't bother to check the constraint
here.
scylla_util.py becomes large, some code are unused, some code are only for specific script.
Drop unnecessary things, move non-common functions to caller script, make scylla_util.py simple.
Closes#7102
* syuu1228-refactor_scylla_util:
scylla_util.py: de-duplicate code on parse_scylla_dirs_with_default() and get_scylla_dirs()
scylla_util.py: remove rmtree() and redhat_version() since these are unused
dist/common/scripts: drop makedirs(), use os.makedirs()
dist/common/scripts: drop hex2list.py since it's nolonger used
dist/common/scripts: drop is_systemd() since we nolonger support non-systemd environment
dist/common/scripts: drop dist_name() and dist_ver()
dist/common/scripts: move functions that are only called from single file
Seems like parse_scylla_dirs_with_default() and get_scylla_dirs() shares most of
the code, de-duplicate it.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
scylla_util.py is a library for common functions across setup scripts,
it should not include private function of single file.
So move all those functions to caller file.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
* seastar dc06cd1f0f...9ae33e67e1 (9):
> expiring_fifo: mark methods noexcept
> chunked_fifo: mark methods noexcept
> circular_buffer: support stateful allocators
> net/posix-stack: fix sockaddr forward reference
> rpc: fix LZ4_DECODER_RING_BUFFER_SIZE not defined
> futures_test: fix max_concurrent_for_each concepts error
> core: limit memory size for each process to 64GB
> core/reactor_backend: kill unused nr_retry
> Merge "IO tracing" from Pavel E
"
Based on https://github.com/scylladb/scylla/issues/7199,
it looks like storage_service::set_tables_autocompaction
may be called on shards other than 0.
Use run_with_api_lock to both serialize the action
and to check _initialized on shard 0.
Fixes#7199
Test: unit(dev), compaction_test:TestCompaction_with_SizeTieredCompactionStrategy.disable_autocompaction_nodetool_test
"
* tag 'set_tables_autocompaction-v1' of github.com:bhalevy/scylla:
storage_service: set_tables_autocompaction: fixup indentation
storage_service: set_tables_autocompaction: run_with_api_lock
storage_service: set_tables_autocompaction: use do_with to hold on to args
storage_service: set_tables_autocompaction: log message in info level
We need to skip internet access on offline installation.
To do this we need following changes:
- prevent running yum/apt for each script
- set default "NO" for scripts it requires package installation
- set default "NO" for scripts it requires internet access, such as NTP
See #7153Closes#7224
* syuu1228-offline_setup:
dist/common/scripts: skip internet access on offline installation
scylla_ntp_setup: use shutil.witch() to lookup command
- Add necessary changes to `scylla_util.py` in order to support Scylla Machine Image in GCE.
- Fixes and improvements for `curl` function.
Closes#7080
* bentsi-gce-image-support:
scylla_util.py: added GCE instance/image support
scylla_util.py: make max_retries as a curl function argument
scylla_util.py: adding timeout to curl function
scylla_util.py: styling fixes to curl function
scylla_util.py: change default value for headers argument in curl function
Let's build scylla-unified-package.tar.gz in build/<mode>/dist/tar for
symmetry. The old location is still kept for backward compatibility for
now. Also document the new official artifact location.
Message-Id: <20200917071131.126098-1-penberg@scylladb.com>
"
... and make sure nothing is left. Whith the help of fresh seastar
this can be done quickly.
Before doing this check -- unregister remaining verbs in repair and
storage_service and fix tests not to register verbs, because they
are all local.
tests: unit(dev), manual
"
* 'br-messaging-service-stop-all' of https://github.com/xemul/scylla:
messaging_service: Report still registered services as errors
repair: Move CHECKSUM_RANGE verb into repair/
repair: Toss messaging init/uninit calls
storage_service: Uninit RPC verbs
test: Do not init messaging verbs
On stop -- unregister the CLIENT_ID verb, which is registerd
in constructor, then check for any remaining ones.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The verb is sent by repair code, so it should be registered
in the same place, not in main. Also -- the verb should be
unregistered on stop.
The global messaging service instance is made similarly to the
row-level one, as there's no ready to use repair service.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There goal is to make it possible to reg/unreg not only row-level
verbs. While at it -- equip the init call with sharded<database>&
argument, it will be needed by the next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This patch adds a "--list" option to test.py that shows a list of tests
instead of executing them. This is useful for people and scripts, which
need to discover the tests that will be run. For example, when Jenkins
needs to store failing tests, it can use "test.py --list" to figure out
what to archive.
Message-Id: <20200916135714.89350-1-penberg@scylladb.com>
And move-assign to _pending_ranges_interval_map[keyspace_name]
only when done.
This is more effient since there's no need to look up
_pending_ranges_interval_map[keyspace_name] for every insert to the
interval_map.
And it is exception safe in case we run out of memory mid-way.
Refs #7220
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200916115059.788606-1-bhalevy@scylladb.com>
This patch fixes a bug noted in issue #7218 - where PutItem operations
sometimes lose part of the item's data - some attributes were lost,
and the name of other attributes replaced by empty strings. The problem
happened when the write-isolation policy was LWT and there was contention
of writes to the same partition (not necessarily the same item).
To use CAS (a.k.a. LWT), Alternator builds an alternator::rmw_operation
object with an apply() function which takes the old contents of the item
(if needed) and a timestamp, and builds a mutation that the CAS should
apply. In the case of the PutItem operation, we wrongly assumed that apply()
will be called only once - so as an optimization the strings saved in the
put_item_operation were moved into the returned mutation. But this
optimization is wrong - when there is contention, apply() may be called
again when the changed proposed by the previous one was not accepted by
the Paxos protocol.
The fix is to change the one place where put_item_operation *moved* strings
out of the saved operations into the mutations, to be a copy. But to prevent
this sort of bug from reoccuring in future code, this patch enlists the
compiler to help us verify that it can't happen: The apply() function is
marked "const" - it can use the information in the operation to build the
mutation, but it can never modify this information or move things out of it,
so it will be fine to call this function twice.
The single output field that apply() does write (_return_attributes) is
marked "mutable" to allow the const apply() to write to it anyway. Because
apply() might be called twice, it is important that if some apply()
implementation sometimes sets _return_attributes, then it must always
set it (even if to the default, empty, value) on every call to apply().
The const apply() means that the compiler verfies for us that I didn't
forget to fix additional wrong std::move()s. Additionally, a test I wrote
to easily reproduce issue #7218 (which I will submit as a dtest later)
passes after this fix.
Fixes#7218.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200916064906.333420-1-nyh@scylladb.com>
Add a "Closes #$PR_NUM" annotation at the end of the commit
message to tell github to close the pull request, preventing
manual work and/or dangling pull requests.
Closes#7245
"
This series follows the suggestion from https://github.com/scylladb/scylla/pull/7203#issuecomment-689499773 discussion and deprecates a number of cluster features. The deprecation does not remove any features from the strings sent via gossip to other nodes, but it removes all checks for these features from code, assuming that the checks are always true. This assumption is quite safe for features introduced over 2 years ago, because the official upgrade path only allows upgrading from a previous official release, and these feature bits were introduced many release cycles ago.
All deprecated features were picked from a `git blame` output which indicated that they come from 2018:
```git
e46537b7d3 2016-05-31 11:44:17 +0200 RANGE_TOMBSTONES_FEATURE = "RANGE_TOMBSTONES";
85c092c56c 2016-07-11 10:59:40 +0100 LARGE_PARTITIONS_FEATURE = "LARGE_PARTITIONS";
02bc0d2ab3 2016-12-09 22:09:30 +0100 MATERIALIZED_VIEWS_FEATURE = "MATERIALIZED_VIEWS";
67ca6959bd 2017-01-30 19:50:13 +0000 COUNTERS_FEATURE = "COUNTERS";
815c91a1b8 2017-04-12 10:14:38 +0300 INDEXES_FEATURE = "INDEXES";
d2a2a6d471 2017-08-03 10:53:22 +0300 DIGEST_MULTIPARTITION_READ_FEATURE = "DIGEST_MULTIPARTITION_READ";
ecd2bf128b 2017-09-01 09:55:02 +0100 CORRECT_COUNTER_ORDER_FEATURE = "CORRECT_COUNTER_ORDER";
713d75fd51 2017-09-14 19:15:41 +0200 SCHEMA_TABLES_V3 = "SCHEMA_TABLES_V3";
2f513514cc 2017-11-29 11:57:09 +0000 CORRECT_NON_COMPOUND_RANGE_TOMBSTONES = "CORRECT_NON_COMPOUND_RANGE_TOMBSTONES";
0be3bd383b 2017-12-04 13:55:36 +0200 WRITE_FAILURE_REPLY_FEATURE = "WRITE_FAILURE_REPLY";
0bab3e59c2 2017-11-30 00:16:34 +0000 XXHASH_FEATURE = "XXHASH";
fbc97626c4 2018-01-14 21:28:58 -0500 ROLES_FEATURE = "ROLES";
802be72ca6 2018-03-18 06:25:52 +0100 LA_SSTABLE_FEATURE = "LA_SSTABLE_FORMAT";
71e22fe981 2018-05-25 10:37:54 +0800 STREAM_WITH_RPC_STREAM = "STREAM_WITH_RPC_STREAM";
```
Tests: unit(dev)
manual(verifying with cqlsh that the feature strings are indeed still set)
"
Closes#7234.
* psarna-clean_up_features:
gms: add comments for deprecated features
gms: remove unused feature bits
streaming: drop checks for RPC stream support
roles: drop checks for roles schema support
service: drop checks for xxhash support
service: drop checks for write failure reply support
sstables: drop checks for non-compound range tombstones support
service: drop checks for v3 schema support
repair: drop checks for large partitions support
service: drop checks for digest multipartition read support
sstables: drop checks for correct counter order support
cql3: drop checks for materialized views support
cql3: drop checks for counters support
cql3: drop checks for indexing support
We need to skip internet access on offline installation.
To do this we need following changes:
- prevent running yum/apt for each script
- set default "NO" for scripts it requires package installation
- set default "NO" for scripts it requires internet access, such as NTP
See #7153Fixes#7182
The command installed directory may different between distributions,
we can abstract the difference using shutil.witch().
Also the script become simpler than passing full path to os.path.exists().
Consider an unpaged query that consumes all of available memory, despite
fea5067dfa which limits them (perhaps the
user raised the limit, or this is a system query). Eventually we will see a
bad_alloc which will abort the query and destroy this reconcilable_result_builder.
During destruction, we first destroy _memory_accounter, and then _result.
Destroying _memory_accounter resumes some continuations which can then
allocate memory synchronously when increasing the task queue to accomodate
them. We will then crash. Had we not crashed, we would immediately afterwards
release _result, freeing all the memory that we would ever need.
Fix by making _result the last member, so it is freed first.
Fixes#7240.
Introduce new database config option `schema_registry_grace_period`
describing the amount of time in seconds after which unused schema
versions will be cleaned up from the schema registry cache.
Default value is 1 second, the same value as was hardcoded before.
Tests: unit(debug)
Refs: #7225
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200915131957.446455-1-pa.solodovnikov@scylladb.com>
The set contains 3 small optimizations:
- avoid copying of partition key on lookup path
- reduce number of args carried around when creating a new entry
- save one partition key comparison on reader creation
Plus related satellite cleanups.
* https://github.com/xemul/scylla/tree/br-row-cache-less-copies:
row_cache: Revive do_find_or_create_entry concepts
populating reader: Do not copy decorated key too early
populating reader: Less allocator switching on population
populating reader: Fix indentation after previous patch
row_cache: Move missing entry creation into helper
test: Lookup an existing entry with its own helper
row_cache: Do not copy partition tombstone when creating cache entry
row_cache: Kill incomplete_tag
row_cache: Save one key compare on direct hit
"
This PR removes _pending_ranges and _pending_ranges_map in token_metadata.
This removal of makes copying of token_metadata faster and reduces the chance to cause reactor stall.
Refs: #7220
"
* asias-token_metadata_replication_config_less_maps:
token_metadata: Remove _pending_ranges
token_metadata: Get rid of unused _pending_ranges_map
from Asias.
This series follows "repair: Add progress metrics for node ops #6842"
and adds the metrics for the remaining node operations,
i.e., replace, decommission and removenode.
Fixes#1244, #6733
* asias-repair_progress_metrics_replace_decomm_removenode:
repair: Add progress metrics for removenode ops
repair: Add progress metrics for decommission ops
repair: Add progress metrics for replace ops
Change 94995acedb added yielding to abstract_replication_strategy::do_get_ranges.
And 07e253542d used get_ranges_in_thread in compaction_manager.
However, there is nothing to prevent token_metadata, and in particular its
`_sorted_tokens` from changing while iterating over them in do_get_ranges if the latter yields.
Therefore copy the the replication strategy `_token_metadata` in `get_ranges_in_thread(inet_address ep)`.
If the caller provides `token_metadata` to get_ranges_in_thread, then the caller
must make sure that we can safely yield while accessing token_metadata (like
in `do_rebuild_replace_with_repair`).
Fixes#7044
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200915074555.431088-1-bhalevy@scylladb.com>
- Remove get_pending_ranges and introduce has_pending_ranges, since the
caller only needs to know if there is a pending range for the keyspace
and the node.
- Remove print_pending_ranges which is only used in logging. If we
really want to log the new pending token ranges, we can log when we
set the new pending token ranges.
This removal of _pending_ranges makes copying of token_metadata faster
and reduces the chance to cause reactor stall.
Refs: #7220
"
The range_tombstone_list provides an abstraction to work with
sorted list of range tombstones with methods to add/retrive
them. However, there's a tombstones() method that just returns
modifiable reference to the used collection (boost::intrusive_set)
which makes it hard to track the exact usage of it.
This set encapsulates the collaction of range tombstones inside
the mentioned ..._list class.
tests: unit(dev)
"
* 'br-range-tombstone-encapsulate-collection' of https://github.com/xemul/scylla:
range_tombstone_list: Do not expose internal collection
range_tombstone_list: Introduce and use pop-and-lock helper
range_tombstone_list: Introduce and use pop_as<>()
flat_mutation_reader: Use range_tombstone_list begin/end API
repair: Mark some partition_hasher methods noexcept
hashers: Mark hash updates noexcept
The histogram constructor has a `counts` parameter defaulted to
`defaultdict(int)`. Due to how default argument values work in
python -- the same value is passed to all invocations -- this results in
all histogram instances sharing the same underlying counts dict. Solve
it the way this is usually solved -- default the parameter to `None` and
when it is `None` create a new instance of `defaultdict(int)` local to
the histogram instance under construction.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200908142355.1263568-1-bdenes@scylladb.com>
Currently blobs are converted to python bytes objects and printed by
simply converting them to string. This results in hard to read blobs as
the bytes' __str__() attempts to interpret the data as a printable
string. This patch changes this to use bytes.hex() which prints blobs in
hex format. This is much more readable and it is also the format that
scylla uses when printing blobs.
Also the conversion to bytes is made more efficient by using gdb's
gdb.inferior.read_memory() function to read the data.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200911085439.1461882-1-bdenes@scylladb.com>
The patch extracts a little helper function that populates
a schema_mutation with various column information.
Will be used in a subsequent patch to serialize column mappings.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
This helper function is similar to the ordinary `unfreeze` of
`frozen_mutation` but in addition to the schema_ptr supplies a
custom column_mapping which is being used when upgrading the
mutation.
Needed for a subsequent patch regarding column mappings history.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
hugepages and libhugetlbfs-bin packages is only required for DPDK mode,
and unconditionally installation causes error on offline mode, so drop it.
Fixes#7182
data::cell targets 8KB as its maximum allocations size to avoid
pressuring the allocator. This 8KB target is used for internal storage
-- values small enough to be stored inside the cell itself -- as well
for external storage. Externally stored values use 8KB fragment sizes.
The problem is that only the size of data itself was considered when
making the allocations. For example when allocating the fragments
(chunks) for external storage, each fragment stored 8KB of data. But
fragments have overhead, they have next and back pointers. This resulted
in a 8KB + 2 * sizeof(void*) allocation. IMR uses the allocation
strategy mechanism, which works with aligned allocations. As the seastar
allocation only guarantees aligned allocations for power of two sizes,
it ends up allocating a 16KB slot. This results in the mutation fragment
using almost twice as much memory as would be required. This is a huge
waste.
This patch fixes the problem by considering the overhead of both
internal and external storage ensuring allocations are 8KB or less.
Fixes: #6043
Tests: unit(debug, dev, release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200910171359.1438029-1-bdenes@scylladb.com>
Instead of using the default hasher, hasing specializations should
use the hasher type they were specialized for. It's not a correctness
issue now because the default hasher (xx_hasher) is compatible with
its predecessor (legacy_xx_hasher_without_null_digest), but it's better
to be future-proof and use the correct type in case we ever change the
default hasher in a backward-incompatible way.
Message-Id: <c84ce569d12d9b4f247fb2717efa10dc2dabd75b.1600074632.git.sarna@scylladb.com>
Based on https://github.com/scylladb/scylla/issues/7199,
it looks like storage_service::set_tables_autocompaction
may be called on shards other than 0.
Use run_with_api_lock to both serialize the action
and to check _initialized on shard 0.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
In preparation to calling is_initialized() which may yield.
Plus, the way the tables vector is currently captured is inefficient.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Checks for features introduced over 2 years ago were removed
in previous commits, so all that is left is removing the feature
bits itself. Note that the feature strings are still sent
to other nodes just to be double sure, but the new code assumes
that all these features are implicitly enabled.
Streaming with RPC stream is supported for over 2 years and upgrades
are only allowed from versions which already have the support,
so the checks are hereby dropped.
The reader fills up the buffer upon construction, which is not what other
readers do, and is considered to be waste of cycles, as the reader can be
dropped early.
Refs #1671
test: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200910171134.11287-2-xemul@scylladb.com>
xxhash algorithm is supported for over 2 years and upgrades are only
allowed from versions which already have the support, so the checks
are hereby dropped.
Write failure reply is supported for over 2 years and upgrades are only
allowed from versions which already have the support, so the checks
are hereby dropped.
Correct non-compound range tombstones are supported for over 2 years
and upgrades are only allowed from versions which already have the
support, so the checks are hereby dropped.
Large partitions are supported for over 2 years and upgrades are only
allowed from versions which already have the support, so the checks
are hereby dropped.
Digest multipartition read is supported for over 2 years and upgrades are only
allowed from versions which already have the support, so the checks
are hereby dropped.
Correct counter order is supported for over 2 years and upgrades are only
allowed from versions which already have the support, so the checks
are hereby dropped.
"
Migration manager installs several cluster feature change listeners.
The listeners will call update_schema_version_and_announce() when cluster
features are enabled, which does this:
return update_schema_version(proxy, features).then([] (utils::UUID uuid) {
return announce_schema_version(uuid);
});
It first updates the schema version and then publishes it via
gossip in announce_schema_version(). It is possible that the
announce_schema_version() part of the first schema change will be
deferred and will execute after the other four calls to
update_schema_version_and_announce(). It will install the old schema
version in gossip instead of the more recent one.
The fix is to serialize schema digest calculation and publishing.
Refs #7200
This problem also brought my attention to initialization code, which could be
prone to the same problem.
The storage service computes gossiper states before it starts the
gossiper. Among them, node's schema version. There are two problems with that.
First is that computing the schema version and publishing it is not
atomic, so is not safe against concurrent schema changes or schema
version recalculations. It will not exclude with
recalculate_schema_version() calls, and we could end up with the old
(and incorrect) schema version being advertised in gossip.
Second problem is that we should not allow the database layer to call
into the gossiper layer before it is fully initialized, as this may
produce undefined behavior.
Maybe we're not doing concurrent schema changes/recalculations now,
but it is easy to imagine that this could change for whatever reason
in the future.
The solution for both problems is to break the cyclic dependency
between the database layer and the storage_service layer by having the
database layer not use the gossiper at all. The database layer
publishes schema version inside the database class and allows
installing listeners on changes. The storage_service layer asks the
database layer for the current version when it initializes, and only
after that installs a listener which will update the gossiper.
Tests:
- unit (dev)
- manual (3 node ccm)
"
* tag 'fix-schema-digest-calculation-race-v1' of github.com:tgrabiec/scylla:
db, schema: Hide update_schema_version_and_announce()
db, storage_service: Do not call into gossiper from the database layer
db: Make schema version observable
utils: updateable_value_source: Introduce as_observable()
schema: Fix race in schema version recalculation leading to stale schema version in gossip
"
This pull request fixes unified relocatable package dependency issues in
other build modes than release, and then adds unified tarball to the
"dist" build target.
Fixes#6949
"
* 'penberg/build/unified-to-dist/v1' of github.com:penberg/scylla:
configure.py: Build unified tarball as part of "dist" target
unified/build_unified: Use build/<mode>/dist/tar for dependency tarballs
configure.py: Use build/<mode>/dist/tar for unified tarball dependencies
This patch is a proposal for the removal of the redis
classes describing the commands. 'prepare' and 'execute'
class functions have been merged into a function with
the name of the command.
Note: 'command_factory' still needs to be simplified.
Signed-off-by: Etienne Adam <etienne.adam@gmail.com>
Message-Id: <20200913183315.9437-1-etienne.adam@gmail.com>
it's failing as so:
Python Exception <class 'TypeError'> unsupported operand type(s) for +: 'int' and 'str':
it's a regression caused by e4d06a3bbf.
_mask() should use the ref stored in the ctor to dereference _impl.
Fixes#7058.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200908154342.26264-1-raphaelsc@scylladb.com>
Define container types near the containing elements' hook
members, so that they could be private without the need
to friend classes with each other.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The cache_tracker tries to access private member of the
rows_entry to unlink it, but the lru_type is auto_unlink
and can unlink itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The storage service computes gossiper states before it starts the
gossiper. Among them, node's schema version. There are two problems with that.
First is that computing the schema version and publishing it is not
atomic, so is not safe against concurrent schema changes or schema
version recalculations. It will not exclude with
recalculate_schema_version() calls, and we could end up with the old
(and incorrect) schema version being advertised in gossip.
Second problem is that we should not allow the database layer to call
into the gossiper layer before it is fully initialized, as this may
produce undefined behavior.
The solution for both problems is to break the cyclic dependency
between the database layer and the storage_service layer by having the
database layer not use the gossiper at all. The database layer
publishes schema version inside the database class and allows
installing listeners on changes. The storage_service layer asks the
database layer for the current version when it initializes, and only
after that installs a listener which will update the gossiper.
This also allows us to drop unsafe functions like update_schema_version().
Migration manager installs several feature change listeners:
if (this_shard_id() == 0) {
_feature_listeners.push_back(_feat.cluster_supports_view_virtual_columns().when_enabled(update_schema));
_feature_listeners.push_back(_feat.cluster_supports_digest_insensitive_to_expiry().when_enabled(update_schema));
_feature_listeners.push_back(_feat.cluster_supports_cdc().when_enabled(update_schema));
_feature_listeners.push_back(_feat.cluster_supports_per_table_partitioners().when_enabled(update_schema));
}
They will call update_schema_version_and_announce() when features are enabled, which does this:
return update_schema_version(proxy, features).then([] (utils::UUID uuid) {
return announce_schema_version(uuid);
});
So it first updates the schema version and then publishes it via
gossip in announce_schema_version(). It is possible that the
announce_schema_version() part of the first schema change will be
deferred and will execute after the other four calls to
update_schema_version_and_announce(). It will install the old schema
version in gossip instead of the more recent one.
The fix is to serialize schema digest calculation and publishing.
Refs #7200
It causes gdb to print UUIDs like this:
"a3eadd80-f2a7-11ea-853c-", '0' <repeats 11 times>, "4"
This is quite hard to read, let's drop the string display hint, so they
are displayed like this:
a3eadd80-f2a7-11ea-853c-000000000004
Much better. Also technically UUID is a 128 bit integer anyway, not a
string.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200911090135.1463099-1-bdenes@scylladb.com>
The build_unified.sh script has the same bug as configure.py had: it
looks for the python tarball in
build/<mode>/scylla-python3-package.tar.gz, but it's never generated
there. Fix up the problem by using build/<mode>/dist/tar location for
all dependency tarballs.
The build target for scylla-unified-package.tar.gz incorrectly depends
on "build/<mode>/scylla-python3-package.tar.gz", which is never
generated. Instead, the package is either generated in
"build/release/scylla-python3-package.tar.gz" (for legacy reasons) or
"build/<mode>/dist/tar/scylla-python3-package.tar.gz". This issues
causes building unified package in other modes to fail.
To solve the problem, let's switch to using the "build/<mode>/dist/tar"
locations for unified tarball dependencies, which is the correct place
to use anyway.
When the test suite is run with Scylla serving in HTTPS mode, using
test/alternator/run --https, two Alternator Streams tests failed.
With this patch fixing a bug in the test, the tests pass.
The bug was in the is_local_java() function which was supposed to detect
DynamoDB Local (which behaves in some things differently from the real
DynamoDB). When that detection code makes an HTTPS request and does not
disable checking the server's certificate (which on Alternator is
self-signed), the request fails - but not in the way that the code expected.
So we need to fix the is_local_java() to allow the failure mode of the
self-signed certificate. Anyway, this case is *not* DynamoDB Local so
the detection function would return false.
Fixes#7214
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200910194738.125263-1-nyh@scylladb.com>
This patch adds regression tests for four recently-fixed issues which did not yet
have tests:
Refs #7157 (LatestStreamArn)
Refs #7158 (SequenceNumber should be numeric)
Refs #7162 (LatestStreamLabel)
Refs #7163 (StreamSpecification)
I verified that all the new tests failed before these issues were fixed, but
now pass.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200907155334.562844-1-nyh@scylladb.com>
"
This series fixes a bug in `appending_hash<row>` that caused it to ignore any cells after the first NULL. It also adds a cluster feature which starts using the new hashing only after the whole cluster is aware of it. The series comes with tests, which reproduce the issue.
Fixes#4567
Based on #4574
"
* psarna-fix_ignoring_cells_after_null_in_appending_hash:
test: extend mutation_test for NULL values
tests/mutation: add reproducer for #4567
gms: add a cluster feature for fixed hashing
digest: add null values to row digest
mutation_partition: fix formatting
appending_hash<row>: make publicly visible
The test is extended for another possible corner case:
[1, NULL, 2] vs [1, 2, NULL] should have different digests.
Also, a check for legacy behavior is added.
With the new hashing routine, null values are taken into account
when computing row digest. Previous behavior had a regression
which stopped computing the hash after the first null value
is encountered, but the original behavior was also prone
to errors - e.g. row [1, NULL, 2] was not distinguishable
from [1, 2, NULL], because their hashes were identical.
This hashing is not yet active - it will only be used after
the next commit introduces a proper cluster feature for it.
appending_hash<row> specialisation is declared and defined in a *.cc file
which means it cannot have a dedicated unit test. This patch moves the
declaration to the corresponding *.hh file.
There was a typo in get_column_defs_for_filtering(): it checked the
wrong pointer before dereferencing. Add a test exposing the NULL
dereference and fix the typo.
Tests: unit (dev)
Fixes#7198.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
In commit 7d86a3b208 (storage_service:
Make replacing node take writes), application state of TOKENS of the
replacing node is added into gossip and propagated to the cluster after
the initial start of gossip service. This can cause a race below
1. The replacing node replaces the old dead node with the same ip address
2. The replacing node starts gossip without application state of the TOKENS
3. Other nodes in the cluster replace the application states of old dead node's
version with the new replacing node's version
4. replacing node dies
5. replace operation is performed again, the TOKENS application state is
not preset and replace operation fails.
To fix, we can always add TOKENS application state when the
gossip service starts.
Fixes: #7166
Backports: 4.1 and 4.2
The clustering_row class looks as a decorated deletable_row, but
re-implements all its logic (and members). Embed the deletable_row
into clustering_row and keep the non-static row logic in one
class instead of two.
We don't want to update scylla-python3 submodule for every python3 dependency
update, bring python3 package list to python3-dependencies.txt, pass it on
package building time.
See #6702
See scylladb/scylla-python3#6
[avi: add
* tools/python3 19a9cd3...b4e52ee (1):
> Allow specify package dependency list by --packages
to maintain bisectability]
* seastar 4ff91c4c3a...52f0f38994 (4):
> rpc: Return actual chosen compressor in server reponse - not all avail
Fixes#6925.
> net/tls: fix compilation guards around sec_param().
> future: improved printing of seastar::nested_exception
> merge: http: fix issues with the request parser and testing
When configuration files for perftune contain invalid parameter, scylla_prepare
may cause traceback because error handling is not eough.
Throw all errors from create_perftune_conf(), catch them on scylla_prepare,
print user understandable error.
Fixes#6847
The clustering_row is deletable_row + clustering_key, all
its internals work exactly as the relevant deletable_row's
ones.
The similar relation is between static_row and row, and
the former wrapes the latter, so here's the same trick
for the non-static row classes.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The deletable_row accepts clustering_row in constructor and
.apply() method. The next patch will make clustering_row
embed the deletable_row inside, so those two methods will
violate layering and should be fixed in advance.
The fix is in providing a clustering_row method to convert
itself into a deletable_row. There are two places that need
this: mutation_fragment_applier and partition_snapshot_row_cursor.
Both methods pass temporary clustering_row value, so the
method in question is also move-converter.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
check_and_repair_cdc_streams, in case it decides to create a new CDC
generation, updates the STATUS application state so that other nodes
gossiped with pick up the generation change.
The node which runs check_and_repair_cdc_streams also learns about a
generation change: the STATUS update causes a notification change.
This happens during add_local_application_state call
which caused the STATUS update; it would lead to calling
handle_cdc_generation, which detects a generation change and calls
add_local_application_state with the new generation's timestamp.
Thus, we get a recursive add_local_application_state call. Unforunately,
the function takes a lock before doing on_change notifications, so we
get a deadlock.
This commit prevents the deadlock.
We update the local variable which stores the generation timestamp
before updating STATUS, so handle_cdc_generation won't consider
the observed generation to be new, hence it won't perform the recursive
add_local_application_state call.
Hints writes are handled by storage_proxy in the exact same way
regular writes are, which in turn means that the same smp service
group is used for both. The problem is that it can lead to a priority
inversion where writes of the lower priority kind occupies a lot of
the semaphores units making the higher priority writes wait for an
empty slot.
This series adds a separate smp group for hints as well as a field
to pass the correct smp group to mutate_locally functions, and
then uses this field to properly classify the writes.
Fixes#7177
* eliransin-hint_priority_inversion:
Storage proxy: use hints smp group in mutate locally
Storage proxy: add a dedicated smp group for hints
Changing a user type may allow adding apparently duplicate rows to
tables where this type is used in a partitioning key. Fix by checking
all types of existing partitioning columns before allowing to add new
fields to the type.
Fixes#6941
The log-structured allocator (LSA) reserves memory when performing
operations, since its operations are performed with reclaiming disabled
and if it runs out, it cannot evict cache to gain more. The amount of
memory to reserve is remembered across calls so that it does not have
to repeat the fail/increase-reserve/retry cycle for every operation.
However, we currently lack decaying the amount to reserve. This means
that if a single operation increased the reserve in the distant past,
all current operations also require this large reserve. Large reserves
are expensive since they can cause large amounts of cache to be evicted.
This patch adds reserve decay. The time-to-decay is inversely proportional
to reserve size: 10GB/reserve. This means that a 20MB reserve is halved
after 500 operations (10GB/20MB) while a 20kB reserve is halved after
500,000 operations (10GB/20kB). So large, expensive reserves are decayed
quickly while small, inexpensive reserves are decayed slowly to reduce
the risk of allocation failures and exceptions.
A unit test is added.
Fixes#325.
This patch adds support for 2 hash commands HDEL and HGETALL.
Internally it introduces the hashes_result_builder class to
read hashes and stored them in a std::map.
Other changes:
- one exception return string was fixed
- tests now use pytest.raises
Signed-off-by: Etienne Adam <etienne.adam@gmail.com>
Message-Id: <20200907202528.4985-1-etienne.adam@gmail.com>
We are using mutate_locally to handle hint mutations that arrived
through RPC. The current implementation makes no distinction whether
the mutation came through hint verb or a mutation verb resulting in
using the same smp group for both. This commit adds the ability to
reference different smp group in mutate_locally private calls and
makes the handlers pass the correct smp group to mutate_locally.
Due to #7175, microseconds are stored in a db_clock::time_point
as if they were milliseconds.
std::chrono::duration_cast<std::chrono::nanoseconds> may cause overflow
and end up with invalid/negative nanos.
This change specializes time_point_to_string to std::chrono::milliseconds
since it's currently only called to print db_clock::time_point
and uses boost::posix_time::milliseconds to print the count.
This would generate an exception in today's time stamps
and the output will look like:
1599493018559873 milliseconds (Year is out of valid range: 1400..9999)
instead of:
1799-07-16T19:57:52.175010
It is preferrable to print the numeric value annotated as out of valid range
than to print a bogus date in the past.
Test: unit(dev), commitlog_test:TestCommitLog.test_mixed_mode_commitlog_same_partition_smp_1
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200907162845.147477-1-bhalevy@scylladb.com>
Now all work with the list is described as API calls, it's
finally possible to stop exposing the boost::set outside
the class.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's an optimization in flat_mutation_reader_from_mutations
that folds the list from left-to-right in linear time. In case
of currently used boost::set the .unlink_leftmost_without_rebalance
helper is used, so wrap this exception with a method of the
range_tombstone_list. This is the last place where caller need
to mess with the exact internal collection.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method extracts an element from the list, constructs
a desired object from it and frees. This is common usage
of range_tombstone_list. Having a helper helps encapsulating
the exact collection inside the class.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The goal is to stop revealing the exact collection from
the range_tombstone_list, so make use of existing begin/end
methods and extend with rbegin() where needed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The net patch will change the way range tombstones are
fed into hasher. To make sure the codeflow doesn't
become exception-unsafe, mark the relevant methods as
nont-throwing.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All those methods end up with library calls, whose code
is not marked noexcept, but is such according to code
itself or docs.
The primary goal is to make some repair partition_hasher
methods noexcept (next patch).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Merged pull request https://github.com/scylladb/scylla/pull/7160
By Calle Wilund:
Stream descriptions, as returned from create, update and describe stream was missing "latest" stream arn.
Shard descriptions sequence number (for us timeuuid:s) were formatted wrongly. The spec states they should be numeric only.
Both these defects break Kinesis operations.
alternator: Set CDC delta to keys only for alternator streams
alternator: Include stream spec in desc for create/update/describe
alternator: Include LatestStreamLabel in resulting desc for create/update table
alternator: Make "StreamLabel" an iso8601 timestamp
alternator: Alloc BILLING_MODE in update_table
cdc: Add setter for delta mode
alternator: Fix sequence number range using wrong format
alternator: Include stream arn in table description if enabled
Add new validate_with_error_position function
which returns -1 if data is a valid UTF-8 string
or otherwise a byte position of first invalid
character. The position is added to exception
messages of all UTF-8 parsing errors in Scylla.
validate_with_error_position is done in two
passes in order to preserve the same performance
in common case when the string is valid.
Fixes#7190
Since we don't use any delta value when translating cdc -> streams
it is wasteful to write these to the log table, esp. since we already
write big fat pre- and post images.
Fixes#7163
If enabled, the resulting table description should include a
StreamDescription object with the appropriate members describing
current stream settings.
Previous patches removed those `BOOST_REQUIRE*` macros that could be
invoked from shards other than 0. The reason is that said macros are not
thread-safe, so calling them from multiple shards produces mangled
output to stdout as well as the XML report file. It was assumed that
only these invocations -- from a non-0 shard -- are problematic, but it
turns out even these can race with seastar log messages emitted from
other shards. This patch removes all such macros, replacing them with
the thread safe `require*` functions from `test/lib/test_utils.hh`.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200907125309.1199104-1-bdenes@scylladb.com>
Hints and regular writes currently uses the same cross shard
operation semaphore, which can lead to priority inversion, making
cross shard writes wait for cross shard hints. This commit adds
an smp_service_group for hints and adds it usage in the mutate_hint
function.
Fixes#7158
A streams shard descriptions has a sequence range describing start/end
(if available) of the shard. This is specified as being "numeric only".
Alternator incorrectly used UUID here, which breaks kinesis.
v2:
* Fix uint128_t parsing from string. bmp::number constructor accepted
sstring, but did not interpret it as std::string/chars. Weird results.
As seen in https://github.com/scylladb/scylla/issues/7175,
1e676cd845
that was merged in bc77939ada
exposed a preexisting problem in time_point_to_string
where it tried printing a timestamp that was in microseconds
(taken from an api::timestamp_type instead of db_clock::time_point)
and hit `boost::wrapexcept<boost::gregorian::bad_year> (Year is out of valid range: 1400..9999)`
If hit, this patch with print the offending time_stamp in
nanoseconds and the error message.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200907083303.33229-1-bhalevy@scylladb.com>
Fixes#7157
When creating/altering/describing a table, if streams are enabled, the
"latest active" stream arn should be included as LatestStreamArn.
Not doing so breaks java kinesis.
This improves the build documentation beyond just packaging:
- Explain how the configure.py step works
- Explain how to build just executables and tests (for development)
- Explain how to build for specific build mode if you didn't specify a
build mode in configure.py step
- Fix build artifact locations, missing .debs, and add executables and
tests
Message-Id: <20200904084443.495137-1-penberg@iki.fi>
T& tp may have other period than milliseconds.
Cast the time_point duration to nanoseconds (or microseconds
if boost doesn't supports it) so it is printed in the best possible
resolution.
Note that we presume that the time_point epoch is the
Unix epoch of 1970-01-01, but the c++ standard doesn't guwarntee
that. See https://github.com/scylladb/scylla/issues/5498
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200906171106.690872-1-bhalevy@scylladb.com>
Documentation states that `SCYLLA_LWT_OPTIMIZATION_META_BIT_MASK`
is a 32-bit integer that represents bit mask. What it fails to mention
is that it's a unsigned value and in fact it takes value of 2147483648.
This is problematic for clients in languages that don't have unsigned
types (like Java).
This patch improves the documentation to make it clear that
`SCYLLA_LWT_OPTIMIZATION_META_BIT_MASK` is represented by unsigned
value.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <7166b736461ae6f3d8ffdf5733e810a82aa02abc.1599382184.git.piotr@scylladb.com>
statement_restrictions::process_partition_key_restrictions() was
checking has_unrestricted_components(), whereas just an empty() check
suffices there, because has_unrestricted_components() is implicitly
checked five lines down by needs_filtering().
The replacement check is cheaper and simpler to understand.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Previously batch statement result set included rows for only
those updates which have a prefetch data present (i.e. there
was an "old" (pre-existing) row for a key).
Also, these rows were sorted not in the order in which statements
appear in the batch, but in the order of updated clustering keys.
If we have a batch which updates a few non-existent keys, then
it's impossible to figure out which update inserted a new key
by looking at the query response. Not only because the responses
may not correspond to the order of statements in the batch, but
even some rows may not show up in the result set at all.
Please see #7113 on Github for detailed description
of the problem:
https://github.com/scylladb/scylla/issues/7113
The patch set proposes the following fix:
For conditional batch statements the result set now always
includes a row for each LWT statement, in the same order
in which individual statements appear in the batch.
This way we can always tell which update did actually insert
a new key or update the existing one.
Technically, the following changes were made:
* `update_parameters::prefetch_data::row::is_in_cas_result_set`
member removed as well as the supporting code in
`cas_request::applies_to` which iterated through cas updates
and marked individual `prefetch_data` rows as "need to be in
cas result set".
* `cas_request::applies_to` substantially simplified since it
doesn't do anything more than checking `stmt.applies_to()`
in short-circuiting manner.
* `modification_statement::build_cas_result_set` method moved
to `cas_request`. This allows to easily iterate through
individual `cas_row_update` instances and preserve the order
of the rows in the result set.
* A little helper `cas_request::find_old_row`
is introduced to find a row in `prefetch_data` based on the
(pk, ck) combination obtained from the current `cas_request`
and a given `cas_row_update`.
* A few tests for the issue #7113 are written, other lwt-batch-related
tests adjusted accordingly.
Previously batch statement result set included rows for only
those updates which have a prefetch data present (i.e. there
was an "old" (pre-existing) row for a key).
Also, these rows were sorted not in the order in which statements
appear in the batch, but in the order of updated clustering keys.
If we have a batch which updates a few non-existent keys, then
it's impossible to figure out which update inserted a new key
by looking at the query response. Not only because the responses
may not correspond to the order of statements in the batch, but
even some rows may not show up in the result set at all.
The patch proposes the following fix:
For conditional batch statements the result set now always
includes a row for each LWT statement, in the same order
in which individual statements appear in the batch.
This way we can always tell which update did actually insert
a new key or update the existing one.
`update_parameters::prefetch_data::row::is_in_cas_result_set`
member variable was removed as well as supporting code in
`cas_request::applies_to` which iterated through cas updates
and marked individual `prefetch_data` rows as "need to be in
cas result set".
Instead now `cas_request::applies_to` is significantly
simplified since it doesn't do anything more than checking
`stmt.applies_to()` in short-circuiting manner.
A few tests for the issue are written, other lwt-batch-related
tests were adjusted accordingly to include rows in result set
for each statement inside conditional batches.
Tests: unit(dev, debug)
Co-authored-by: Konstantin Osipov <kostja@scylladb.com>
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
This is just a plain move of the code from `modification_statement`
to `cas_request` without changes in the logic, which will further
help to refactor `build_cas_result_set` behavior to include a row
for each LWT statement and order rows in the order of statements
in a batch.
Tests: unit(dev, debug)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Factor out little helper function which finds a pre-existing
row for a given `cas_row_update` (matching the primary key).
Used in `cas_request::applies_to`.
Will be used in a subsequent patch to move
`modification_statement::build_cas_result_set` into `cas_request`.
Tests: unit(dev, debug)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
The partitions_type::lower_bound() method can return a hint that saves
info about the "lower-ness of the bound", in particular when the search
key is found, this can be guessed from the hint without comparison.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The row_cache::find_or_create is only used to put (or touch) an entry in cache
having the partition_start mutation at hands. Thus, theres no point in carrying
key reference and tombstone value through the calls, just the partition_start
reference is enough.
Since the new cache entry is created incomplete, rename the creation method
to reflect this.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The only caller of find_or_create() in tests works on already existing (.populate()-d) entry,
so patch this place for explicity and for the sake of next patching.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now when the key for new partition is copied inside do_find_or_create_entry we may call
this function without allocator set, as it sets the allocator inside.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When the missing partition is created in cache the decorated key is copied from
the ring position view too early -- to do the lookup. However, the read context
had been already entered the partition and already has the decorated key on board,
so for lookup we can use the reference.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
We use a lot of TLS variables yet GDB is not of much help when working
with these. So in this patch I document where they are located in
memory, how to calculate the address of a known TLS variable and how to
find (identify) one given an address.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200903131802.1068288-1-bdenes@scylladb.com>
* seastar 7f7cf0f232...4ff91c4c3a (47):
> core/reactor: complete_timers(): restore previous scheduling group
Fixes#7117.
> future: Drop spurious 'mutable' from a lambda
> future: Don't put a std::tuple in future_state
> future: Prepare for changing the future_state storage type
> future: Add future_state::get0()
> when_all: Replace another untuple with get0
> when_all: Use get0 instead of untuple
> rpc: Don't assume that a future stores a std::tuple
> future: Add a futurize_base
> testing: Stop boost from installing its own signal handler
> future: Define futurize::value_type with future::value_type
> future: Move futurize down the file
> Merge "Put logging onto {fmt} rails" from Pavel E
> Merge "Make future_state non variadic" from Rafael
> sharded: propagate sharded instances stop error
> log: Fix misprint in docs
> future: Use destroy_at in destructor
> future: Add static_assert rejecting variadic future_state
> futures_test: Drop variadic test
> when_all: Drop a superfluous use of futurize::from_tuples
> everywhere: Use future::get0 when appropriate
> net: upgrade sockopt comments to doxygen
> iostream: document the read_exactly() function
> net:tls: fix clang-10 compilation duration cast
> future: Move continuation_base_from_future to future.hh
> repeat: Drop unnecessary make_tuple
> shared_future: Don't use future_state_type
> future: Add a static_assert against variadic futures
> future: Delete warn_variadic_future
> rpc_demo: Don't use a variadic future
> rpc_test: Don't use a variadic future
> futures_test: Don't use a variadic future
> future: Move disable_failure_guard to promise::schedule
> net: add an interface for custom socket options
> posix: add one more setsockopt overload
> Merge "Simplify pollfn and its inheritants" from Pavel E
> util:std-compat.hh: add forward declaration of std::pmr for clang-10
> rpc: Add protocol::has_handlers() helper
> Add a Seastar_DEBUG_ALLOCATIONS build option
> futures_test: Add a test for futures of references
> future: Simplify destruction of future_state
> Use detect_stack_use_after_return=1
> repeat: Fix indentation
> repeat: Delete try/catch
> repeat: Simplify loop
> Avoid call to std::exception_ptr's destructor from a .hh
> file: Add missing include
"
Before seastar is updated with the {fmt} engine under the
logging hood, some changes are to be made in scylla to
conform to {fmt} standards.
Compilation and tests checked against both -- old (current)
and new seastar-s.
tests: unit(dev), manual
"
* 'br-logging-update' of https://github.com/xemul/scylla:
code: Force formatting of pointer in .debug and .trace
code: Format { and } as {fmt} needs
streaming: Do not reveal raw pointer in info message
mp_row_consumer: Provide hex-formatting wrapper for bytes_view
heat_load_balance: Include fmt/ranges.h
This patch adds a test for the TRIM_HORIZON option of GetShardIterator in
Alternator Streams. This option asks to fetch again *all* the available
history in this shard stream. We had an implementation for it, but not a
test - so this patch adds one. The test passes.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200830131458.381350-1-nyh@scylladb.com>
Alternator Streams already support the AT_SEQUENCE_NUMBER and
AFTER_SEQUENCE_NUMBER options for iterators. These options allow to replay
a stream of changes from a known position or after that known position.
However, we never had a test verifying that these features actually work
as intended, beyond just checking syntax. Having such tests is important
because recently we changed the implementation of these iterators, but
didn't have a test verifying that they still work.
So in this patch we add such tests. The tests pass (as usual, on both
Alternator and DynamoDB).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200830115817.380075-1-nyh@scylladb.com>
We had a test, test_streams_last_result, that verifies that after reading from
an Alternator Stream the last event, reading again will find nothing.
But we didn't actually have a test which checks that if at that point a new event
*does* arrive, we can read it. This test checks this case, and it passes (we don't
have a bug there, but it's good as a regression test for NextShardIterator).
This test also verifies that after reading an event for a particular key on a
a specific stream "shard", the next event for the same key will arrive on the
same shard.
This test passes on both Alternator and DynamoDB.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200830105744.378790-1-nyh@scylladb.com>
A compaction strategy, that supports parallel compaction, may want to know
if the table has compaction running on its behalf before making a decision.
For example, a size-tiered-like strategy may not want to trigger a behavior,
like cross-tier compaction, when there's ongoing compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200901134306.23961-1-raphaelsc@scylladb.com>
If available, use the recently added
`reader_concurrency_semaphore::_initial_resources` to calculate the
amount of memory used out of the initially configured amount. If not
available, the summary falls back to the previous mode of just printing
the remaining amount of memory.
Example:
Replica:
Read Concurrency Semaphores:
user sstable reads: 11/100, 263621214/ 42949672 B, queued: 847
streaming sstable reads: 0/ 10, 0/ 42949672 B, queued: 0
system sstable reads: 1/ 10, 251584/ 42949672 B, queued: 0
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200901091452.806419-1-bdenes@scylladb.com>
This patch fixes a bug which caused sporadic failures of the Alternator
test - test_streams.py::test_streams_last_result.
The GetRecords operation reads from an Alternator Streams shard and then
returns an "iterator" from where to continue reading next time. Because
we obviously don't want to read the same change again, we "incremented"
the current position, to start at the incremented position on the next read.
Unfortunately, the implementation of the increment() function wasn't quite
right. The position in the CDC log is a timeuuid, which has a really bizarre
comparison function (see compare_visitor in types.cc). In particular the
least-sigificant bytes of the UUID are compared as *signed* bytes. This
means that if the last byte of the UUID was 127, and increment() increased
it to 128, and this was wrong because the comparison function later deemed
that as a signed byte, where 128 is lower than 127, not higher! The result
was that with 1/256 probability (whenever the last byte of the position was
127) we would return an item twice. This was reproduced (with 1/256
probability) by the test test_streams_last_result, as reported in issue #7004.
The fix in this patch is to drop the increment() and replace it by a flag
whether an iterator is inclusive of the threshold (>=) or exclusive (>).
The internal representation of the iterator has a boolean flag "inclusive",
and the string representation uses the prefixes "I" or "i" to indicate an
inclusive or exclusive range, respectively - whereas before this patch we
always used the prefix "I".
Although increment() could have been fixed to work correctly, the result would
have been ugly because of the weirdness of the timeuuid comparison function.
increment() would also require extensive new unit-tests: we were lucky that
the high-level functional tests caught a 1 in 256 error, but they would not
have caught rarer errors (e.g., 1 in 2^32). Furthermore, I am looking at
Alternator as the first "user" of CDC, and seeing how complicated and
error-prone increment() is, we should not recommend to users to use this
technique - they should use exclusive (>) range queries instead.
Fixes#7004.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200901102718.435227-1-nyh@scylladb.com>
Users can set python3 and sysconfdir from cmdline of install.sh
according to the install mode (root or nonroot) and distro type.
It's helpful to correct the default python3/sysconfdir, otherwise
setup scripts or scylla-jmx doesn't work.
Fixes#7130
Signed-off-by: Amos Kong <amos@scylladb.com>
If the install.sh is executed by sudo, then the tmp scylla.yaml is owned
by root. It's difficult to overwrite it by non-privileged users.
Signed-off-by: Amos Kong <amos@scylladb.com>
Commit a6ad70d3da changed the format of
stream IDs: the lower 8 bytes were previously generated randomly, now
some of them have semantics. In particular, the least significant byte
contains a version (stream IDs might evolve with further releases).
This is a backward-incompatible change: the code won't properly handle
stream IDs with all lower 8 bytes generated randomly. To protect us from
subtle bugs, the code has an assertion that checks the stream ID's
version.
This means that if an experimental user used CDC before the change and
then upgraded, they might hit the assertion when a node attempts to
retrieve a CDC generation with old stream IDs from the CDC description
tables and then decode it.
In effect, the user won't even be able to start a node.
Similarly as with the case described in
d89b7a0548, the simplest fix is to rename
the tables. This fix must get merged in before CDC goes out of
experimental.
Now, if the user upgrades their cluster from a pre-rename version, the
node will simply complain that it can't obtain the CDC generation
instead of preventing the cluster from working. The user will be able to
use CDC after running checkAndRepairCDCStreams.
Since a new table is added to the system_distributed keyspace, the
cluster's schema has changed, so sstables and digests need to be
regenerated for schema_digest_test.
Merged pull request https://github.com/scylladb/scylla/pull/7121
By Calle Wilund:
Refs #7095Fixes#7128
CDC delta!=full both relied on post-filtering
to remove generated log row and/or cells. This is inefficient.
Instead, simply check if the data should be created in the
visitors.
Also removed delta_mode=off mode.
cdc: Remove post-filterings for keys-only/off cdc delta generation
cdc: Remove cdc delta_mode::off
Refs #7095
CDC delta!=full both relied on post-filtering
to remove generated log row and/or cells. This is inefficient.
Instead, simply check if the data should be created in the
visitors.
v2:
* Fixed delta logs rows created (empty) even when delta == off
v3:
* Killed delta == off
v4:
* Move checks into (const) member var(s)
Fixes#7128
CDC logs are not useful without at least delta_mode==keys, since
pre/post image data has no info on _what_ was actually done to
base table in source mutation.
The following metric is added:
scylla_node_maintenance_operations_removenode_finished_percentage{shard="0",type="gauge"} 0.650000
It is the number of finished percentage for removenode operation so
far.
Fixes#1244, #6733
The following metric is added:
scylla_node_maintenance_operations_decommission_finished_percentage{shard="0",type="gauge"}
0.650000
It is the number of finished percentage for decommission operation so
far.
Fixes#1244, #6733
The following metric is added:
scylla_node_maintenance_operations_replace_finished_percentage{shard="0",type="gauge"} 0.650000
It is the number of finished percentage for replace operation so far.
Fixes#1244, #6733
hget and hset commands using hashes internally, thus
they are not using the existing write_strings() function.
Limitations:
- hset only supports 3 params, instead of multiple field/value
list that is available in official redis-server.
- hset should return 0 when the key and field already exists,
but I am not sure it's possible to retrieve this information
without doing read-before-write, which would not be atomic.
I factorized a bit the query_* functions to reduce duplication, but
I am not 100% sure of the naming, it may still be a bit confusing
between the schema used (strings, hashes) and the returned format
(currently only string but array should come later with hgetall).
Signed-off-by: Etienne Adam <etienne.adam@gmail.com>
Message-Id: <20200830190128.18534-1-etienne.adam@gmail.com>
Try to look up and use schema from the local schema_registry
in case when we have a schema mismatch between the most recent schema
version and the one that is stored inside the `frozen_mutation` for
the accepted proposal.
When such situation happens the stored `frozen_mutation` is able
to be applied only if we are lucky enough and column_mapping in
the mutation is "compatible" with the new table schema.
It wouldn't work if, for example, the columns are reordered, or
some columns, which are referenced by an LWT query, are dropped.
With the patch we are able to mitigate these cases as long as the
referenced schema is still present in the node cache (e.g.
it didn't restart/crash or the cache entry is not too old
to be evicted).
Tests: unit(dev, debug), dtest(paxos_tests.schema_mismatch_*_test)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200827150844.624017-1-pa.solodovnikov@scylladb.com>
The get_natural_endpoints returns the list of nodes holding a key.
There is a variation of the method that gets the key as string, the
current implementation just cast the string to bytes_view, which will
not work. Instead, this patch changes the implementation to use
from_nodetool_style_string to translate the key (in a nodetool like
format) to a token.
Fixes#7134
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
After issue #7107 was fixed (regarding the correctness of OldImage and NewImage
in Alternator Streams) we forgot to remove the "xfail" tag from one of the tests
for this issue.
This test now passes, as expected, so in this patch we remove the xfail tag.
Refs #7107
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200827103054.186555-1-nyh@scylladb.com>
alternator/getting-started.md had a missing grave accent (`) character,
resulting in messed up rendering of the involved paragraph. Add the missing
quote.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200827110920.187328-1-nyh@scylladb.com>
The lockup:
When view_builder starts all shards at some point get to a
barrier waiting for each other to pass. If any shard misses
this checkpoint, all others stuck forever. As this barrier
lives inside the _started future, which in turn is waited
on stop, the stop stucks as well.
Reasons to miss the barrier -- exception in the middle of the
fun^w start or explicit abort request while waiting for the
schema agreement.
Fix the "exception" case by unlocking the barrier promise with
exception and fix the "abort request" case by turning it into
an exception.
The bug can be reproduced by hands if making one shard never
see the schema agreement and continue looping until the abort
request.
The crash:
If the background start up fails, then the _started future is
resolved into exception. The view_builder::stop then turns this
future into a real exception caught-and-rethrown by main.cc.
This seems wrong that a failure in a background fiber aborts
the regular shutdown that may proceed otherwise.
tests: unit(dev), manual start-stop
branch: https://github.com/xemul/scylla/tree/br-view-builder-shutdown-fix-3fixes: #7077
Patch #5 leaves the seastar::async() in the 1-st phase of the
start() although can also be tuned not to produce a thread.
However, there's one more (painless) issue with the _sem usage,
so this change appears too large for the part of the bug-fix
and will come as a followup.
* 'br-view-builder-shutdown-fix-3' of git://github.com/xemul/scylla:
view_builder: Add comment about builder instances life-times
view_builder: Do sleep abortable
view_builder: Wakeup barrier on exception
view_builder: Always resolve started future to success
view_builder: Re-futurize start
view_builder: Split calculate_shard_build_step into two
view_builder: Populate the view_builder_init_state
view_builder: Fix indentation after previous patch
view_builder: Introduce view_builder_init_state
Merged pull request https://github.com/scylladb/scylla/pull/7063
By Calle Wilund:
Fixes#6935
DynamoDB streams for some reason duplicate the record keys
into both the "Keys" and "OldImage"/"NewImage" sub-objects
when doing GetRecords. But only if there is other data
to include.
This patch appends the pk/ck parts into old/new image
iff we had any record data.
Updated to handle keys-only updates, and distinguish creating vs. updating rows. Changes cdc to not generate preimage
for non-existent/deleted rows, and also fixes missing operations/ttls in keys-only delta mode.
alternator_streams: Include keys in OldImage/NewImage
cdc: Do not generate pre/post image for non-existent rows
Fixes#6935Fixes#7107
DynamoDB streams for some reason duplicate the record keys
into both the "Keys" and "OldImage"/"NewImage" sub-objects
when doing GetRecords.
This patch appends the pk/ck parts into old/new image, and
also removes the previous restrictions on image generation
since cdc now generates more consistent pre/post image
data.
Fixes#7119Fixes#7120
If preimage select came up empty - i.e. the row did not exist, either
due to never been created, or once delete, we should not bother creating
a log preimage row for it. Esp. since it makes it harder to interpret the
cdc log.
If an operation in a cdc batch did a row delete (ranged, ck, etc), do
not generate postimage data, since the row does no longer exist.
Note that we differentiate deleting all (non-pk/ck) columns from actual
row delete.
... and tests. Printin a pointer in logs is considered to be a bad practice,
so the proposal is to keep this explicit (with fmt::ptr) and allow it for
.debug and .trace cases.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two places that want to print "{<text>}" strings, but do not format
the curly braces the {fmt}-way.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Showing raw pointer values in logs is not considered to be good
practice. However, for debugging/tracing this might be helpful.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
By default {fmt} doesn't know how to format this type (although it's a
basic_string_view instantiated), and even providing formatter/operator<<
does not help -- it anyway hits an earlier assertion in args mapper about
the disallowance of character types mixing.
The hex-wrapper with own operator<< solves the problem.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
"
The view_info object, which is attached to the schema object of the
view, contains a data structure called
"base_non_pk_columns_in_view_pk". This data structure contains column
ids of the base table so is valid only for a particular version of the
base table schema. This data structure is used by materialized view
code to interpret mutations of the base table, those coming from base
table writes, or reads of the base table done as part of view updates
or view building.
The base table schema version of that data structure must match the
schema version of the mutation fragments, otherwise we hit undefined
behavior. This may include aborts, exceptions, segfaults, or data
corruption (e.g. writes landing in the wrong column in the view).
Before this patch, we could get schema version mismatch here after the
base table was altered. That's because the view schema did not change
when the base table was altered.
Another problem was that view building was using the current table's schema
to interpret the fragments and invoke view building. That's incorrect for two
reasons. First, fragments generated by a reader must be accessed only using
the reader's schema. Second, base_non_pk_columns_in_view_pk of the recorded
view ptrs may not longer match the current base table schema, which is used
to generate the view updates.
Part of the fix is to extract base_non_pk_columns_in_view_pk into a
third entity called base_dependent_view_info, which changes both on
base table schema changes and view schema changes.
It is managed by a shared pointer so that we can take immutable
snapshots of it, just like with schema_ptr. When starting the view
update, the base table schema_ptr and the corresponding
base_dependent_view_info have to match. So we must obtain them
atomically, and base_dependent_view_info cannot change during update.
Also, whenever the base table schema changes, we must update
base_dependent_view_infos of all attached views (atomically) so that
it matches the base table schema.
Fixes#7061.
Tests:
- unit (dev)
- [v1] manual (reproduced using scylla binary and cqlsh)
"
* tag 'mv-schema-mismatch-fix-v2' of github.com:tgrabiec/scylla:
db: view: Refactor view_info::initialize_base_dependent_fields()
tests: mv: Test dropping columns from base table
db: view: Fix incorrect schema access during view building after base table schema changes
schema: Call on_internal_error() when out of range id is passed to column_at()
db: views: Fix undefined behavior on base table schema changes
db: views: Introduce has_base_non_pk_columns_in_view_pk()
If one shard delays in seeing the schema agreement and returns on
abort request, other shards may get stuck waiting for it on the
status read barrier. Luckily with the previous patch the barrier
is exception-proof, so we may abort the waiting loop with exception
and handle the lock-up.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
If an exception pops up during the view_builder::start while some
shards wait for the status-read barrier, these shards are not woken
up, thus causing the shutdown to stuck.
Fix this by setting exception on the barrier promise, resolving all
pending and on-going futures.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
If the view builder background start fails, the _started future resolves
to exceptional state. In turn, stopping the view builder keeps this state
through .finally() and aborts the shutdown very early, while it may and
should proceed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Step two turning the view_builder::start() into a chain of lambdas --
rewrite (most of) the seastar::async()'s lambda into a more "classical"
form.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The calculate_shard_build_step() has a cross-shard barrier in the middle and
passing the barrier is broken wrt exceptions that may happen before it. The
intention is to prepare this barrier passing for exception handling by turning
the view_builder::start() into a dedicated continuation lambda.
Step one in this campaign -- split the calculate_shard_build_step() into
steps called by view_builder::start():
- before the barrier
- barrier
- after the barrier
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Keep the internal calculate_shard_build_step()'s stuff on the init helper
struct, as the method in question is about to be split into a chain of
continuation lambdas.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is the helper initialization struct that will carry the needed
objects accross continuation lambdas.
The indentation in ::start() will be fixed in the next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
"
This series adds scylla repairs command to help debug repair.
Fixes#7103
"
* asias-repair_help_debug_scylla_repairs_cmd:
scylla-gdb.py: Add scylla repairs command
repair: Add repair_state to track repair states
scylla-gdb.py: Print the pointers of elements in boost_intrusive_list_printer
scylla-gdb.py: Add printer for gms::inet_address
scylla-gdb.py: Fix a typo in boost_intrusive_list
repair: Fix the incorrect comments for _all_nodes
repair: Add row_level_repair object pointer in repair_meta
repair: Add counter for reads issued and finished for repair_reader
"
Issue https://github.com/scylladb/scylla/issues/7019 describes a problem of an ever-growing map of temporary values stored in query_options. In order to mitigate this kind of problems, the storage for temporary values is moved from an external data structure to the value views itself. This way, the temporary lives only as long as it's accessible and is automatically destroyed once a request finishes. The downside is that each temporary is now allocated separately, while previously they were bundled in a single byte stream.
Tests: unit(dev)
Fixes https://github.com/scylladb/scylla/issues/7019
"
* psarna-move_temporaries_to_value_view:
cql3: remove query_options::linearize and _temporaries
cql3: remove make_temporary helper function
cql3: store temporaries in-place instead of in query_options
cql3: add temporary_value to value view
cql3: allow moving data out of raw_value
cql3: split values.hh into a .cc file
With relatively big summaries, reactor can be stalled for a couple
of milliseconds.
This patch:
a. allocates positions upfront to avoid excessive reallocation.
b. returns a future from seal_summary() and uses `seastar::do_for_each`
to iterate over the summary entries so the loop can yield if necessary.
Fixes#7108.
Based on 2470aad5a389dfd32621737d2c17c7e319437692 by Raphael S. Carvalho <raphaelsc@scylladb.com>
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200826091337.28530-1-bhalevy@scylladb.com>
Per sheduling-group data was moved from the task queues to a separate
data member in the reactor itself. Update `scylla memory` to use the new
location to get the per sheduling group data for the storage proxy
stats.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200825140256.309299-1-bdenes@scylladb.com>
We copy a list, which was reported to generate a 15ms stall.
This is easily fixed by moving it instead, which is safe since this is
the last use of the variable.
Fixes#7115.
If file_writer::close() fails to close the output stream
closing will be retried in file_writer::~file_writer,
leading to:
```
include/seastar/core/future.hh:1892: seastar::future<T ...> seastar::promise<T>::get_future() [with T = {}]: Assertion `!this->_future && this->_state && !this->_task' failed.
```
as seen in https://github.com/scylladb/scylla/issues/7085Fixes#7085
Test: unit(dev), database_test with injected error in posix_file_impl::close()
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200826062456.661708-1-bhalevy@scylladb.com>
This patch adds tests for the "NewImage" attribute in Alternator Streams
in NEW_IMAGE and NEW_AND_OLD_IMAGES mode.
It reproduces issue #7107, that items' key attributes are missing in the
NewImage. It also verifies the risky corner cases where the new item is
"empty" and NewImage should include just the key, vs. the case where the
item is deleted, so NewImage should be missing.
This test currently passes on AWS DynamoDB, and xfails on Alternator.
Refs #7107.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200825113857.106489-1-nyh@scylladb.com>
Never trust Occam's Razor - it turns out that the use-after-free bug in the
"exists" command was caused by two separate bugs. We fixed one in commit
9636a33993, but there is a second one fixed in
this patch.
The problem fixed here was that a "service_permit" object, which is designed to
be copied around from place to place (it contains a shared pointer, so is cheap
to copy), was saved by reference, and the reference was to a function argument
and was destroyed prematurely.
This time I tested *many times* that that test_strings.py passes on both dev and
debug builds.
Note that test/run/redis still fails in a debug build, but due to a different
problem.
Fixes#6469
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200825183313.120331-1-nyh@scylladb.com>
test/redis/README.md suggests that when running "pytest" the default is to connect
to a local redis on localhost:6379. This default was recently lost when options
were added to use a different host and port. It's still good to have the default
suggested in README.md.
It also makes it easier to run the tests against the standard redis, which by
default runs on localhost:6379 - by just running "pytest".
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200825195143.124429-1-nyh@scylladb.com>
query_options::linearize was the only user of _temporaries helper
attribute, and it turns out that this function is never used -
- and is therefore removed.
When a value_view needs to store a temporarily instantiated object,
it can use the new variant field. The temporary value will live
only as long as the view itself.
Use repair_state to track the major state of repair from the beginning
to the end of repair.
With this patch, we can easily know at which state both the repair
master and followers are. It is very helpful when debugging a repair
hang issue.
Refs #7103
We saw errors in killed_wiped_node_cannot_join_test dtest:
Aug 2020 10:30:43 [node4] Missing: ['A node with address 127.0.76.4 already exists, cancelling join']:
The test does:
n1, n2, n3, n4
wipe data on n4
start n4 again with the same ip address
Without this patch, n4 will bootstrap into the cluster new tokens. We
should prevent n4 to bootstrap because there is an existing
node in the cluster.
In shadow round, the local node should apply the application state of
the node with the same ip address. This is useful to detect a node
trying to bootstrap with the same IP address of an existing node.
Tests: bootstrap_test.py
Fixes: #7073
Fixes#6900
Clustered range deletes did not clear out the "row_states" data associated
with affected rows (might be many).
Adds a sweep through and erases relevant data. Since we do pre- and
postimage in "order", this should only affect postimage.
Fix the default number of test repeats to 1, which it was before
(spotted by Nadav). Also, prefix the options so that they become
"--test-repeat" and "--test-timeout" (spotted by Avi).
Message-Id: <20200825081456.197210-1-penberg@scylladb.com>
It is useful to distinguish if the repair is a regular repair or used
for node operations.
In addition, log the keyspace and tables are repaired.
Fixes#7086
A check, to validate that counter column cannot be added into non-counter table,
is missing for alter table statement. Validation is performed when building new
schema, but it's limited to checking that a schema will not contain both counter
and non-counter columns.
Due to lack of validation, the added counter column could be incorrectly
persisted to the schema, but this results in a crash when setting the new
schema to its table. On restart, it can be confirmed that the schema change
was indeed persisted when describing the table.
This problem is fixed by doing proper validation for the alter table statement,
which consists of making sure a new counter column cannot be added to a
non-counter table.
The test cdc_disallow_cdc_for_counters_test is adjusted because one of its tests
was built on the assumption that counter column can be added into a non-counter
table.
Fixes#7065.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200824155709.34743-1-raphaelsc@scylladb.com>
To verify offline installed image with do_verify_package() in scylla_setup, we introduce is_offline() function works just like is_nonroot() but the install not with --nonroot.
* syuu1228-do_verify_package_for_offline:
scylla_setup: verify package correctly on offline install
scylla_util.py: implement is_offline() to detect offline installed image
do_verify_package written only for .rpm/.deb, does not working correctly
for offline install(including nonroot).
We should check file existance for the environment, not by package existance
using rpm/dpkg.
Fixes#7075
A missing "&" caused the key stored in a long-living command to be copied
and the copy quickly freed - and then used after freed.
This caused the test test_strings.py::test_exists_multiple_existent_key for
this feature to frequently crash.
Fixes#6469
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200823190141.88816-1-nyh@scylladb.com>
Consider:
1) Start n1,n2,n3
2) Stop n3
3) Start n4 to replace n3 but list n4 as seed node
4) Node n4 finishes replacing operation
5) Restart n2
6) Run SELECT * from system.peers on node or node 1.
cqlsh> SELECT * from system.peers ;
peer| data_center | host_id| preferred_ip | rack | release_version | rpc_address | schema_version| supported_features| tokens
127.0.0.3 |null |null | null | null | null |null |null |null | {'-90410082611643223', '5874059110445936121'}
The replaced old node 127.0.0.3 shows in system.peers.
(Note, since commit 399d79fc6f (init: do not allow
replace-address for seeds), step 3 will be rejected. Assume we use a version without it)
The problem is that n2 sees n3 is in gossip status of SHUTDOWN after
restart. The storage_service::handle_state_normal callback is called for
127.0.0.3. Since n4 is using different token as n3 (seed node does not bootstrap so
it uses new tokens instead of tokens of n3 which is being replaced), so
owned_tokens will be set. We see logs like:
[shard 0] storage_service - handle_state_normal: New node 127.0.0.3 at token 5874059110445936121
[shard 0] storage_service - Host ID collision for cbec60e5-4060-428e-8d40-9db154572df7 between 127.0.0.4
and 127.0.0.3; ignored 127.0.0.3
As a result, db::system_keyspace::update_tokens
will be called to write to system.peers for 127.0.0.3 wrongly.
if (!owned_tokens.empty()) {
db::system_keyspace::update_tokens(endpoint, owned_tokens)
}
To fix, we should skip calling db::system_keyspace::update_tokens if the
nodes is present in endpoints_to_remove.
Refs: #4652
Refs: #6397
Just like test/alternator, make redis-test runnable from test.py.
For this we move the redis tests into a subdirectory of tests/,
and create a script to run them: tests/redis/run.
These tests currently fail, so we did not yet modify test.py to actually
run them automatically.
Fixes#6331
"
There's last call for global storage service left in compaction code, it
comes from cleanup_compaction to get local token ranges for filtering.
The call in question is a pure wrapper over database, so this set just
makes use of the database where it's already available (perform_cleanup)
and adds it where it's needed (perform_sstable_upgrade).
tests: unit(dev), nodetool upgradesstables
"
* 'br-remove-ss-from-compaction-3' of https://github.com/xemul/scylla:
storage_service: Remove get_local_ranges helper
compaction: Use database from options to get local ranges
compaction: Keep database reference on upgrade options
compaction: Keep database reference on cleanup options
db: Factor out get_local_ranges helper
This patch causes orphaned hints (hints that were written towards a node
that is no longer their replica) to be sent with a default write
timeout. This is what is currently done for non-orphaned hints.
Previously, the timeout was hardcoded to one hour. This could cause a
long delay while shutting down, as hints manager waits until all ongoing
hint sending operation finish before stopping itself.
Fixes: #7051
This reader was probably created in ancient times, when readers didn't
yet have a _schema member of their own. But now that they do, it is not
necessary to store the schema in the reader implementation, there is one
available in the parent class.
While at it also move the schema into the class when calling the
constructor.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200821070358.33937-1-bdenes@scylladb.com>
"
We keep refrences to locator::token_metadata in many places.
Most of them are for read-only access and only a few want
to modify the token_metadata.
Recently, in 94995acedb,
we added yielding loops that access token_metadata in order
to avoid cpu stalls. To make that possible we need to make
sure they token_metadata object they are traversing won't change
mid-loop.
This series is a first step in ensuring the serialization of
updates to shared token metadata to reading it.
Test: unit(dev)
Dtest: bootstrap_test:TestBootstrap.start_stop_test{,_node}, update_cluster_layout_tests.py -a next-gating(dev)
"
* tag 'constify-token-metadata-access-v2' of github.com:bhalevy/scylla:
api/http_context: keep a const sharded<locator::token_metadata>&
gossiper: keep a const token_metadata&
storage_service: separate get_mutable_token_metadata
range_streamer: keep a const token_metadata&
storage_proxy: delete unused get_restricted_ranges declaration
storage_proxy: keep a const token_metadata&
storage_proxy: get rid of mutable get_token_metadata getter
database: keep const token_metadata&
database: keyspace_metadata: pass const locator::token_metadata& around
everywhere_replication_strategy: move methods out of line
replication_strategy: keep a const token_metadata&
abstract_replication_strategy: get_ranges: accept const token_metadata&
token_metadata: rename calculate_pending_ranges to update_pending_ranges
token_metadata: mark const methods
token_ranges: pending_endpoints_for: return empty vector if keyspace not found
token_ranges: get_pending_ranges: return empty vector if keyspace not found
token_ranges: get rid of unused get_pending_ranges variant
replication_strategy: calculate_natural_endpoints: make token_metadata& param const
token_metadata: add get_datacenter_racks() const variant
The cleanup compaction wants to keep local tokens on-board and gets
them from storage_service.get_local_ranges().
This method is the wrapper around database.get_keyspace_local_ranges()
created in previous patch, the live database reference is already
available on the descriptor's options, so we can short-cut the call.
This allows removing the last explicit call for global storage_service
instance from compaction code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The only place that creates them is the API upgrade_sstables call.
The created options object doesn't over-survive the returned
future, so it's safe to keep this reference there.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The database is available at both places that create the options --
tests and API perform_cleanup call.
Options object doesn't over-survive the returned future, so it's
safe to keep the reference on it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Storage service and repair code have identical helpers to get local
ranges for keyspace. Move this helper's code onto database, later it
will be reused by one more place.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Use a different getter for a token_metadata& that
may be changed so we can better synchronize readers
and writers of token_metadata and eventually allow
them to yield in asynchronous loops.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We'd like to strictly control who can modify token metadata
and nobody currently needs a mutable reference to storage_proxy::_token_metadata.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
No need to modify token_metadata form database code.
Also, get rid of mutable get_token_metadata variant.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Move methods depending on token_metadata to source file
so we can avoid including token_metadata.hh in header files
where spossible.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Many token_metadata methods do not modify the object
and can be marked as const.
The motivation is to better control who may modify
token_metadata.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It is no longer called once for a given view_info, so the name
"initialize" is not appropriate.
This patch splits the "initialize" method into the "make" part, which
makes a new base_info object, and the "set" part, which changes the
current base_info object attached to the view.
The view building process was accessing mutation fragments using
current table's schema. This is not correct, fragments must be
accessed using the schema of the generating reader.
This could lead to undefined behavior when the column set of the base
table changes. out_of_range exceptions could be observed, or data in
the view ending up in the wrong column.
Refs #7061.
The fix has two parts. First, we always use the reader's schema to
access fragments generated by the reader.
Second, when calling populate_views() we upgrade the fragment-wrapping
reader's schema to the base table schema so that it matches the base
table schema of view_and_base snapshots passed to populate_views().
The view_info object, which is attached to the schema object of the
view, contains a data structure called
"base_non_pk_columns_in_view_pk". This data structure contains column
ids of the base table so is valid only for a particular version of the
base table schema. This data structure is used by materialized view
code to interpret mutations of the base table, those coming from base
table writes, or reads of the base table done as part of view updates
or view building.
The base table schema version of that data structure must match the
schema version of the mutation fragments, otherwise we hit undefined
behavior. This may include aborts, exceptions, segfaults, or data
corruption (e.g. writes landing in the wrong column in the view).
Before this patch, we could get schema version mismatch here after the
base table was altered. That's because the view schema does not change
when the base table is altered.
Part of the fix is to extract base_non_pk_columns_in_view_pk into a
third entitiy called base_dependent_view_info, which changes both on
base table schema changes and view schema changes.
It is managed by a shared pointer so that we can take immutable
snapshots of it, just like with schema_ptr. When starting the view
update, the base table schema_ptr and the corresponding
base_dependent_view_info have to match. So we must obtain them
atomically, and base_dependent_view_info cannot change during update.
Also, whenever the base table schema changes, we must update
base_dependent_view_infos of all attached views (atomically) so that
it matches the base table schema.
Refs #7061.
When user defines a build mode with configure.py, the build, check, and
test targets fail as follows:
./configure.py --mode=dev && ninja build
ninja: error: 'debug-build', needed by 'build', missing and no known rule to make it
Fix the issue by making the targets depend on build targets for
specified build modes, not all available modes.
Message-Id: <20200813105639.1641090-1-penberg@scylladb.com>
We already have a test, test_streams.py::test_streams_updateitem_old_image,
for issue #6935: It tests that the OLD_IMAGE in Alternator Streams should
contain the item's key.
However this test was missing one corner case, which is the first solution
for this issue did incorrectly. So in this patch we add a test for this
corner case, test_streams_updateitem_old_image_empty_item:
This corner case about the item existing, but *empty*, i.e., having just
the key but no other attribute. In this case, OLD_IMAGE should return that
empty item - including its key. Not nothing.
As usual, this test passes on DynamoDB and xfails on Alternator, and the
"xfail" mark will be removed when issue #6935 is fixed.
Refs #6935.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200819155229.34475-1-nyh@scylladb.com>
When binding prepared statement it is possible that values being binded
are not correct. Unfortunately before this patch, the error message
was only saying what type got a wrong value. This was not very helpful
because there could be multiple columns with the same type in the table.
We also support collections so sometimes error was saying that there
is a wrong value for a type T but the affected column was actually of
type collection<T>.
This patch adds information about a column name that got the wrong
value so that it's easier to find and fix the problem.
Tests: unit(dev)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <90b70a7e5144d7cb3e53876271187e43fd8eee25.1597832048.git.piotr@scylladb.com>
"
The messaging service is (as many other services) present in
the global namespace and is widely accessed from where needed
with global get(_local)?_messaging_service() calls. There's a
long-term task to get rid of this globality and make services
and componenets reference each-other and, for and due-to this,
start and stop in specific order. This set makes this for the
messaging service.
The service is very low level and doesn't depend on anything.
It's used by gossiper, streaming, repair, migration manager,
storage proxy, storage service and API. According to this
dependencies the set consists of several parts:
patches 1-9 are preparatory, they encapsulate messaging service
init/fini stuff in its own module and decouple it from the
db::config
patch 10-12 introduce local service reference in main and set
its init/fini calls at the early stage so that this reference
can later be passed to those depending on it
patches 13-42 replace global referencing of messaging service
from other subsystems with local references initialized from
main.
patch 43 finalizes tests.
patch 44 wraps things up with removing global messaiging service
instance along with get(_local)?_messaging_service calls.
The service's stopping part is deliberately left incomplete (as
it is now), the sharded service remains alive, only the instance's
stop() method is called (and is empty for a while). Since the
messaging service's users still do not stop cleanly, its instances
should better continue leaking on exit.
Once (if) the seastar gets the helper rpc::has_handlers() method
merged the messaging_service::stop() will be able to check if all
the verbs had been unregistered (spoiler: not yet, more fixes to
come).
For debugging purposes the pointer on now-local messaging service
instance is kept in service::debug namespace.
tests: unit(dev)
dtest(dev: simple_boot_shutdown, repair, update_cluster_layout)
manual start-stop
"
* 'br-unglobal-messaging-service-2' of https://github.com/xemul/scylla: (44 commits)
messaging_service: Unglobal messaging service instance
tests: Use own instances of messaging_service
storage_service: Use local messaging reference
storage_service: Keep reference on sharded messaging service
migration_manager: Add messaging service as argument to get_schema_definition
migration_manager: Use local messaging reference in simple cases
migration_manager: Keep reference on messaging
migration_manager: Make push_schema_mutation private non-static method
migration_manager: Move get_schema_version verb handling from proxy
repair: Stop using global messaging_service references
repair: Keep sharded messaging service reference on repair_meta
repair: Keep sharded messaging service reference on repair_info
repair: Keep reference on messaging in row-level code
repair: Keep sharded messaging service in API
repair: Unset API endpoints on stop
repair: Setup API endpoints in separate helper
repair: Push the sharded<messaging_service> reference down to sync_data_using_repair
repair: Use existing sharded db reference
repair: Mark repair.cc local functions as static
streaming: Keep messaging service on send_info
...
Remove the global messaging_service, keep it on the main stack.
But also store a pointer on it in debug namespace for debugging.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The global one is going away, no core code uses it, so all tests
can be safely switched to use their own instances.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All the places the are (and had become such with previous patches) using
the global messaging service and the storage service methods, so they
can access the local reference on the messaging service.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are 4 places that call this helper:
- storage proxy. Callers are rpc verb handlers and already have the proxy
at hands from which they can get the messaging service instance
- repair. There's local-global messaging instance at hands, and the caller
is in verb handler too
- streaming. The caller is verb handler, which is unregistered on stop, so
the messaging service instance can be captured
- migration manager itself. The caller already uses "this", so the messaging
service instance can be get from it
The better approach would be to make get_schema_definition be the method of
migration_manager, but the manager is stopped for real on shutdown, thus
referencing it from the callers might not be safe and needs revisiting. At
the same time the messaging service is always alive, so using its reference
is safe.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Most of those places are either non-static migration_manager methods.
Plus one place where the local service instance is already at hands.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The local migration manager instance is already available at caller, so
we can call a method on it. This is to facilitate next patching.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The user of this verb is migration manager, so the handler must be it as well.
The hander code now explicitly gets global proxy. This call is safe, as proxy
is not stopped nowadays. In the future we'll need to revisit the relation
between migration - proxy - stats anyway.
The use of local migration manager is safe, as it happens in verb handler which
is unregistered and is waited to be completed on migration manager stop.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now all the users of messaging service have the needed reference.
Again, the messaging service is not really stopped at the end, so its usage
is safe regardless of whether repair stuff itself leaks on stop or not.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The reference comes from repair_info and storage_service calls, both
had been already patched for that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The row-level repair keeps its statics for needed services, same as the
streaming does. Treat the messaging service the same way to stop using
the global one in the next patches.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This unset the roll-back of the correpsonding _set-s. The messaging
service will be (already is, but implicitly) used in repair API
callbacks, so make sure they are unset before the messaging service
is stopped.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There will be the unset part soon, this is the preparation. No functional
changes in api/storage_server.cc, just move the code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This function needs the messaging service inside, but the closest place where it
can get one from is the storage_service API handlers. Temporarily move the call for
global messaging service into storage service, its turn for this cleanup will
come later.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The db.invoke_on_all's lambda tries to get the sharded db reference via
the global storage service. This can be done in a much nicer way.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Streaming uses messaging, init it with itw own reference.
Nowadays the whole streaming subsystem uses global static references on the
needed services. This is not nice, but still better than just using code-wide
globals, so treat the messaging service here the same way.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It is alive there, so it is safe to pass one to lambda.
Once in forward_fn, it can be used to get messaging from.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Most of the places that need messaging service in proxy already use
storage_proxy instance, so it is safe to get the local messaging
from it too.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The proxy is another user of messaging, so keep the reference on it. Its
real usage will come in next patches.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Gossiper needs messaging service, the messaging is started before the
gossiper, so we can push the former reference into it.
Gossiper is not stopped for real, neither the messaging service is, so
the memory usage is still safe.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Let's reserve space for sharding metadata in advance, to avoid excessive
allocations in create_sharding_metadata().
With the default ignore_msb_bits=12, it was observed that the # of
reallocations is frequently 11-12. With ignore_msb_bits=16, the number
can easily go up to 50.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200814210250.39361-1-raphaelsc@scylladb.com>
Merged pull request https://github.com/scylladb/scylla/pull/6978
by Kamil Braun:
These abstractions are used for walking over mutations created by a write
coordinator, deconstructing them into atomic'' pieces (changes''),
and consuming these pieces.
Read the big comment in cdc/change_visitor.hh for more details.
4 big functions were rewritten to use the new abstractions.
tests:
unit (dev build)
all dtests from cdc_tests.py, except cdc_tests.py:TestCdc.cluster_reduction_with_cdc_test, have passed (on a dev build). The test that fails also fails on master.
Part of #5945.
cdc: rewrite process_changes using inspect_mutation
cdc: move some functions out of `cdc::transformer`
cdc: rewrite extract_changes using inspect_mutation
cdc: rewrite should_split using inspect_mutation
cdc: rewrite find_timestamp using inspect_mutation
cdc: introduce a ,,change visitor'' abstraction
"
After data segregation feature, anything that cause out-of-order writes,
like read repair, can result in small updates to past time windows.
This causes compaction to be very aggressive because whenever a past time
window is updated like that, that time window is recompacted into a
single SSTable.
Users expect that once a window is closed, it will no longer be written
to, but that has changed since the introduction of the data segregation
future. We didn't anticipate the write amplification issues that the
feature would cause. To fix this problem, let's perform size-tiered
compaction on the windows that are no longer active and were updated
because data was segregated. The current behavior where the last active
window is merged into one file is kept. But thereafter, that same
window will only be compacted using STCS.
Fixes#6928.
"
* 'fix_twcs_agressiveness_after_data_segregation_v2' of github.com:raphaelsc/scylla:
compaction/twcs: improve further debug messages
compaction/twcs: Improve debug log which shows all windows
test: Check that TWCS properly performs size-tiered compaction on past windows
compaction/twcs: Make task estimation take into account the size-tiered behavior
compaction/stcs: Export static function that estimates pending tasks
compaction/stcs: Make get_buckets() static
compact/twcs: Perform size-tiered compaction on past time windows
compaction/twcs: Make strategy easier to extend by removing duplicated knowledge
compaction/twcs: Make newest_bucket() non-static
compaction/twcs: Move TWCS implementation into source file
Contains patch from Rafael to fix up includes.
* seastar c872c3408c...7f7cf0f232 (9):
> future: Consider result_unavailable invalid in future_state_base::ignore()
> future: Consider result_unavailable invalid in future_state_base::valid()
> Merge "future-util: split header" from Benny
> docs: corrected some text and code-examples in streaming-rpc docs
> future: Reduce nesting in future::then
> demos: coroutines: include std-compat.hh
> sstring: mark str() and methods using it as noexcept
> tls: Add an assert
> future: fix coroutine compilation
Some tests directly reference the global messaging service. For the sake
of simpler patching wrap this global reference with a local one. Once the
global messaging service goes away tests will get their own instances.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
API is one of the subsystems that work with messaging service. To keep
the dependencies correct the related API stuff should be stopped before
the messaging service stops.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are few places that initialize db and system_ks and need the
messaging service. Pass the reference to it from main instead of
using the global helpers.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is the preparation for moving the message service to main -- keep
a reference and eventually pass one to subsystems depending on messaging.
Once they are ready, the reference will be turned into an instance.
For now only push the reference into the messaging service init/exit
itself, other subsystems will be patched next.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Introduce back the .stop() method that will be used to really stop
the service. For now do not do sharded::stop, as its users are not
yet stopping, so this prevents use-after-free on messaging service.
For now the .stop() is empty, but will be in charge of checking if
all the other users had unregisterd their handlers from rpc.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The messaging service is a low-level one which doesn't need other
services, so it can be started first. Nowadays it's indeed started
before most of its users but one -- the gossiper.
In current code gossiper doesn't do anything with messaging service
until it starts, but very soon this dependency will be expressed in
terms of a refernce from gossiper to messaging_service, thus by the
time the latter starts, the former should already exist.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now the init_messaging_service() only deals with messaing service
and related internal stuff, so it can sit in its own module.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This struct is nowadays only used to transport arguments from db::config
to messaging_service::scheduling_config, things get simpler if dropping it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This makes the messaging service configuration completely independent
from the db config. Next step would be to move the messaging service
init code into its module.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is the continuation of the previous patch -- change the primary
constructor to work with config. This, in turn, will decouple the
messaging service from database::config.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This service constructor uses and copies many simple values, it would be
much simpler to group them on config. It also helps the next patches to
simplify the messaging service initialization and to keep the defaults
(for testing) in one place.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The init_ms_fd_gossiper function initializes two services, but
effectively consists of two independent parts, so declare them
as such.
The duplication of listen address resolution will go away soon.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
On today's stop() the messaging service is not really stopped as other
services still (may) use it and have registered handlers in it. Inside
the .stop() only the rpc servers are brought down, so the better name
for this method would be shutdown().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Just a cleanup. These internal stoppers must be private, also there
are too many public specifiers in the class description around them.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The current log prints one log entry for each window, it doesn't print
the # of SSTs in the bucket, and the now information is copied across
all the window entries.
previously, it looked like this:
[shard 0] compaction - Key 1597331160000000, now 1597331160000000
[shard 0] compaction - Key 1597331100000000, now 1597331160000000
[shard 0] compaction - Key 1597331040000000, now 1597331160000000
[shard 0] compaction - Key 1597330980000000, now 1597331160000000
this made it harder to group all windows which reflect the state of
the strategy in a given time.
now, it looks like as follow:
[shard 0] compaction - time_window_compaction_strategy::newest_bucket:
now 1597331160000000
buckets = {
key=1597331160000000, size=1
key=1597331100000000, size=2
key=1597331040000000, size=1
key=1597330980000000, size=1
}
Also the level of this log is changed from debug to trace, given that
now it's compressed and only printed once.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The task estimation was not taking into account that TWCS does size-tiered
on the the windows, and it only added 1 to the estimation when there
could be more tasks than that depending on the amount of SSTables in
all the existing size tiers.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That will be useful for allowing other compaction strategies that use
STCS to properly estimate the pending tasks.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
STCS will export a static function to estimate pending tasks, and
it relies on get_buckets() being static too.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
One of the most common code I write when investigating coredumps is
finding all objects of a certain type and creating a human readable
report of certain properties of these objects. This usually involves
retrieving all objects with a vptr with `find_vptrs()` and matching
their type to some pattern. I found myself writing this boilerplate over
and over again, so in this patch I introduce a convenience method to
avoid repeating it in the future.
Message-Id: <20200818145247.2358116-1-bdenes@scylladb.com>
scylla_fiber._walk() has an early return condition on the passed-in
pointer actually being a task pointer. The problem is that the actual
return statement was missing and only an error was logged. This resulted
in execution continuing and further weird errors being printed due to
the code not knowing how to handle the bad pointer.
Message-Id: <20200818144902.2357289-1-bdenes@scylladb.com>
thread_wake_task was moved into an anonymous namespace, add this new
fully qualified name to the task name white-list. Leave the old name for
backward compatibility.
While at it, also add `seastar::thread_context` which is also a task
object, for better seastar thread support.
Message-Id: <20200818142206.2354921-1-bdenes@scylladb.com>
Accessing the element of a tuple from the gdb command line is a
nightmare, add support to collection_element() retrieving one of its
elements to make this easier.
Message-Id: <20200818141123.2351892-1-bdenes@scylladb.com>
Currently we assign the reference to the vector of selected sstables to
`auto sst`. This makes a copy and we pass this local variable to
`do_for_each()`, which will result in a use-after-free if the latter
defers.
Fix by not making a copy and instead just keep the reference.
Fixes: #7060
Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200818091241.2341332-1-bdenes@scylladb.com>
"
operator_type is awkward because it's not copyable or assignable. Replace it with a new enum class.
Tests: unit(dev)
"
* dekimir-operator-type:
cql3: Drop operator_type entirely
cql3: Drop operator_type from the parser
cql3/expr: Replace operator_type with an enum
Replace operator_type with the nicer-behaved oper_t in CQL parser and,
consequently, in the relation hierarchy and column_condition.
After this, no references to operator_type remain in live code.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
operator_type is awkward because it's not copyable or assignable.
Replace it in expression representation with a new enum class, oper_t.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
As suggested by Avi, let's move the tarballs from
"build/dist/<mode>/tar" to "build/<mode>/dist/tar" to retain the
symmetry of different build modes, and make the tarballs easier to
discover. While at it, let's document the new tarball locations.
Message-Id: <20200818100427.1876968-1-penberg@scylladb.com>
fea83f6 introduced a race between processing (and hence removing)
sstables from `_sstables_with_tables` and registering new ones. This
manifested in sstables that were added concurrently with processing a
batch for the same sstables being dropped and the semaphore units
associated with them not returned. This resulted in repairs being
blocked indefinitely as the units of the semaphore were effectively
leaked.
This patch fixes this by moving the contents of `_sstables_with_tables`
to a local variable before starting the processing. A unit test
reproducing the problem is also added.
Fixes: #6892
Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200817160913.2296444-1-bdenes@scylladb.com>
Except scylla-python3, each scylla package has its own git repository, same package script filename, same build directory structure.
To put python3 thing on scylla repo, we created 'python3' directory on multiple locations, made '-python3' suffixed files, dig deeper build directory not to conflict scylla-server package build.
We should move all scylla-python3 related files to new repository, scylla-python3.
To keep compatibility with current Jenkins script, provide packages on
build/ directory for now.
Fixes#6751
After data segregation feature, anything that cause out-of-order writes,
like read repair, can result in small updates to past time windows.
This causes compaction to be very aggressive because whenever a past time
window is updated like that, that time window is recompacted into a
single SSTable.
Users expect that once a window is closed, it will no longer be written
to, but that has changed since the introduction of the data segregation
future. We didn't anticipate the write amplification issues that the
feature would cause. To fix this problem, let's perform size-tiered
compaction on the windows that are no longer active and were updated
because data was segregated. The current behavior where the last active
window is merged into one file is kept. But thereafter, that same
window will only be compacted using STCS.
Fixes#6928.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
TWCS is hard to extend because its knowledge on what to do with a window
bucket is duplicated in two functions. Let's remove this duplication by
placing the knowledge into a single function.
This is important for the coming change that will perform size-tiered
instead of major on windows that are no longer active.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Previously system.paxos TTL was set as max(3h, gc_grace_seconds).
Introduce new per-table option named `paxos_grace_seconds` to set
the amount of seconds which are used to TTL data in paxos tables
when using LWT queries against the base table.
Default value is equal to `DEFAULT_GC_GRACE_SECONDS`,
which is 10 days.
This change allows to easily test various issues related to paxos TTL.
Fixes#6284
Tests: unit (dev, debug)
Co-authored-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200816223935.919081-1-pa.solodovnikov@scylladb.com>
This is an abstraction for walking over mutations created by a write
coordinator, deconstructing them into ,,atomic'' pieces (,,changes''),
and consuming these pieces.
Read the big comment in cdc/change_visitor.hh for more details.
While Alternator doesn't yet support creating a table with a different
"server-side encryption" (a.k.a. encryption-at-rest) parameters, the
SSESpecification option with Enabled=false should still be allowed, as
it is just the default, and means exactly the same as would a missing
SSESpecification.
This patch also adds a test for this case, which failed on Alternator
before this patch.
Fixes#7031.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200812205853.173846-1-nyh@scylladb.com>
This patch adds a test that attributes which serve as a key for a
secondary index still appear in the OldImage in an Alternator Stream.
This is a special case, because although usually Alternator attributes
are saved as map elements, not stand-alone Scylla columns, in the special
case of secondary-index keys they *are* saved as actual Scylla columns
in the base table. And it turns out we produce wrong results in this case:
CDC's "preimage" does not currently include these columns if they didn't
change, while DynamoDB requires that all columns, not just the changed ones,
appear in OldImage. So the test added in this patch xfails on Alternator
(and as usual, passes on DynamoDB).
Refs #7030.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200812144656.148315-1-nyh@scylladb.com>
The "user.home" system property in JVM does not use the "HOME"
environment variable. This breaks Ant and Maven builds with Podman,
which attempts to look up the local Maven repository in "/root/.m2" when
building tools, for example:
build.xml:757: /root/.m2/repository does not exist.
To fix the issue, let's bind-mount an /etc/passwd file, which contains
host username for UID 0, which ensures that Podman container $USER and $HOME
are the same as on the host.
Message-Id: <20200817085720.1756807-1-penberg@scylladb.com>
"
gossip: Get rid of seed concept
The concept of seed and the different behaviour between seed nodes and
non seed nodes generate a lot of confusion, complication and error for
users. For example, how to add a seed node into into a cluster, how to
promote a non seed node to a seed node, how to choose seeds node in
multiple DC setup, edit config files for seeds, why seed node does not
bootstrap.
If we remove the concept of seed, it will get much easier for users.
After this series, seed config option is only used once when a new node
joins a cluster.
Major changes:
Seed nodes are only used as the initial contact point nodes.
Seed nodes now perform bootstrap. The only exception is the first node
in the cluster.
The unsafe auto_bootstrap option is now ignored.
Gossip shadow round now talks to all nodes instead of just seed nodes.
Refs: #6845
Tests: update_cluster_layout_tests.py + manual test
"
* 'gossip_no_seed_v2' of github.com:asias/scylla:
gossip: Get rid of seed concept
gossip: Introduce GOSSIP_GET_ENDPOINT_STATES verb
gossip: Add do_apply_state_locally helper
gossip: Do not talk to seed node explicitly
gossip: Talk to live endpoints in a shuffled fashion
The concept of seed and the different behaviour between seed nodes and
non seed nodes generate a lot of confusion, complication and error for
users. For example, how to add a seed node into into a cluster, how to
promote a non seed node to a seed node, how to choose seeds node in
multiple DC setup, edit config files for seeds, why seed node does not
bootstrap.
If we remove the concept of seed, it will get much easier for users.
After this series, seed config option is only used once when a new node
joins a cluster.
Major changes:
- Seed nodes are only used as the initial contact point nodes.
- Seed nodes now perform bootstrap. The only exception is the first node
in the cluster.
- The unsafe auto_bootstrap option is now ignored.
- Gossip shadow round now attempts to talk to all nodes instead of just seed nodes.
Manual test:
- bootstrap n1, n2, n3 (n1 and n2 are listed as seed, check only n1
will skip bootstrap, n2 and n3 will bootstrap)
- shtudown n1, n2, n3
- start n2 (check non seed node can boot)
- start n1 (check n1 talks to both n2 and n3)
- start n3 (check n3 talks to both n1 and n3)
Upgrade/Downgrade test:
- Initialize cluster
Start 3 node with n1, n2, n3 using old version
n1 and n2 are listed as seed
- Test upgrade starting from seed nodes
Rolling restart n1 using new version
Rolling restart n2 using new version
Rolling restart n3 using new version
- Test downgrade to old version
Rolling restart n1 using old version
Rolling restart n2 using old version
Rolling restart n3 using old version
- Test upgrade starting from non seed nodes
Rolling restart n3 using new version
Rolling restart n2 using new version
Rolling restart n1 using new version
Notes on upgrade procedure:
There is no special procedure needed to upgrade to Scylla without seed
concept. Rolling upgrade node one by one is good enough.
Fixes: #6845
Tests: ./test.py + update_cluster_layout_tests.py + manual test
Since older binutils on some distribution does not able to handle
compressed debuginfo generated on Fedora, we need to disable it.
However, debian packager force debuginfo compression since debian/compat = 9,
we have to uncompress them after compressed automatically.
Fixes#6982
"
This patch series changes the build system to build all tarballs to
build/dist/<mode>/tar directory. For example, running:
./tools/toolchain/dbuild ./configure.py --mode=dev && ./tools/toolchain/dbuild ninja-build dist-tar
produces the following tarballs in build/dist/dev/tar:
$ ls -1 build/dist/dev/tar/
scylla-jmx-package.tar.gz
scylla-package.tar.gz
scylla-python3-package.tar.gz
scylla-tools-package.tar.gz
This makes it easy to locate release tarballs for humans and scripts. To
preserve backward compatibility, the tarballs are also retained in their
original locations. Once release engineering infrastructure has been
adjusted to use the new locations, we can drop the duplicate copies.
"
* 'penberg/build-dist-tar/v1' of github.com:penberg/scylla:
configure.py: Copy tarballs to build/dist/<mode>/tar directory
configure.py: Add "dist-<component>-tar" targets
reloc/python3: Add "--builddir" to build_deb.sh
configure.py: Use copy-on-write copies when possible
"
Contains several fixes which improve debuggability in situations where
too large column ids are passed to column definition loop methods.
"
* 'schema-range-check-fix' of github.com:tgrabiec/scylla:
schema: Add table name and schema version to error messages
schema: Use on_internal_error() for range check errors
schema: Fix off-by-one in column range check
schema: Make range checks for regular and static columns the same as for clustering columns
select() is too generic for the method that retrieve sstable runs,
and it has a completely different meaning that the former select
method used to select sstables based on token range.
let's give it a more descriptive name.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200811193401.22749-1-raphaelsc@scylladb.com>
regression caused by 55cf219c97.
remove_by_toc_name() must work both for a sealed sstable with toc,
and also a partial sstable with tmp toc.
so dirname() should be called conditionally on the condition of
the sstable.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200813160612.101117-1-raphaelsc@scylladb.com>
LCS can have its overlapping invariant broken after operations that can
proceed in parallel to regular compaction like cleanup. That's because
there could be two compactions in parallel placing data in overlapping
token ranges of a given level > 0.
After reshape, the whole table will be rewritten, on restart, if a
given level has more than (fan_out*2)=20 overlaps.
That may sound like enough, but that's not taking into account the
exponential growth in # of SSTables per level, so 20 overlaps may
sound like a lot for level 2 which can afford 100 sstables, but it's
only 2% of level 3, and 0.2% of level 4. So let's change the
overlapping tolerance from the constant of fan_out*2 to 10% of level
limit on # of SSTables, or fan_out, whichever is higher.
Refs #6938.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200810154510.32794-1-raphaelsc@scylladb.com>
After 8014c7124, cleanup can potentially pick a compacting SSTable.
Upgrade and scrub can also pick a compacting SSTable.
The problem is that table::candidates_for_compaction() was badly named.
It misleads the user into thinking that the SSTables returned are perfect
candidates for compaction, but manager still need to filter out the
compacting SSTables from the returned set. So it's being renamed.
When the same SSTable is compacted in parallel, the strategy invariant
can be broken like overlapping being introduced in LCS, and also
some deletion failures as more than one compaction process would try
to delete the same files.
Let's fix scrub, cleanup and ugprade by calling the manager function
which gets the correct candidates for compaction.
Fixes#6938.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200811200135.25421-1-raphaelsc@scylladb.com>
Before this patch, modifying cdc/cdc_options.hh required recompiling 264
source files. This is because this header file was included by a couple
other header files - most notably schema.hh, where a forward declaration
would have been enough. Only the handful of source files which really
need to access the CDC options should include "cdc/cdc_options.hh" directly.
After this patch, modifying cdc/cdc_options.hh requires only 6 source files
to be recompiled.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200813070631.180192-1-nyh@scylladb.com>
C++17 introduced try_emplace for maps to replace a pattern:
if(element not in a map) {
map.emplace(...)
}
try_emplace is more efficient and results in a more concise code.
This commit introduces usage of try_emplace when it's appropriate.
Tests: unit(dev)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <4970091ed770e233884633bf6d46111369e7d2dd.1597327358.git.piotr@scylladb.com>
This testcase was temporarily commented out in 37ebe52, because it
relied on buggy (#6369) behaviour fixed by that commit. Specifically,
it expected a NULL comparison to match a NULL cell value. We now
bring it back, with corrected result expectation.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
The _clients is std::vector, it doesn't have _M_elems.
Luckily there's std_vector() class for it.
The seastar::rpc::server::_conns is unordered_map, not
unordered_set.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200814070858.32383-1-xemul@scylladb.com>
It is needed to print the boost::intrusive::list which is used
by repair_meta_for_masters in repair.
Fixes#7037
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Signed-off-by: Asias He <asias@scylladb.com>
C++20 introduced `contains` member functions for maps and sets for
checking whether an element is present in the collection. Previously
`count` function was often used in various ways.
`contains` does not only express the intend of the code better but also
does it in more unified way.
This commit replaces all the occurences of the `count` with the
`contains`.
Tests: unit(dev)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <b4ef3b4bc24f49abe04a2aba0ddd946009c9fcb2.1597314640.git.piotr@scylladb.com>
1. The node1 is shutdown
2. The node1 sends shutdown message to node2
3. The node2 receives gossip shutdown message but the handler yields
4. The node1 is restarted
5. The node1 sends new gossip endpoint_state to node2, node2 applies the state
in apply_state_locally and calls gossiper::handle_major_state_change
and then calls gossiper::mark_alive
6. The shutdown message handler in step 3 resumes and sets status of node1 to SHUTDOWN
7. The gossiper::mark_alive fiber in step 5 resumes and calls gossiper::real_mark_alive,
node2 will skip to mark node1 as alive because the status of node1 is
SHUTDOWN. As a result, node1 is alive but it is not marked as UP by node2.
To fix, we serialize the two operations.
Fixes#7032
Merged pull request https://github.com/scylladb/scylla/pull/7028
By Calle Wilund:
Changes the "preimage" option from binary true/false to on/off/full (accepting true/false, and using old style notation for normal to string - for upgrade reasons), where "full" will force us to include all columns in pre image log rows.
Adds small test (just adding the case to preimage test).
Uses the feature in alternator
Fixes#7030
alternator: Set "preimage" to "full" for streams
cdc_test: Do small test of "full"
cdc: Make pre image optionally "full" (include all columns)
Fixes#7030
Dynamo/alternator streams old image data is supposed to
contain the full old value blob (all keys/values).
Setting preimage=full ensures we get even those properties
that have separate columns if they are not part of an actual
modification.
Makes the "preimage" option for cdc non-binary, i.e. it can now
be "true"/"on", "false"/"off" or "full. The two former behaving like
previously, the latter obviously including all columns in pre image.
Merged pull request https://github.com/scylladb/scylla/pull/7018
by Piotr Sarna:
This series addresses various issues with metrics and semaphores - it mainly adds missing metrics, which makes it possible to see the length of the queues attached to the semaphores. In case of view building and view update generation, metrics was not present in these services at all, so a first, basic implementation is added.
More precise semaphore metrics would ease the testing and development of load shedding and admission control.
view_builder: add metrics
db, view: add view update generator metrics
hints: track resource_manager sending queue length
hints: add drain queue length to metrics
table: add metrics for sstable deletion semaphore
database: remove unused semaphore
The `shard` parameter of `find_db()` is optional and is defaulted to
`None`. When missing, the current shard's database instance is returned.
The problem is that the if condition checking this uses `not shard`,
which also evaluates to `True` if `shard == 0`, resulting in returning
the current shard's database instance for shard 0. Change the condition
to `shard is None` to avoid this.
Fixes: #7016
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200812091546.1704016-1-bdenes@scylladb.com>
"
This series adds progress metrics for the node operations. Metrics for bootstrap and rebuild progress are added as a starter. I will add more for the remaining operations after getting feedback.
With this the Scylla Monitor and Scylla Manager can know the progress of the bootstrap and other node operations. E.g.,
scylla_node_ops_bootstrap_nr_ranges_finished{shard="0",type="derive"} 50
scylla_node_ops_bootstrap_nr_ranges_total{shard="0",type="derive"} 1040
Fixes#1244, #6733
"
* 'repair_progress_metrics_v3' of github.com:asias/scylla:
repair: Add progress metrics for repair ops
repair: Add progress metrics for rebuild ops
repair: Add progress metrics for bootstrap ops
"
It is pretty hard to find the repair_meta object when debugging a core.
This patch makes it is easier by putting repair_meta object created by
both repair follower and master into a map.
Fixes#7009
"
* asias-repair_make_debug_eaiser_track_all_repair_metas:
repair: Add repair_meta_tracker to track repair_meta for followers and masters
repair: Move thread local object _repair_metas out of the function
* 'espindola/move-out-of-line' of https://github.com/espindola/scylla:
test: Move code in sstable_run_based_compaction_strategy_for_tests.hh out of line
test: Drop ifdef now that we always use c++20
test: Move sstable_run_based_compaction_strategy_for_tests.hh to test/lib
It is pretty hard to find the repair_meta object when debugging a core.
This patch makes it is easier by putting repair_meta object created by
both repair follower and master into boost intrusive list.
Fixes#7009
test is currently flaky since system reads can happen
in the background and disturb the global row cache stats.
Use the table's row_cache stats instead.
Fixes#6773
Test: cql_query_test.test_cache_bypass(dev, debug)
Credit-to: Botond Dénes <bdenes@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200811140521.421813-1-bhalevy@scylladb.com>
make all required allocations in advance to merging sstables
into _compacting_sstables so it should not throw
after registering some sstables, but not all.
Test: database_test(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200811132440.416945-1-bhalevy@scylladb.com>
Currently when running against a debug build, our integration test suite
suffers from a ton of timeout related error logs, caused by auth queries
timing out. This causes spurious test failures due to the unexpected
error messages in the log.
This patch increases the timeout for internal distributed auth queries
in debug mode, to give the slow debug builds more headroom to meet the
timeout.
Refs: #6548
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200811145757.1593350-1-bdenes@scylladb.com>
"
Let users disable the unencrypted native transport too by setting the port to
zero in the scylla.yaml configuration file.
Fixes#6997
"
* penberg-penberg/native-transport-disable:
docs/protocol: Document CQL protocol port configuration options
transport: Allow user to disable unencrypted native transport
"
Refs #6148
Separates disk usage into two cases: Allocated and used.
Since we use both reserve and recycled segments, both
which are not actually filled with anything at the point
of waiting.
Also refuses to recycle segments or increase reserve size
if our current disk footprint exceeds threshold.
And finally uses some initial heuristics to determine when
we should suggest flushing, based on disk limit, segment
size, and current usage. Right now, when we only have
a half segment left before hitting used == max.
Some initial tests show an improved adherence to limit
though it will still be exceeded, because we do _not_
force waiting for segments to become cleared or similar
if we need to add data, thus slow flushing can still make
usage create extra segments. We will however attempt to
shrink disk usage when load is lighter.
Somewhat unclear how much this impacts performance
with tight limits, and how much this matters.
"
* elcallio-calle/commitlog_size:
commitlog: Make commitlog respect disk limit better
commitlog: Demote buffer write log messages to trace
The service constructor included a check ensuring that only
standard_role_manager can be used with password_authenticator. But
after 00f7bc6, password_authenticator does not depend on any action of
standard_role_manager. All queries to meta::roles_table in
password_authenticator seem self-contained: the table is created at
the start if missing, and salted_hash is CRUDed independently of any
other columns bar the primary key role_col_name.
NOTE: a nonstandard role manager may not delete a role's row in
meta::roles_table when that role is dropped. This will result in
successful authentication for that non-existing role. But the clients
call check_user_can_login() after such authentication, which in turn
calls role_manager::exists(role). Any correctly implemented role
manager will then return false, and authentication_exception will be
thrown. Therefore, no dependencies exist on the role-manager
behaviour, other than it being self-consistent.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
"
This path set fixes stalls in repair that are caused by std::list merge and clear operations during test_latency_read_with_nemesis test.
Fixes#6940Fixes#6975Fixes#6976
"
* 'fix_repair_list_stall_merge_clear_v2' of github.com:asias/scylla:
repair: Fix stall in apply_rows_on_master_in_thread and apply_rows_on_follower
repair: Use clear_gently in get_sync_boundary to avoid stall
utils: Add clear_gently
repair: Use merge_to_gently to merge two lists
utils: Add merge_to_gently
The row_diff list in apply_rows_on_master_in_thread and
apply_rows_on_follower can be large. Modify do_apply_rows to remove the
row from the list when the row is consumed to avoid stall when the list
is destroyed.
Fixes#6975
"
Make do_io_check and the io_check functions that
call it noexcept. Up to sstable_write_io_check
and sstable_touch_directory_io_check.
Tests: unit (dev)
"
* tag 'io-check-noexcept-v1' of github.com:bhalevy/scylla:
ssstable: io_check functions: make noexcept
utils: do_io_check: adjust indentation
utils: io_check: make noexcept for future-returning functions
Refs #6148
Separates disk usage into two cases: Allocated and used.
Since we use both reserve and recycled segments, both
which are not actually filled with anything at the point
of waiting.
Also refuses to recycle segments or increase reserve size
if our current disk footprint exceeds threshold.
And finally uses some initial heuristics to determine when
we should suggest flushing, based on disk limit, segment
size, and current usage. Right now, when we only have
a half segment left before hitting used == max.
Some initial tests show an improved adherence to limit
though it will still be exceeded, because we do _not_
force waiting for segments to become cleared or similar
if we need to add data, thus slow flushing can still make
usage create extra segments. We will however attempt to
shrink disk usage when load is lighter.
Somewhat unclear how much this impacts performance
with tight limits, and how much this matters.
v2:
* Add some comments/explanations
v3:
* Made disk footprint subtract happen post delete (non-optimistic)
"
This series adds support for the "md" sstable format.
Support is based on the following:
* do not use clustering based filtering in the presence
of static row, tombstones.
* Disabling min/max column names in the metadata for
formats older than "md".
* When updating the metadata, reset and disable min/max
in the presence of range tombstones (like Cassandra does
and until we process them accurately).
* Fix the way we maintain min/max column names by:
keeping whole clustering key prefixes as min/max
rather than calculating min/max independently for
each component, like Cassandra does in the "md" format.
Fixes#4442
Tests: unit(dev), cql_query_test -t test_clustering_filtering* (debug)
md migration_test dtest from git@github.com:bhalevy/scylla-dtest.git migration_test-md-v1
"
* tag 'md-format-v4' of github.com:bhalevy/scylla: (27 commits)
config: enable_sstables_md_format by default
test: cql_query_test: add test_clustering_filtering unit tests
table: filter_sstable_for_reader: allow clustering filtering md-format sstables
table: create_single_key_sstable_reader: emit partition_start/end for empty filtered results
table: filter_sstable_for_reader: adjust to md-format
table: filter_sstable_for_reader: include non-scylla sstables with tombstones
table: filter_sstable_for_reader: do not filter if static column is requested
table: filter_sstable_for_reader: refactor clustering filtering conditional expression
features: add MD_SSTABLE_FORMAT cluster feature
config: add enable_sstables_md_format
database: add set_format_by_config
test: sstable_3_x_test: test both mc and md versions
test: Add support for the "md" format
sstables: mx/writer: use version from sstable for write calls
sstables: mx/writer: update_min_max_components for partition tombstone
sstables: metadata_collector: support min_max_components for range tombstones
sstable: validate_min_max_metadata: drop outdated logic
sstables: rename mc folder to mx
sstables: may_contain_rows: always true for old formats
sstables: add may_contain_rows
...
C++20 introduced `contains` member functions for maps and sets for
checking whether an element is present in the collection. Previously
the code pattern looked like:
<collection>.find(<element>) != <collection>.end()
In C++20 the same can be expressed with:
<collection>.contains(<element>)
This is not only more concise but also expresses the intend of the code
more clearly.
This commit replaces all the occurences of the old pattern with the new
approach.
Tests: unit(dev)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <f001bbc356224f0c38f06ee2a90fb60a6e8e1980.1597132302.git.piotr@scylladb.com>
"
Make sure to close sstable files also on error paths.
Refs #5509Fixes#6448
Tests: unit (dev)
"
* tag 'sstable-close-files-on-error-v6' of github.com:bhalevy/scylla:
sstable: file_writer: auto-close in destructor
sstable: file_writer: add optional filename member
sstable: add make_component_file_writer
sstable: remove_by_toc_name: accept std::string_view
sstable: remove_by_toc_name: always close file and input stream
sstable: delete_sstables: delete outdated FIXME comment
sstable: remove_by_toc_name: drop error_handler parameter
sstable: remove_by_toc_name: make static
sstable: read_toc: always close file
sstable: mark read_toc and methods calling it noexcept
sstable: read_toc: get rid of file_path
sstable: open_data, create_data: set member only on success.
sstable: open_file: mark as noexcept
sstable: new_sstable_component_file: make noexcept
sstable: new_sstable_component_file: close file on failure
sstable: rename_new_sstable_component_file: do not pass file
sstable: open_sstable_component_file_non_checked: mark as noexcept
sstable: open_integrity_checked_file_dma: make noexcept
sstable: open_integrity_checked_file_dma: close file on failure
The following metric is added:
scylla_node_maintenance_operations_repair_finished_percentage{shard="0",type="gauge"} 0.650000
It is the number of finished percentage for all ongoing repair operations.
When all ongoing repair operations finish, the percentage stays at 100%.
Fixes#1244, #6733
Although python2 should be a distant memory by now, the reality is that
we still need to debug scylla on platforms that still have no python3
available (centos7), so we need to keep scylla-gdb.py python2
compatible.
Refs: #7014
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200811093753.1567689-1-bdenes@scylladb.com>
maintainers.md contains a very helpful explanation of how to backport
Seastar fixes to old branches of Scylla, but has a tiny typo, which
this patch corrects.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200811095350.77146-1-nyh@scylladb.com>
Fixes#6948
Changes the stream_id format from
<token:64>:<rand:64>
to
<token:64>:<rand:38><index:22><version:4>
The code will attempt to assert version match when
presented with a stored id (i.e. construct from bytes).
This means that ID:s created by previous (experimental)
versions will break.
Moves the ID encoding fully into the ID class, and makes
the code path private for the topology generation code
path.
Removes some superflous accessors but adds accessors for
token, version and index. (For alternator etc).
Add unit tests reproducing https://github.com/scylladb/scylla/issues/3552
with clustering-key filtering enabled.
enable_sstables_md_format option is set to true as clustering-key
filtering is enabled only for md-format sstables.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that it is safe to filter md format sstable by min/max column names
we can remove the `filtering_broken` variable that disabled filtering
in 19b76bf75b to fix#4442.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To prevent https://github.com/scylladb/scylla/issues/3552
we want to ensure that in any case that the partition exists in any
sstable, we emit partition_start/end, even when returning no rows.
In the first filtering pass, filter_sstable_for_reader_by_pk filters
the input sstables based on the partition key, and num_sstables is set the size
of the sstables list after the first filtering pass.
An empty sstables list at this stage means there are indeed no sstables
with the required partition so returning an empty result will leave the
cache in the desired state.
Otherwise, we filter again, using filter_sstable_for_reader_by_ck,
and examine the list of the remaining readers.
If num_readers != num_sstables, we know that
some sstables were filterd by clustering key, so
we append a flat_mutation_reader_from_mutations to
the list of readers and return a combined reader as before.
This will ensure that we will always have a partition_start/end
mutations for the queried partition, even if the filtered
readers emit no rows.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
With the md sstable format, min/max column names in the metadata now
track clustering rows (with or without row tombstones),
range tombstones, and partition tombstones (that are
reflected with empty min/max column names - indicating
the full range).
As such, min and max column names may be of different lengths
due to range tombstones and potentially short clustering key
prefixes with compact storage, so the current matching algorithm
must be changed to take this into account.
To determine if a slice range overlaps the min/max range
we are using position_range::overlaps.
sstable::clustering_components_ranges was renamed to position_range
as it now holds a single position_range rather than a vector of bytes_view ranges.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Move contains_rows from table code to sstable::may_contain_rows
since its implementation now has too specific knowledge of sstable
internals.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Static rows aren't reflected in the sstable min/max clustering keys metadata.
Since we don't have any indication in the metadata that the sstable stores
static rows, we must read all sstables if a static column is requested.
Refs #3553
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We're about to drop `filtering_broken` in a future patche
when clustering filtering can be supported for md-format sstables.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
MD format is disabled by default at this point.
The option extends enable_sstables_mc_format
so that both are needed to be set for supporting
the md format.
The MD_FORMAT cluster feature will be added in
a following patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This is required for test applications that may select a sstable
format different than the default mc format, like perf_fast_forward.
These apps don't use the gossip-based sstables_format_selector
to set the format based on the cluster feature and so they
need to rely on the db config.
Call set_format_by_config in single_node_cql_env::do_with.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Run the test cases that write sstables using both the
mc and md versions. Note that we can still compare the
resulting Data, Index, Digest, and Filter components
with the prepared mc sstables we have since these
haven't changed in md.
We take special consideration around validating
min/max column names that are now calculated using
a revised algorithm in the md format.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Test also the md format in all_sstable_versions.
Add pre-computed md-sstable files generated using Cassandra version 3.11.7
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than using a constant sstable_version_types::mc.
In preparation to supporting sstable_version_types::md.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Partition tombstones represent an implicit clustering range
that is unbound on both sides, so reflect than in min/max
column names metadata using empty clustering key prefixes.
If we don't do that, when using the sstable for filtering, we have no
other way of distinguishing range tombstones from partition tombstones
given the sstable metadata and we would need to include any sstable
with tombstones, even if those are range tombstone, for which
we can do a better filtering job, using the sstable min/max
column names metadata.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We essentially treat min/max column names as range bounds
with min as incl_start and max as incl_end.
By generating a bound_view for min/max column names on the fly,
we can correctly track and compare also short clustering
key prefixes that may be used as bounds for range tombstones.
Extend the sstable_tombstone_metadata_check unit test
to cover these cases.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The following checks were introduced in 0a5af61176
To deal with a bug in min max metadata generation of our own,
from a time where only ka / la were supported.
This is no longer relevant now that we'll consider min_max_column_names
only for sstable format > mc (in sstable::may_contain_rows)
We choose not to clear_incorrect_min_max_column_names
from older versions here as this disturbs sstable unit tests.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
the min/max column names metadata can be trusted only
starting the md format, so just always return `true`
for older sstable formats.
Note that we could achieve that by clearing the min/max
metadata in set_clustering_components_ranges but we choose
not to do so since it disturbs sstable unit tests
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Move the logic from table to sstable as it will contain
intimate knowledge of the sstable min/max column names validity
for md format.
Also, get rid of the sstable::clustering_components_ranges() method
as the member is used only internally by the sstable code now.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add the sstable_version_types::md enum value
and logically extend sstable_version_types comparisons to cover
also the > sstable_version_types::mc cases.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Instead of keeping the `_has_min_max_clustering_keys` flag,
just store an empty key for `_{min,max}_clustering_key` to represent
the full range. These will never be narrowed down and will be
encoded as empty min/max column names as if they weren't set.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently we compare each min/max component independently.
This may lead to suboptimal, inclusive clustering ranges
that do not indicate any actual key we encountered.
For example: ['a', 2], ['b', 1] will lead to min=['a', 1], max=['b', 2]
instead of the keys themselves.
This change keeps the min or max keys as a whole.
It considers shorter clustering prefixes (that are possible with compact
storage) as range tombstone bounds, so that a shorter key is considered
less than the minimum if the latter has a common prefix, and greater
than the maximum if the latter has a common prefix.
Extend the min_max_clustering_key_test to test for this case.
Previously {"a", "2"}, {"b", "1"} clustering keys would erronuously
end up with min={"a", "1"} max={"b", "2"} while we want them to be
min={"a", "2"} max={"b", "1"}.
Adjust sstable_3_x_test to ignore original mc sstables that were
previously computed with different min/max column names.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Pass the sstable schema to the metadata_collector constructor.
Note that the long term plan is to move metadata_collector
to the sstable writer but this requires a bigger change to
get rid of the dependencies on it in the legacy writer code
in class sstable methods.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
C++20 introduced std::erase_if which simplifies removal of elements
from the collection. Previously the code pattern looked like:
<collection>.erase(
std::remove_if(<collection>.begin(), <collection>.end(), <predicate>),
<collection>.end());
In C++20 the same can be expressed with:
std::erase_if(<collection>, <predicate>);
This commit replaces all the occurences of the old pattern with the new
approach.
Tests: unit(dev)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <6ffcace5cce79793ca6bd65c61dc86e6297233fd.1597064990.git.piotr@scylladb.com>
Fixes#6995
In c2c6c71 the assert on replay positions in flushed sstables discarded by
truncate was broken, by the fact that we no longer flush all sstables
unless auto snapshot is enabled.
This means the low_mark assertion does not hold, because we maybe/probably
never got around to creating the sstables that would hold said mark.
Note that the (old) change to not create sstables and then just delete
them is in itself good. But in that case we should not try to verify
the rp mark.
"
not recommended setups will still run iotune
fixes#6631
"
* tarzanek-gcp-iosetup:
scylla_io_setup: Supported GCP VMs with NVMEs get out of box I/O configs
scylla_util.py: add support for gcp instances
scylla_util.py: support http headers in curl function
scylla_io_setup: refactor iotune run to a function
"
As I described at https://github.com/scylladb/scylla/issues/6973#issuecomment-669705374,
we need to avoid disk full on scylla_swap_setup, also we should provide manual configuration of swap size.
This PR provides following things:
- use psutil to get memtotal and disk free, since it provides better APIs
- calculate swap size in bytes to avoid causing error on low-memory environment
- prevent to use more than 50% of disk space when auto-configured swap size, abort setup when almost no disk space available (less than 2GB)
- add --swap-size option to specify swap size both on scylla_swap_setup and scylla_setup
- add interactive swap size prompt on scylla_setup
Fixes#6947
Related #6973
Related scylladb/scylla-machine-image#48
"
* syuu1228-scylla_swap_setup_improves:
scylla_setup: add swap size interactive prompt on swap setup
scylla_swap_setup: add --swap-size option to specify swap size
scylla_swap_setup: limit swapfile size to half of diskspace
scylla_swap_setup: calculate in bytes instead of GB
scylla_swap_setup: use psutil to get memtotal and disk free
Sometimes (e.g. when investigating a suspected heap use-after-free) it
is useful to include dead objects in the search results. This patch adds
a new option to scylla find to enable just that. Also make sure we
save and print the offset of the address in the object for dead
objects, just like we do for live ones.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200810091202.1401231-1-bdenes@scylladb.com>
We should not fill entire disk space with swapfile, it's safer to limit
swap size 50% of diskspace.
Also, if 50% of diskspace <= 1GB, abort setup since it's out of disk space.
Converting memory & disk sizes to an int value of N gigabytes was too rough,
it become problematic in low memory size / low disk size environment, such as
some types of EC2 instances.
We should calculate in bytes.
To get better API of memory & disk statistics, switch to psutil.
With the library we don't need to parse /proc/meminfo.
[avi: regenerate tools/toolchain/image for new python3-psutils package]
It is used only for updating the metadata_collector {min,max}_column_names.
Implement metadata_collector::do_update_min_max_components in
sstables/metadata_collector.cc that will be used to host some other
metadata_collector methods in following patches that need not be
implemented in the header file.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The following metric is added:
scylla_node_maintenance_operations_rebuild_finished_percentage{shard="0",type="gauge"} 0.650000
It is the number of finished percentage for rebuild operation so far.
Fixes#1244, #6733
The following metric is added:
scylla_node_maintenance_operations_bootstrap_finished_percentage{shard="0",type="gauge"} 0.850000
It is the number of finished percentage for bootstrap operation so far.
Fixes#1244, #6733
Otherwise we may trip the following
assert(_closing_state == state::closed);
in ~append_challenged_posix_file_impl
when the output_stream is destructed.
Example stack trace:
non-virtual thunk to seastar::append_challenged_posix_file_impl::~append_challenged_posix_file_impl() at /jenkins/slave/workspace/scylla-3.2/build/scylla/seastar/include/seastar/core/future.hh:944
seastar::shared_ptr_count_for<checked_file_impl>::~shared_ptr_count_for() at crtstuff.c:?
seastar::shared_ptr<seastar::file_impl>::~shared_ptr() at /jenkins/slave/workspace/scylla-3.2/build/scylla/seastar/include/seastar/core/future.hh:944
(inlined by) seastar::file::~file() at /jenkins/slave/workspace/scylla-3.2/build/scylla/seastar/include/seastar/core/file.hh:155
(inlined by) seastar::file_data_sink_impl::~file_data_sink_impl() at /jenkins/slave/workspace/scylla-3.2/build/scylla/seastar/src/core/fstream.cc:312
(inlined by) seastar::file_data_sink_impl::~file_data_sink_impl() at /jenkins/slave/workspace/scylla-3.2/build/scylla/seastar/src/core/fstream.cc:312
seastar::output_stream<char>::~output_stream() at crtstuff.c:?
sstables::sstable::write_crc(sstables::checksum const&) [clone .cold] at sstables.cc:?
sstables::mc::writer::close_data_writer() at crtstuff.c:?
sstables::mc::writer::consume_end_of_stream() at crtstuff.c:?
sstables::sstable::write_components(flat_mutation_reader, unsigned long, seastar::lw_shared_ptr<schema const>, sstables::sstable_writer_config const&, encoding_stats, seastar::io_priority_class const&)::{lambda()#1}::operator()() at sstables.cc:?
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Unify common code for file creation and file_writer construction
for sstable components.
It is defined as noexcept based on `new_sstable_component_file`
and makes sure the file is closed on error by using `file_writer::make`
that guarantees that.
Will be used for auto-closing the writer as a RAII object.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Get rid of seastar::async.
Use seastar::with_file to make sure the opened file is always closed
and move in.close() into a .finally continuation.
While at it, make remove_by_toc_name noexcept.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
* seastar eb452a22a0...e615054c75 (13):
> memory: fix small aligned free memory corruption
Fixes#6831
> logger: specify log methods as noexcept
> logger: mark trivial methods as noexcept
> everywhere: Replace boost::optional with std::optional
> httpd: fix Expect header case sensitivity
> rpc: Avoid excessive casting in has_handler() helper
> test: Increase capacity of fair-queue unit test case
> file_io_test: Simplify a test with memory::with_allocation_failures
> Merge 'Add HTTP/1.1 100 Continue' from Wojciech
Fixes#6844
> future: Use "if constexpr"
> future: Drop code that was avoiding a gcc 8 warning
> file: specify alignment get methods noexcept
> merge: Specify abort_source subscription handlers as noexcept
read_toc can be marked as noexcept now that new_sstable_component_file is.
With that, other methods that call it can be marked noexcept too.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
In preparation for closing the file in all paths,
get rid of the file_path sstring and just recompute
it as needed on error paths using the this->filename method.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that sstable::open_file is noexcept,
if any of open or create of data/index doesn't succeed,
we don't set the respective sstable member and return the failure.
When destroyed, the sstable destructor will close any file (data or index),
that we managed to open.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that new_sstable_component_file is noexcept,
open_file can be specified as noexcept too.
This makes error handling in create/open sstable
data and index files easier using when_all_succeed().
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently the function is handed over a `file` that it just passes through
on success. Let its single caller do that to simplify its error handling.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that open_integrity_checked_file_dma (and open_file_dma)
are noexcept, open_sstable_component_file_non_checked can be noexcept too.
Also, get a std::string_view name instead of a const sstring&
to match both open_integrity_checked_file_dma and open_file_dma
name arg.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Convert to accepting std::string_view name.
Move the sstring allocation to make_integrity_checked_file
that may already throw.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Use seastar::with_file_close_on_failure to make sure the file
we open is closed on failure of the continuation,
as make_integrity_checked_file may throw from ::make_shared.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Move the classes representing CQL expressions (and utility functions
on them) from the `restrictions` namespace to a new namespace `expr`.
Most of the restriction.hh content was moved verbatim to
expression.hh. Similarly, all expression-related code was moved from
statement_restrictions.cc verbatim to expression.cc.
As suggested in #5763 feedback
https://github.com/scylladb/scylla/pull/5763#discussion_r443210498
Tests: dev (unit)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Iotune isn't run for supported/recommended GCP instances anymore,
we set the I/O properties now based on GCP tables or our
measurements(where applicable).
Not recommended/supported setups will still run iotune.
Fixes#6631
GCP VMs with NVMEs that are recommended can be recognized now.
Detection is done using resolving of internal GCP metadata server.
We recommend 2 cpu instances at least. For more than 16 disks
we mandate at least 32 cpus. 50:1 disk to ram ratio also has to be
kept.
Instances that use NVMEs as root disks are also considered as unsupported.
Supported instances for NVMEs are n1, n2, n2d, c2 and m1-megamem-96.
All others are unsupported now.
While answering a stackoverflow question on how to create an item but only
if we don't already have an item with the same key, I realized that we never
actually tested that ConditionExpressions works on key columns: all the
tests we had in test_condition_expression.py had conditions on non-key
attributes. So in this patch we add two tests with a condition on the key
attribute.
Most examples of conditions on the key attributes would be silly, but
in these two tests we demonstrate how a test on key attributes can be
useful to solve the above need of creating an item if no such item
exists yet. We demonstrate two ways to do this using a condition on
the key - using either the "<>" (not equal) operator, or the
"attribute_not_exists()" function.
These tests pass - we don't have a bug in this. But it's nice to have
a test that confirms that we don't (and don't regress in that area).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200806200322.1568103-1-nyh@scylladb.com>
* tools/java aa7898d771...f2c7cf8d8d (2):
> dist: debian: support non-x86
> NodeProbe: get all histogram values in a single call
* tools/jmx 626fd75...c5ed831 (1):
> dist: debian: support non-x86
"
The current implementation of B+ benefits from using SIMD
instruction in intra-nodes keys search. This set adds this
functionality.
The general idea behind the implementation is in "asking"
the less comparator if it is the plain "<" and allows for
key simplification to do this natural comparison. If it
does, the search key is simplified to int64_t, the node's
array of keys is casted to array of integers, then both are
fed into avx-optimized searcher.
The searcher should work on nodes that are not filled with
keys. For performance the "unused" keys are set to int64_t
minimum, the search loop compares them too (!) and adjusts
the result index by node size. This needs some care in the
maybe_key{} wrapper.
fixes: #186
tests: unit(dev)
"
* 'br-bptree-avx-b' of https://github.com/xemul/scylla:
utils: AVX searcher
bptree: Special intra-node key search when possible
bptree: Add lesses to maybe_key template
token: Restrict TokenCarrier concept with noexcept
The column names in SlicePredicate can be passed in arbitrary order.
We converted them to clustering ranges in read_command preserving the
original order. As a result, the clustering ranges in read command may
appear out of order. This violates storage engine's assumptions and
lead to undefined behavior.
It was seen manifesting as a SIGSEGV or an abort in sstable reader
when executing a get_slice() thrift verb:
scylla: sstables/consumer.hh:476: seastar::future<> data_consumer::continuous_data_consumer<StateProcessor>::fast_forward_to(size_t, size_t) [with StateProcessor = sstables::data_consume_rows_context_m; size_t = long unsigned int]: Assertion `end >= _stream_position.position' failed.
Fixes#6486.
Tests:
- added a new dtest to thrift_tests.py which reproduces the problem
Message-Id: <1596725657-15802-1-git-send-email-tgrabiec@scylladb.com>
In case of an initialization failure after
db.get_compaction_manager().enable();
But before stop_database, we would never stop the compaction manager
and it would assert during destruction.
I am trying to add a test for this using the memory failure injector,
but that will require fixing other crashes first.
Found while debugging #6831.
Refs #6831.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200805181840.196064-1-espindola@scylladb.com>
With all the preparations made so far it's now possible to implement
the avx-powered search in an array.
The array to search in has both -- capacity and size, so searching in
it needs to take allocated, but unused tail into account. Two options
for that -- limit the number of comparisons "by hands" or keep minimal
and impossible value in this tail, scan "capacity" elements, then
correct the result with "size" value. The latter approach is up to 50%
faster than any (tried) attempt to do the former one.
The run-time selection of the array search code is done with the gnu
target attribute. It's available since gcc 4.8. For AVX-less platforms
the default linear scanner is used.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
If the key type is int64_t and the less-comparator is "natural" (i.e. it's
literally 'a < b') we may use the SIMD instructions to search for the key
on a node. Before doing so, the maybe_key and the searcher should be prepared
for that, in particular:
1. maybe_key should set unused keys to the minimal value
2. the searcher for this case should call the gt() helper with
primitive types -- int64_t search key and array of int64_t values
To tell to B+ code that the key-less pair is such the less-er should define
the simplify_key() method converting search keys to int64_t-s.
This searcher is selected automatically, if any mismatch happens it silently
falls back to default one. Thus also add a static assertion to the row-cache
to mitigate this.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The way maybe_key works will be in-sync with the intra-node searching
code and will require to know what the Less type is, so prepare for that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The <search-key>.token() is noexcept currently and will have
to be explicitly such for future optimized key searcher, so
restrict the constraint and update the related classes.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The single-node test Scylla run by test/alternator/run uses, as the
default, 256 vnodes. When we have 256 vnodes and two shards, our CDC
implementation produces 512 separate "streams" (called "shards" in
DynamoDB lingo). This causes each of the tests in test_streams.py
which need to read data from the stream to need to do 1024 (!) API
requests (512 calls to GetShardIterator and 512 calls to GetRecords)
which takes significant time - about a second per test.
In this patch, we reduce the number of vnodes to 16. We still have
a non-negligible number of stream "shards" (32) so this part of the
CDC code is still exercised. Moreover, to ensure we still routinely
test the paging feature of DescribeStream (whose default page size
is 100), the patch changes the request to use a Limit of 10, so
paging will still be used to retrieve the list of 32 shards.
The time to run the 27 tests in test_streams.py, on my laptop:
Before this patch: 26 seconds
After this patch: 6 seconds.
Fixes#6979
Message-Id: <20200805093418.1490305-1-nyh@scylladb.com>
This patch adds additional tests for Alternator Streams, which helped
uncover 9 new issues.
The stream tests are noticibly slower than most other Alternator tests -
test_streams.py now has 27 tests taking a total of 20 seconds. Much of this
slowness is attributed to Alternator Stream's 512 "shards" per stream in the
single-node test setup with 256 vnodes, meaning that we need over 1000 API
requests per test using GetRecords. These tests could be made significantly
faster (as little as 4 seconds) by setting a lower number of vnodes.
Issue #6979 is about doing this in the future.
The tests in this patch have comments explaining clearly (I hope) what they
test, and also pointing to issues I opened about the problems discovered
through these tests. In particular, the tests reproduce the following bugs:
Refs #6918
Refs #6926
Refs #6930
Refs #6933
Refs #6935
Refs #6939
Refs #6942
The tests also work around the following issues (and can be changed to
be more strict and reproduce these issues):
Refs #6918
Refs #6931
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200804154755.1461309-1-nyh@scylladb.com>
"
This patch series converts a few more global variables from sstring to
constexpr std::string_view.
Doing that makes it impossible for them to be part of any
initialization order problems.
"
* 'espindola/more-constexpr-v2' of https://github.com/espindola/scylla:
auth: Turn DEFAULT_USER_NAME into a std::string_view variable
auth: Turn SALTED_HASH into a std::string_view variable
auth: Turn meta::role_members_table::qualified_name into a std::string_view variable
auth: Turn meta::roles_table::qualified_name into a std::string_view variable
auth: Turn password_authenticator_name into a std::string_view variable
auth: Inline default_authorizer_name into only use
auth: Turn allow_all_authorizer_name into a std::string_view variable
auth: Turn allow_all_authenticator_name into a std::string_view variable
If initialization of a TLS variable fails there is nothing better to
do than call std::unexpected.
This also adds a disable_failure_guard to avoid errors when using
allocation error injection.
With init() being noexcept, we can also mark clear_functions.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200804180550.96150-1-espindola@scylladb.com>
There is no constexpr operator+ for std::string_view, so we have to
concatenate the strings ourselves.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Merged pull request https://github.com/scylladb/scylla/pull/6910
by Wojciech Mitros:
This patch enables selecting more than 2^32 rows from a table. The change
becomes active after upgrading whole cluster - until then old limits are
used.
Tested reading 4.5*10^9 rows from a virtual table, manually upgrading a
cluster with ccm and performing cql SELECT queries during the upgrade,
ran unit tests in dev mode and cql and paging dtests.
tests: add large paging state tests
increase the maximum size of query results to 2^64
Add a unit test checking if the top 32 bits of the number of remaining rows in paging state
is used correctly and a manual test checking if it's possible to select over 2^32 rows from a table
and a virtual reader for this table.
This commit takes out some responsibilities of `cdc::transformer`
(which is currently a big ball of mud) into a separate class.
This class is a simple abstraction for creating entries in a CDC log
mutation.
Low-level calls to the mutation API (such as `set_cell`) inside
`cdc::transformer` were replaced by higher-level calls to the builder
abstraction, removing some duplication of logic.
util/loading_cache.hh includes adjusted.
* seastar 02ad74fa7d...eb452a22a0 (17):
> core: add missing include for std::allocator_traits
> exceptions: move timed_out_error and factory into its own header file
> future: parallel_for_each: add disable_failure_guard for parallel_for_each_state
> Merge "Improve file API noexcept correctness" from Rafael
> util: Add a with_allocation_failures helper
> future: Fix indentation
> future: Refactor duplicated try/catch
> future: Make set_to_current_exception public
> future: Add noexcept to continuation related functions
> core: mark timer cancellation functions as noexcept
> future: Simplify future::schedule
> test: add a case for overwriting exact routes
> http: throw on duplicated routes to prevent memory leaks
> metrics: Remove the type label
> fstream: turn file_data_source_impl's memory corruption bugs into aborts
> doc: update tutorial splitting script
> reactor_backend: let the reactor know again if any work was done by aio backend
Merged pull request https://github.com/scylladb/scylla/pull/6969 by
Calle Wilund:
Fixes#6942Fixes#6926Fixes#6933
We use clustering [lo:hi) range for iterator query.
To avoid encoding inclusive/exclusive range (depending on
init/last get_records call), instead just increment
the timeuuid threshold.
Also, dynamo result always contains a "records" entry. Include one for us as well.
Also, if old (or new) image for a change set is empty, dynamo will not include
this key at all. Alternator did return an empty object. This changes it to be excluded
on empty.
alternator::streams: Don't include empty new/old image
alternator::streams: Always include "Records" array in get_records reponse
alternator::streams: Incr shard iterator threshold in get_records
"
With this patches a monitor is destroyed before the writer, which
simplifies the writer destructor.
"
* 'espindola/simplify-write-monitor-v2' of https://github.com/espindola/scylla:
sstables: Delete write_failed
sstables: Move monitor after writer in compaction_writer
Fixes#6933
If old (or new) image for a change set is empty, dynamo will not
include this key at all. Alternator did return an empty object.
This changes it to be excluded on empty.
Fixes#6942
We use clustering [lo:hi) range for iterator query.
To avoid encoding inclusive/exclusive range (depending on
init/last get_records call), instead just increment
the timeuuid threshold.
Fixes#6866
If we try to create/alter an Alternator table to include streams,
we must check that the cluster does in fact support CDC
(experimental still). If not, throw a hopefully somewhat descriptive
error.
(Normal CQL table create goes through a similar check in cql_prop_defs)
Note: no other operations are prohibited. The cluster could have had CDC
enabled before, so streams could exist to list and even read.
Any tables loaded from schema tables should be reposnsible for their
own validation.
Refs #6864
When booting a clean scylla, CDC stream ID:s will not be availble until
a n*ring delay time period has passed. Before this, writing to a CDC
enabled table will fail hard.
For alternator (and its tests), we can report the stream(s) for tables as not yet
available (ENABLING) until such time as id:s are
computed.
v2:
* Keep storage service ref in executor
"
This is inspired by #6781. The idea is to make Scylla listen for CQL connections on port 9042 (where both old shard-aware and shard-unaware clients can still connect the traditional way). On top of that I added a new port, where everything works the same way, only the port from client's socket used to determine the shard No. to connect to. Desired shard No. is the result of `clientside_port % num_shards`.
The new port is configurable from scylla.yaml and defaults to 19042 (unencrypted, unless user configures encryption options and omits `native_shard_aware_transport_port_ssl` in DB config).
Two "SUPPORTED" tags are added: "SCYLLA_SHARD_AWARE_PORT" and "SCYLLA_SHARD_AWARE_PORT_SSL". For compatibility, "SCYLLA_SHARDING_ALGORITHM" is still kept.
Fixes#5239
"
* jul-stas-shard-aware-listener:
docs: Info about shard-aware listeners in protocol-extensions
transport: Added listener with port-based load balancing
Currently, we cannot select more than 2^32 rows from a table because we are limited by types of
variables containing the numbers of rows. This patch changes these types and sets new limits.
The new limits take effect while selecting all rows from a table - custom limits of rows in a result
stay the same (2^32-1).
In classes which are being serialized and used in messaging, in order to be able to process queries
originating from older nodes, the top 32 bits of new integers are optional and stay at the end
of the class - if they're absent we assume they equal 0.
The backward compatibility was tested by querying an older node for a paged selection, using the
received paging_state with the same select statement on an upgraded node, and comparing the returned
rows with the result generated for the same query by the older node, additionally checking if the
paging_state returned by the upgraded node contained new fields with correct values. Also verified
if the older node simply ignores the top 32 bits of the remaining rows number when handling a query
with a paging_state originating from an upgraded node by generating and sending such a query to
an older node and checking the paging_state in the reply(using python driver).
Fixes#5101.
Fixes#6341
Since scylla no longer supports upgrading from a version without the
"new" (dedicated) truncation record table, we can remove support for these
and the migtration thereof.
Make sure the above holds whereever this is committed.
Note that this does not remove the "truncated_at" field in
system.local.
Merged pull request https://github.com/scylladb/scylla/pull/6914
by By Juliusz Stasiewicz:
The goal is to have finer control over CDC "delta" rows, i.e.:
disable them totally (mode off);
record only base PK+CK columns (mode keys);
make them behave as usual (mode full, default).
The editing of log rows is performed at the stage of finishing CDC mutation.
Fixes#6838
tests: Added CQL test for `delta mode`
cdc: Implementations of `delta_mode::off/keys`
cdc: Infrastructure for controlling `delta_mode`
The rule for THE REST results in each person listed in it
to receive notifications about every single pull request,
which can easily lead to inbox overload - the generic
rule is therefore dropped and authors of pull requests
are expected to manually add reviewers. GitHub offers
semi-random suggestions for reviewers anyway.
Message-Id: <3c0f7a2f13c098438a8abf998ec56b74db87c733.1596450426.git.sarna@scylladb.com>
This reverts commit b97f466438.
It turns out that the schema mechanism has a lot of nuances,
after this change, for unknown reason, it was empirically
proven that the amount of cross shard on an upgraded node was
increased significantly with a steady stress traffic, if
was so significant that the node appeared unavailable to
the coordinators because all of the requests started to fail
on smp_srvice_group semaphore.
This revert will bring back a caveat in Scylla, the caveat is
that creating a table in a mixed cluster **might** under certain
condition cause schema mismatch on the newly created table, this
make the table essentially unusable until the whole cluster has
a uniform version (rolling upgrade or rollback completion).
Fixes#6893.
We need only one of the shards owning each ssatble to call create_links.
This will allow us to simplify it and only handle crash/replay scenarios rather than rename/link/remove races.
Fixes#1622
Test: unit(dev), database_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200803065505.42100-3-bhalevy@scylladb.com>
The "outdir" variable in configure.py and "$builddir" in build.ninja
file specifies the build directory. Let's use them to eliminate
hard-coded "build" paths from configure.py.
Message-Id: <20200731105113.388073-1-penberg@scylladb.com>
Alternator Streams have a "alternator_streams_time_window_s" parameter which
is used to allow for correct ordering in the stream in the face of clock
differences between Scylla nodes and possibly network delays. This parameter
currently defaults to 10 seconds, and there is a discussion on issue #6929
on whether it is perhaps too high. But in any case, for tests running on a
single node there is no reason not to set this parameter to zero.
Setting this parameter to zero greatly speeds up the Alternator Streams
tests which use ReadRecords to read from the stream. Previously each such
test took at least 10 seconds, because the data was only readable after a
10 second delay. With alternator_streams_time_window_s=0, these tests can
finish in less than a second. Unfortunately they are still relatively slow
because our Streams implementation has 512 shards, and thus we need over a
thousand (!) API calls to read from the stream).
Running "test/alternator/run test_streams.py" with 25 tests took before
this patch 114 seconds, after this patch, it is down to 18 seconds.
Refs #6929
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Reviewed-by: Calle Wilund <calle@scylladb.com>
Message-Id: <20200728184612.1253178-1-nyh@scylladb.com>
"
While working on another patch I was getting odd compiler errors
saying that a call to ::make_shared was ambiguous. The reason was that
seastar has both:
template <typename T, typename... A>
shared_ptr<T> make_shared(A&&... a);
template <typename T>
shared_ptr<T> make_shared(T&& a);
The second variant doesn't exist in std::make_shared.
This series drops the dependency in scylla, so that a future change
can make seastar::make_shared a bit more like std::make_shared.
"
* 'espindola/make_shared' of https://github.com/espindola/scylla:
Everywhere: Explicitly instantiate make_lw_shared
Everywhere: Add a make_shared_schema helper
Everywhere: Explicitly instantiate make_shared
cql3: Add a create_multi_column_relation helper
main: Return a shared_ptr from defer_verbose_shutdown
"
To mount RAID volume correctly (#6876), we need to wait for MDRAID initialization.
To do so we need to add After=mdmonitor.service on var-lib-scylla.mount.
Also, `lsblk -n -oPARTTYPE {dev}` does not work for CentOS7, since older lsblk does not supported PARTTYPE column (#6954).
We need to provide relocatable lsblk and run it on out() / run() function instead of distribution provided version.
"
* syuu1228-scylla_raid_setup_mount_correctly_beyond_reboot:
scylla_raid_setup: initialize MDRAID before mounting data volume
create-relocatable-package.py: add lsblk for relocatable CLI tools
scylla_util.py: always use relocatable CLI tools
The constructors of these global variables can allocate memory. Since
the variables are thread_local, they are initialized at first use.
There is nothing we can do if these allocations fail, so use
disable_failure_guard.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200729184901.205646-1-espindola@scylladb.com>
The row_cache::_partitions type is nowadays a double_decker which is B+tree of
intrusive_arrays of cache_entrys, so scylla cache command will raise an error
being unable to parse this new data type.
The respective iterator for double decker starts on the tree and walks the list
of leaf nodes, on each node it walks the plain array of data nodes, then on each
data node it walks the intrusive array of cache_entrys yielding them to the
caller.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200730145851.8819-1-xemul@scylladb.com>
The new port is configurable from scylla.yaml and defaults to 19042
(unencrypted, unless client configures encryption options and omits
`native_shard_aware_transport_port_ssl`).
Two "SUPPORTED" tags are added: "SCYLLA_SHARD_AWARE_PORT" and
"SCYLLA_SHARD_AWARE_PORT_SSL". For compatibility,
"SCYLLA_SHARDING_ALGORITHM" is still kept.
Fixes#5239
var-lib-scylla.mount should wait for MDRAID initilization, so we need to add
'After=mdmonitor.service'.
However, currently mdmonitor.service fails to start due to no mail address
specified, we need to add the entry on mdadm.conf.
Fixes#6876
On some CLI tools, command options may different between latest version
vs older version.
To maximize compatibility of setup scripts, we should always use
relocatable CLI tools instead of distribution version of the tool.
Related #6954
The .gitignore entry for testlog/ directory is generalized
from "testlog/*" to "testlog", in order to please everyone
who potentially wants test logs to use ramfs by symlinking
testlog to /tmp. Without the change, the symlink remains
visible in `git status`.
Message-Id: <e600f5954868aea7031beb02b1d8e12a2ff869e2.1596115302.git.sarna@scylladb.com>
Replace the MAINTAINERS file with a CODEOWNERS file, which Github is
able to parse, and suggest reviewers for pull requests.
* penberg-penberg/codeowners:
Replace MAINTAINERS with CODEOWNERS
Update MAINTAINERS
The tests in this patch, which pass on DynamoDB but fail on Alternator,
reproduce a bug described in issue #6951. This bug makes it impossible for
a Query (or Scan) to filter on an attribute if that attribute is not
requested to be included in the output.
This patch includes two xfailing tests of this type: One testing a
combination of FilterExpression and ProjectionExpression, and the second
testing a combination of QueryFilter and AttributesToGet; These two
pairs are, respectively, DynamoDB's newer and older syntaxes to achieve
the same thing.
Additionally, we add two xfailing tests that demonstrates that combining
old and new style syntax (e.g., FilterExpression with AttributesToGet)
should not have been allowed (DynamoDB doesn't allow such combinations),
but Alternator currently accepts these combinations.
Refs #6951
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200729210346.1308461-1-nyh@scylladb.com>
The mutation object may be freed prematurely during commitlog replay
in the schema upgrading path. We will hit the problem if the memtable
is full and apply_in_memory() needs to defer.
This will typically manifest as a segfault.
Fixes#6953
Introduced in 79935df
Tests:
- manual using scylla binary. Reproduced the problem then verified the fix makes it go away
Message-Id: <1596044010-27296-1-git-send-email-tgrabiec@scylladb.com>
The test/alternator/run script follows the pytest log with a full log of
Scylla. This saved log can be useful in diagnosing problems, but most of
it is filled with non-useful "INFO"-level messages. The two biggest
offenders are compaction - which logs every single compaction happening,
and the migration manager, which is just a second (and very long) message
about schema change operations (e.g., table creations). Neither of these
are interesting for Alternator's tests, which shouldn't care exactly when
compaction of which sstable is happening. These two components alone
are reponsible for 80% of the log lines, and 90% of the log bytes!
In this patch we increase the log level of just these two components -
compaction and migration_manager - to WARN, which reduces the log
by the same percentages (80% by lines, 90% by bytes).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200728191420.1254961-1-nyh@scylladb.com>
On CentOS7, systemd does not support percentage-based parameter.
To apply memory parameter on CentOS7, we need to override the parameter
in bytes, instead of percentage.
Fixes#6783
"
Non-paged queries completely ignore the query result size limiter
mechanism. They consume all the memory they want. With sufficiently
large datasets this can easily lead to a handful or even a single
unpaged query producing an OOM.
This series continues the work started by 134d5a5f7, by introducing a
configurable pair of soft/hard limit (default to 1MB/100MB) that is
applied to otherwise unlimited queries, like reverse and unpaged ones.
When an unlimited query reaches the soft limit a warning is logged. This
should give users some heads-up to adjust their application. When the
hard limit is reached the query is aborted. The idea is to not greet
users with failing queries after an upgrade while at the same time
protect the database from the really bad queries. The hard limit should
be decreased from time to time gradually approaching the desired goal of
1MB.
We don't want to limit internal queries, we trust ourselves to either
use another form of memory usage control, or read only small datasets.
So the limit is selected according to the query class. User reads use
the `max_memory_for_unlimited_query_{soft,hard}_limit` configuration
items, while internal reads are not limited. The limit is obtained by
the coordinator, who passes it down to replicas using the existing
`max_result_size` parameter (which is not a special type containing the
two limits), which is now passed on every verb, instead of once per
connection. This ensures that all replicas work with the same limits.
For normal paged queries `max_result_size` is set to the usual
`query::result_memory_limiter::maximum_result_size` For queries that can
consume unlimited amount of memory -- unpaged and reverse queries --
this is set to the value of the aforementioned
`max_memory_for_unlimited_query_{soft,hard}_limit` configuration item,
but only for user reads, internal reads are not limited.
This has the side-effect that reverse reads now send entire
partitions in a single page, but this is not that bad. The data was
already read, and its size was below the limit, the replica might as well
send it all.
Fixes: #5870
"
* 'nonpaged-query-limit/v5' of https://github.com/denesb/scylla: (26 commits)
test: database_test: add test for enforced max result limit
mutation_partition: abort read when hard limit is exceeded for non-paged reads
query-result.hh: move the definition of short_read to the top
test: cql_test_env: set the max_memory_unlimited_query_{soft,hard}_limit
test: set the allow_short_read slice option for paged queries
partition_slice_builder: add with_option()
result_memory_accounter: remove default constructor
query_*(): use the coordinator specified memory limit for unlimited queries
storage_proxy: use read_command::max_result_size to pass max result size around
query: result_memory_limiter: use the new max_result_size type
query: read_command: add max_result_size
query: read_command: use tagged ints for limit ctor params
query: read_command: add separate convenience constructor
service: query_pager: set the allow_short_read flag
result_memory_accounter: check(): use _maximum_result_size instead of hardcoded limit
storage_proxy: add get_max_result_size()
result_memory_limiter: add unlimited_result_size constant
database: add get_statement_scheduling_group()
database: query_mutations(): obtain the memory accounter inside
query: query_class_config: use max_result_size for the max_memory_for_unlimited_query field
...
* tools/java a9480f3a87...aa7898d771 (4):
> dist: debian: do not require root during package build
> cassandra-stress: Add serial consistency options
> dist: debian: fix detection of debuild
> bin tools: Use non-default `cassandra.config`
* tools/jmx c0d9d0f...626fd75 (1):
> dist: debian: do not require root during package build
Fixes#6655.
If the read is not paged (short read is not allowed) abort the query if
the hard memory limit is reached. On reaching the soft memory limit a
warning is logged. This should allow users to adjust their application
code while at the same time protecting the database from the really bad
queries.
The enforcement happens inside the memory accounter and doesn't require
cooperation from the result builders. This ensures memory limit set for
the query is respected for all kind of reads. Previously non-paged reads
simply ignored the memory accounter requesting the read to stop and
consumed all the memory they wanted.
This is a 4.2% reduction in the scylla text size, from 38975956 to
37404404 bytes.
When benchmarking perf_simple_query without --shuffle-sections, there
is no performance difference.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200724032504.3004-1-espindola@scylladb.com>
Some tests use the lower level methods directly and meant to use paging
but didn't and nobody noticed. This was revealed by the enforcement of
max result size (introduced in a later patch), which caused these tests
to fail due to exceeding the max result size.
This patch fixes this by setting the `allow_short_reads` slice option.
If somebody wants to bypass proper memory accounting they should at
the very least be forced to consider if that is indeed wise and think a
second about the limit they want to apply.
It is important that all replicas participating in a read use the same
memory limits to avoid artificial differences due to different amount of
results. The coordinator now passes down its own memory limit for reads,
in the form of max_result_size (or max_size). For unpaged or reverse
queries this has to be used now instead of the locally set
max_memory_unlimited_query configuration item.
To avoid the replicas accidentally using the local limit contained in
the `query_class_config` returned from
`database::make_query_class_config()`, we refactor the latter into
`database::get_reader_concurrency_semaphore()`. Most of its callers were
only interested in the semaphore only anyway and those that were
interested in the limit as well should get it from the coordinator
instead, so this refactoring is a win-win.
Use the recently added `max_result_size` field of `query::read_command`
to pass the max result size around, including passing it to remote
nodes. This means that the max result size will be sent along each read,
instead of once per connection.
As we want to select the appropriate `max_result_size` based on the type
of the query as well as based on the query class (user or internal) the
previous method won't do anymore. If the remote doesn't fill this
field, the old per-connection value is used.
This field will replace max size which is currently passed once per
established rpc connection via the CLIENT_ID verb and stored as an
auxiliary value on the client_info. For now it is unused, but we update
all sites creating a read command to pass the correct value to it. In the
next patch we will phase out the old max size and use this field to pass
max size on each verb instead.
The convenience constructor of read_command now has two integer
parameter next to each other. In the next patch we intend to add another
one. This is recipe for disaster, so to avoid mistakes this patch
converts these parameters to tagged integers. This makes sure callers
pass what they meant to pass. As a matter of fact, while fixing up
call-sites, I already found several ones passing `query::max_partitions`
to the `row_limit` parameter. No harm done yet, as
`query::max_partitions` == `query::max_rows` but this shows just how
easy it is to mix up parameters with the same type.
query::read_command currently has a single constructor, which serves
both as an idl constructor (order of parameters is fixed) and a convenience one
(most parameters have default values). This makes it very error prone to
add new parameters, that everyone should fill. The new parameter has to
be added as last, with a default value, as the previous ones have a
default value as well. This means the compiler's help cannot be enlisted
to make sure all usages are updated.
This patch adds a separate convenience constructor to be used by normal
code. The idl constructor looses all default parameters. New parameters
can be added to any position in the convenience constructor (to force
users to fill in a meaningful value) while the removed default
parameters from the idl constructor means code cannot accidentally use
it without noticing.
All callers should set this already before passing the slice to the
pager, however not all actually do (e.g.
`cql3::indexed_table_select_statement::read_posting_list()`). Instead of
auditing each call site, just make sure this is set in the pager
itself. If someone is creating a pager we can be sure they mean to use
paging.
The use of the global `result_memory_limiter::maximum_result_size` is
probably a leftover from before the `_maximum_result_size` member was
introduced (aa083d3d85).
Meant to be used by the coordinator node to obtain the max result size
applicable to the query-class (determined based on the current
scheduling group). For normal paged queries the previously used
`query::result_memory_limiter::maximum_result_size` is used uniformly.
For reverse and unpaged queries, a query class dependent value is used.
For user reads, the value of the
`max_memory_for_unlimited_query_{soft,hard}_limit` configuration items
is used, for other classes no limit is used
(`query::result_memory_limiter::unlimited_result_size`).
We want to switch from using a single limit to a dual soft/hard limit.
As a first step we switch the limit field of `query_class_config` to use
the recently introduced type for this. As this field has a single user
at the moment -- reverse queries (and not a lot of propagation) -- we
update it in this same patch to use the soft/hard limit: warn on
reaching the soft limit and abort on the hard limit (the previous
behaviour).
This pair of limits replace the old max_memory_for_unlimited_query one,
which remains as an alias to the hard limit. The soft limit inherits the
previous value of the limit (1MB), when this limit is reached a warning
will be logged allowing the users to adjust their client codes without
downtime. The hard limit starts out with a more permissive default of
100MB. When this is reached queries are aborted, the same behaviour as
with the previous single limit.
The idea is to allow clients a grace period for fixing their code, while
at the same time protecting the database from the really bad queries.
Now that there are no ad-hoc aliases needing to overwrite the name and
description parameter of this method, we can drop these and have each
config item just use `name()` and `desc()` to access these.
We already uses aliases for some configuration items, although these are
created with an ad-hoc mechanism that only registers them on the command
line. Replace this with the built-in alias mechanism in the previous
patch, which has the benefit of conflict resolution and also working
with YAML.
Allow configuration items to also have an alias, besides the name.
This allows easy replacement of configuration items, with newer names,
while still supporting the old name for backward compatibility.
The alias mechanism takes care of registering both the name and the
alias as command line arguments, as well as parsing them from YAML.
The command line documentation of the alias will just refer to the name
for documentation.
This procedure consists of trimming SSTables off a compaction job until its weight[1]
is smaller than one already taken by a running compaction. Min threshold is respected
though, we only trim a job while its size is > min threshold.
[1]: this value is a logarithimic function of the total size of the SSTables in a
given job, and it's used to control the compaction parallelism.
It's intended to improve the compaction efficiency by allowing more jobs to run in
parallel, but it turns out that this can have an opposite effect because the write
amplification can be significantly increased.
Take STCS for example, the more similar-sized SSTables you compact together, the
higher the compaction efficiency will be. With the trimming procedure, we're aiming
at running smaller jobs, thinking that running more parallel compactions will provide
us with better performance, but that's not true. Most of the efficiency comes from
making informed decisions when selecting candidates for compaction.
Similarly, this will also hurt TWCS, which does STCS in current window, and a sort
of major compaction when the current window closes. If the TWCS jobs are trimmed,
we'll likely need another compaction to get to the desired state, recompacting
the same data again.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200728143648.31349-1-raphaelsc@scylladb.com>
On GCE, /dev/sda14 reported as unused disk but it's BIOS boot partition,
should not use for scylla data partition, also cannot use for it since it's
too small.
It's better to exclude such partiotion from unsed disk list.
Fixes#6636
We saw scylla hit user after free in repair with the following procedure during tests:
- n1 and n2 in the cluster
- n2 ran decommission
- n2 sent data to n1 using repair
- n2 was killed forcely
- n1 tried to remove repair_meta for n1
- n1 hit use after free on repair_meta object
This was what happened on n1:
1) data was received -> do_apply_rows was called -> yield before create_writer() was called
2) repair_meta::stop() was called -> wait_for_writer_done() / do_wait_for_writer_done was called
with _writer_done[node_idx] not engaged
3) step 1 resumed, create_writer() was called and _repair_writer object was referenced
4) repair_meta::stop() finished, repair_meta object and its member _repair_writer was destroyed
5) The fiber created by create_writer() at step 3 hit use after free on _repair_writer object
To fix, we should call wait_for_writer_done() after any pending
operations were done which were protected by repair_meta::_gate. This
prevents wait for writer done finishes before the writer is in the
process of being created.
Fixes: #6853Fixes: #6868
Backports: 4.0, 4.1, 4.2
The log "removing pending replacing node" is printed whenever a node
jumps to normal status including a normal restart. For example, on
node1, we saw the following when node2 restarts.
[shard 0] storage_service - Node 127.0.0.2 state jump to normal
[shard 0] storage_service - Remove node 127.0.0.2 from pending replacing endpoint
This is confusing since no node is really being replaced.
To fix, log only if a node is really removed from the pending replacing
nodes.
In addition, since do_remove_node will call del_replacing_endpoint,
there is no need to call del_replacing_endpoint again in
storage_service::handle_state_normal after do_remove_node.
Fixes#6936
Currently, encountering an error when loading view build progress
would result in view builder refusing to start - which also means
that future views would not be built until the server restarts.
A more user-friendly solution would be to log an error message,
but continue to boot the view builder as if no views are currently
in progress, which would at least allow future views to be built
correctly.
The test case is also amended, since now it expects the call
to return that "no view builds are in progress" instead of
an exception.
Fixes#6934
Tests: unit(dev)
Message-Id: <9f26de941d10e6654883a919fd43426066cee89c.1595922374.git.sarna@scylladb.com>
In untyped_result_set::get_view, there exists a silent assumption
that the underlying data, which is an optional, to always be engaged.
In case the value happens to be disengaged it may lead to creating
an incorrect bytes view from a disengaged optional.
In order to make the code safer (since values parsed by this code
often come from the network and can contain virtually anything)
a segfault is replaced with an exception, by calling optional's
value() function, which throws when called on disengaged optionals.
Fixes#6915
Tests: unit(dev)
Message-Id: <6e9e4ca67e0e17c17b718ab454c3130c867684e2.1595834092.git.sarna@scylladb.com>
Turns out the fix f591c9c710 wasn't enough to make sure all input streams
are properly closed on failure.
It only closes the main input stream that belongs to context, but it misses
all the input streams that can be opened in the consumer for promote index
reading. Consumer stores a list of indexes, where each of them has its own
input stream. On failure, we need to make sure that every single one of
them is properly closed before destroying the indexes as that could cause
memory corruption due to read ahead.
Fixes#6924.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200727182214.377140-1-raphaelsc@scylladb.com>
With this the monitor is destroyed first. It makes intuitive sense to
me to destroy a monitor_X before X. This is also the order we had
before 55a8b6e3c9.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The goal is to have finer control over CDC "delta" rows, i.e.:
- disable them totally (mode `off`);
- record only PK+CK (mode `keys`);
- make them behave as usual (mode `full`, default).
This commit adds the necessary infrastructure to `cdc_options`.
In commit 8d27e1b, we added tracing (see docs/tracing.md) support to
Alternator requests. However, we never had a functional test that
verifies this feature actually works as expected, and we recently
noticed that for the GetItem and BatchGetItem requestd, the trace
doesn't really work (it returns an empty list of events).
So this patch adds a test, test/alternator/test_tracing.py, which verifies
that the tracing feature works for the PutItem, GetItem, DeleteItem,
UpdateItem, BatchGetItem, BatchWriteItem, Query and Scan operations.
This test is very peculiar. It needs to use out-of-band REST API
requests to enable and disable tracing (of course, the test is skipped
when running against AWS - this is a Scylla-only feature). It also needs
to read CQL-only system tables and does this using Alternator's
".scylla.alternator" interface for system tables - which came through
for us here beautifully and demonstrated their usefulness.
I paid a lot of attention for this test to remain reasonably fast -
this entire test now runs in a little less than one second. Achieving
this while testing eight different requests was a bit of a challenge,
because traces take time until they are visible in the trace table.
This is the main reason why in this patch the test for all eight
request types are done in one test, instead of eight separate tests.
Fixes#6891
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200727115401.1199024-1-nyh@scylladb.com>
"
0c6bbc8 refactored `get_rpc_client_idx()` to select different clients
for statement verbs depending on the current scheduling group.
The goal was to allow statement verbs to be sent on different
connections depending on the current scheduling group. The new
connections use per-connection isolation. For backward compatibility the
already existing connections fall-back to per-handler isolation used
previously. The old statement connection, called the default statement
connection, also used this. `get_rpc_client_idx()` was changed to select
the default statement connection when the current scheduling group is
the statement group, and a non-default connection otherwise.
This inadvertently broke `scheduling_group_for_verb()` which also used
this method to get the scheduling group to be used to isolate a verb at
handle register time. This method needs the default client idx for each
verb, but if verb registering is run under the system group it instead
got the non-default one, resulting in the per-handler isolation not
being set-up for the default statement connection, resulting in default
statement verb handlers running in whatever scheduling group the process
loop of the rpc is running in, which is the system scheduling group.
This caused all sorts of problems, even beyond user queries running in
the system group. Also as of 0c6bbc8 queries on the replicas are
classified based on the scheduling group they are running on, so user
reads also ended up using the system concurrency semaphore.
In particular this caused severe problems with ranges scans, which in
some cases ended up using different semaphores per page resulting in a
crash. This could happen because when the page was read locally the code
would run in the statement scheduling group, but when the request
arrived from a remote coordinator via rpc, it was read in a system
scheduling group. This caused a mismatch between the semaphore the saved
reader was created with and the one the new page was read with. The
result was that in some cases when looking up a paused reader from the
wrong semaphore, a reader belonging to another read was returned,
creating a disconnect between the lifecycle between readers and that of
the slice and range they were referencing.
This series fixes the underlying problem of the scheduling group
influencing the verb handler registration, as well as adding some
additional defenses if this semaphore mismatch ever happens in the
future. Inactive read handles are now unique across all semaphores,
meaning that it is not possible anymore that a handle succeeds in
looking up a reader when used with the wrong semaphore. The range scan
algorithm now also makes sure there is no semaphore mismatch between the
one used for the current page and that of the saved reader from the
previous page.
I manually checked that each individual defense added is already
preventing the crash from happening.
Fixes: #6613Fixes: #6907Fixes: #6908
Tests: unit(dev), manual(run the crash reproducer, observe no crash)
"
* 'query-classification-regressions/v1' of https://github.com/denesb/scylla:
multishard_mutation_query: use cached semaphore
messaging: make verb handler registering independent of current scheduling group
multishard_mutation_query: validate the semaphore of the looked-up reader
reader_concurrency_semaphore: make inactive read handles unique across semaphores
reader_concurrency_semaphore: add name() accessor
reader_concurrency_semaphore: allow passing name to no-limit constructor
Issue #6919 was caused by an incorrect assumption: I *assumed* that we see
the tracing session record, we can be sure that the event records for this
session had already been written. In this patch we add a paragraph to
the tracing documentation - docs/tracing.md, which explains that this
assumption is in fact incorrect:
1. On a multi-node setup, replicas may continue to write tracing events
after the coordinator "finished" (moved to background) the request
and wrote the session record.
2. Even on a single-node setup, the writes of the session record and the
individual events are asynchronous, and can happen in an unexpected
order (which is what happened in issue #6919).
Refs #6919.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200727102438.1194314-1-nyh@scylladb.com>
On 2d63acdd6a we replaced 'ol' and 'amzn'
to 'oracle' and 'amazon', but distro.id() actually returns 'amzn' for
Amazon Linux 2, so we need to revert the change.
Fixes#6882
Merged patch set by Botond Dénes:
The view update generation process creates two readers. One is used to
read the staging sstables, the data which needs view updates to be
generated for, and another reader for each processed mutation, which
reads the current value (pre-image) of each row in said mutation. The
staging reader is created first and is kept alive until all staging data
is processed. The pre-image reader is created separately for each
processed mutation. The staging reader is not restricted, meaning it
does not wait for admission on the relevant reader concurrency
semaphore, but it does register its resource usage on it. The pre-image
reader however *is* restricted. This creates a situation, where the
staging reader possibly consumes all resources from the semaphore,
leaving none for the later created pre-image reader, which will not be
able to start reading. This will block the view building process meaning
that the staging reader will not be destroyed, causing a deadlock.
This patch solves this by making the staging reader restricted and
making it evictable. To prevent thrashing -- evicting the staging reader
after reading only a really small partition -- we only make the staging
reader evictable after we have read at least 1MB worth of data from it.
test/boost: view_build_test: add test_view_update_generator_buffering
test/boost: view_build_test: add test test_view_update_generator_deadlock
reader_permit: reader_resources: add operator- and operator+
reader_concurrency_semaphore: add initial_resources()
test: cql_test_env: allow overriding database_config
mutation_reader: expose new_reader_base_cost
db/view: view_updating_consumer: allow passing custom update pusher
db/view: view_update_generator: make staging reader evictable
db/view: view_updating_consumer: move implementation from table.cc to view.cc
database: add make_restricted_range_sstable_reader()
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
---
db/view/view_updating_consumer.hh | 51 ++++++++++++++++++++++++++++---
db/view/view.cc | 39 +++++++++++++++++------
db/view/view_update_generator.cc | 19 +++++++++---
3 files changed, 91 insertions(+), 18 deletions(-)
In some cases estimated number of partitions can be 0, which is albeit a
legit estimation result, breaks many low-level sstable writer code, so
some of these have assertions to ensure estimated partitions is > 0.
To avoid hitting this assert all users of the sstable writers do the
clamping, to ensure estimated partitions is at least 1. However leaving
this to the callers is error prone as #6913 has shown it. As this
clamping is standard practice, it is better to do it in the writers
themselves, avoiding this problem altogether. This is exactly what this
patch does. It also adds two unit tests, one that reproduces the crash
in #6913, and another one that ensures all sstable writers are fine with
estimated partitions being 0 now. Call sites previously doing the
clamping are changed to not do it, it is unnecessary now as the writer
does it itself.
Fixes#6913
Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200724120227.267184-1-bdenes@scylladb.com>
"
This makes sure that monitors are always owned by the same struct that
owns the monitored writer, simplifying the lifetime management.
This hopefully fixes some of the crashes we have observed around this
area.
"
* 'espindola/use-compaction_writer-v6' of https://github.com/espindola/scylla:
sstables: Rename _writer to _compaction_writer
sstables: Move compaction_write_monitor to compaction_writer
sstables: Add couple of writer() getters to garbage_collected_sstable_writer
sstables: Move compaction_write_monitor earlier in the file
A DELETE statement checks that the deletion range is symmetrically
bounded. This check was broken for expression TRUE.
Test the fix by setting initial_key_restrictions::expression to TRUE,
since CQL doesn't currently allow WHERE TRUE. That change has been
proposed anyway in feedback to #5763:
https://github.com/scylladb/scylla/pull/5763#discussion_r443213343
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
0c6bbc8 refactored `get_rpc_client_idx()` to select different clients
for statement verbs depending on the current scheduling group.
The goal was to allow statement verbs to be sent on different
connections depending on the current scheduling group. The new
connections use per-connection isolation. For backward compatibility the
already existing connections fall-back to per-handler isolation used
previously. The old statement connection, called the default statement
connection, also used this. `get_rpc_client_idx()` was changed to select
the default statement connection when the current scheduling group is
the statement group, and a non-default connection otherwise.
This inadvertently broke `scheduling_group_for_verb()` which also used
this method to get the scheduling group to be used to isolate a verb at
handle register time. This method needs the default client idx for each
verb, but if verb registering is run under the system group it instead
got the non-default one, resulting in the per-handler isolation not
being set-up for the default statement connection, resulting in default
statement verb handlers running in whatever scheduling group the process
loop of the rpc is running in, which is the system scheduling group.
This caused all sorts of problems, even beyond user queries running in
the system group. Also as of 0c6bbc8 queries on the replicas are
classified based on the scheduling group they are running on, so user
reads also ended up using the system concurrency semaphore.
The new verb is used to replace the current gossip shadow round
implementation. Current shadow round implementation reuses the gossip
syn and ack async message, which has plenty of drawbacks. It is hard to
tell if the syn messages to a specific peer node has responded. The
delayed responses from shadow round can apply to the normal gossip
states even if the shadow round is done. The syn and ack message
handler are full special cases due to the shadow round. All gossip
application states including the one that are not relevant are sent
back. The gossip application states are applied and the gossip
listeners are called as if is in the normal gossip operation. It is
completely unnecessary to call the gossip listeners in the shadow round.
This patch introduces a new verb to request the exact gossip application
states the shadow round needed with a synchronous verb and applies the
application states without calling the gossip listeners. This patch
makes the shadow round easier to reason about, more robust and
efficient.
Refs: #6845
Tests: update_cluster_layout_tests.py
"
`close_on_failure` was committed to seastar so use
the library version.
This requires making the lambda function passed to
it nothrow move constructible, so this series also
makes db::commitlog::descriptor move constructor noexcept
and changes allocate_segment_ex and segment::segment
to get a descriptor by value rather than by reference.
Test: unit(dev), commitlog_test(debug)
"
* tag 'commit-log-use-with_file_close_on_failure-v1' of github.com:bhalevy/scylla:
commitlog: use seastar::with_file_close_on_failure
commitlog: descriptor: make nothrow move constructible
commitlog: allocate_segment_ex, segment: pass descriptor by value
commitlog: allocate_segment_ex: filename capture is unused
In another patch I noticed gcc producing dead functions. I am not sure
why gcc is doing that. Some of those functions are already placed in
independent sections, and so can be garbage collected by the linker.
This is a 1% text section reduction in scylla, from 39363380 to
38974324 bytes. There is no difference in the tps reported by
perf_simple_query.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200723152511.8214-1-espindola@scylladb.com>
In the patch "Add exception overloads for Dynamo types", Alternator's single
api_error exception type was replaced by a more complex hierarchy of types.
The implementation was not only longer and more complex to understand -
I believe it also negated an important observation:
The "api_error" exception type is special. It is not an exception created
by code for other code. It is not meant to be caught in Alternator code.
Instead, it is supposed to contain an error message created for the *user*,
containing one of the few supported exception exception "names" described
in the DynamoDB documentation, and a user-readable text message. Throwing
such an exception in Alternator code means the thrower wants the request
to abort immediately, and this message to reach the user. These exceptions
are not designed to be caught in Alternator code. Code should use other
exceptions - or alternatives to exceptions (e.g., std::optional) for
problems that should be handled before returning a different error to the
user. Moreover, "api_error" isn't just thrown as an exception - it can
also be returned-by-value in a executor::request_return_type) - which is
another reason why it should not be subclassed.
For these reasons, I believe we should have a single api_error type, and
it's wrong to subclass it. So in this patch I am reverting the subclasses
and template added in the aforementioned patch.
Still, one correct observation made in that patch was that it is
inconvenient to type in DynamoDB exception names (no help from the editor
in completing those strings) and also error-prone. In this patch we
propse a different - simpler - solution to the same problem:
We add trivial factory functions, e.g., api_error::validation(std::string)
as a shortcut to api_error("ValidationException"). The new implementation
is easy to understand, and also more self explanatory to readers:
It is now clear that "api_error::validation()" is actually a user-visible
"api_error", something which was obscured by the name validation_exception()
used before this patch.
Finally, this patch also improves the comment in error.hh explaining the
purpose of api_error and the fact it can be returned or thrown. The fact
it should not be subclassed is legislated with a "finally". There is also
no point of this class inheriting from std::exception or having virtual
functions, or an empty constructor - so all these are dropped as well.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
* 'api-error-refactor' of https://github.com/nyh/scylla:
alternator: use api_error factory functions in auth.cc
alternator: use api_error::validation()
alternator: use api_error factory functions in executor.cc
alternator: use api_error factory functions in server.cc
alternator: refactor api_error class
It's better to assert a certain vector size first and only then
dereference its elements - otherwise, if a bug causes the size
to be different, the test can crash with a segfault on an invalid
dereference instead of graciously failing with a test assertion.
Generating bounds from multi-column restrictions used to create
incorrect nonwrapping intervals, which only happened to work because
they're implemented as wrapping intervals underneath.
The following CQL restriction:
WHERE (a, b) >= (1, 0)
should translate to
(a, b) >= (1, 0), no upper bound,
while it incorrectly translates to
(a, b) >= (1, 0) AND (a, b) < empty-prefix.
Since empty prefix is smaller than any other clustering key,
this range was in fact not correct, since the assumption
was that starting bound was never greater than the ending bound.
While the bug does not trigger any errors in tests right now,
it starts to do so after the code is modified in order to
correctly handle empty intervals (intervals with end > start).
To make sure it belongs to the same semaphore that the database thinks
is appropriate for the current query. Since a semaphore mismatch points
to a serious bug, we use `on_internal_error()` to allow generating
coredumps on-demand.
Currently inactive read handles are only unique within the same
semaphore, allowing for an unregister against another semaphore to
potentially succeed. This can lead to disasters ranging from crashes to
data corruption. While a handle should never be used with another
semaphore in the first place, we have recently seen a bug (#6613)
causing exactly that, so in this patch we prevent such unregister
operations from ever succeeding by making handles unique across all
semaphores. This is achieved by adding a pointer to the semaphore to the
handle.
All the places in auth.cc where we constructed an api_error with inline
strings now use api_error factory functions.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
All the places in conditions.cc, expressions.cc and serialization.cc where
we constructed an api_error, we always used the ValidationException type
string, which the code repeated dozens of times.
This patch converts all these places to use the factory function
api_error::validation().
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
All the places in executor.cc where we constructed an api_error with inline
strings now use api_error factory functions. Most of them, but not all of
them, were api_error::validation(). We also needed to add a couple more of
these factory functions.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
All the places in server.cc where we constructed an api_error with inline
strings now use api_error factory functions - we needed to add a few more.
Interestingly, we had a wrong type string for "Internal Server Error",
which we fix in this patch. We wrote the type string like that - with spaces -
because this is how it was listed in the DynamoDB documentation at
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Programming.Errors.html
But this was in fact wrong, and it should be without spaces:
"InternalServerError". The botocore library (for example) recognizes it
this way, and this string can also be seen in other online DynamoDB examples.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In the patch "Add exception overloads for Dynamo types", Alternator's single
api_error exception type was replaced by a more complex hierarchy of types.
The implementation was not only longer and more complex to understand -
I believe it also negated an important observation:
The "api_error" exception type is special. It is not an exception created
by code for other code. It is not meant to be caught in Alternator code.
Instead, it is supposed to contain an error message created for the *user*,
containing one of the few supported exception exception "names" described
in the DynamoDB documentation, and a user-readable text message. Throwing
such an exception in Alternator code means the thrower wants the request
to abort immediately, and this message to reach the user. These exceptions
are not designed to be caught in Alternator code. Code should use other
exceptions - or alternatives to exceptions (e.g., std::optional) for
problems that should be handled before returning a different error to the
user. Moreover, "api_error" isn't just thrown as an exception - it can
also be returned-by-value in a executor::request_return_type) - which is
another reason why it should not be subclassed.
For these reasons, I believe we should have a single api_error type, and
it's wrong to subclass it. So in this patch I am reverting the subclasses
and template added in the aforementioned patch.
Still, one correct observation made in that patch was that it is
inconvenient to type in DynamoDB exception names (no help from the editor
in completing those strings) and also error-prone. In this patch we
propse a different - simpler - solution to the same problem:
We add trivial factory functions, e.g., api_error::validation(std::string)
as a shortcut to api_error("ValidationException"). The new implementation
is easy to understand, and also more self explanatory to readers:
It is now clear that "api_error::validation()" is actually a user-visible
"api_error", something which was obscured by the name validation_exception()
used before this patch.
Finally, this patch also improves the comment in error.hh explaining the
purpose of api_error and the fact it can be returned or thrown. The fact
it should not be subclassed is legislated with a "finally". There is also
no point of this class inheriting from std::exception or having virtual
functions, or an empty constructor - so all these are dropped as well.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
"
There are 5 services, that register their RPC handlers in messaging
service, but quite a few of them unregister them on stop.
Unregistering is somewhat critical, not just because it makes the
code look clean, but also because unregistration does wait for the
message processing to complete, thus avoiding use-after-free's in
the handlers.
In particular, several handlers call service::get_schema_for_write()
which, in turn, may end up in service::maybe_sync() calling for
the local migration manager instance. All those handlers' processing
must be waited for before stopping the migration manager.
The set brings the RPC handlers unregistration in sync with the
registration part.
tests: unit (dev)
dtest (dev: simple_boot_shutdown, repair)
start-stop by hands (dev)
fixes: #6904
"
* 'br-rpc-unregister-verbs' of https://github.com/xemul/scylla:
main: Add missing calls to unregister RPC hanlers
messaging: Add missing per-service unregistering methods
messaging: Add missing handlers unregistration helpers
streaming: Do not use db->invoke_on_all in vain
storage_proxy: Detach rpc unregistration from stop
main: Shorten call to storage_proxy::init_messaging_service
Currently, we talk to a seed node in each gossip round with some
probability, i.e., nr_of_seeds / (nr_of_live_nodes + nr_of_unreachable_nodes)
For example, with 5 seeds in a 50 nodes cluster, the probability is 0.1.
Now that we talk to all live nodes, including the seed nodes, in a
bounded time period. It is not a must to talk to seed node in each
gossip round.
In order to get rid of the seed concept, do not talk to seed node
explicitly in each gossip round.
This patch is a preparatory patch to remove the seed concept in gossip.
Refs: #6845
Tests: update_cluster_layout_tests.py
Currently, we select 10 percent of random live nodes to talk with in
each gossip round. There is no upper bound how long it will take to talk
to all live nodes.
This patch changes the way we select live nodes to talk with as below:
1) Shuffle all the live endpoints randomly
2) Split the live endpoints into 10 groups
3) Talk to one of the groups in each gossip round
4) Go to step 1 to shuffle again after we groups are talked with
We keep both randomness of selecting nodes as before and determinacy to complete
talking to all live nodes.
In addition, the way to favor newly added node is simplified. When a
new live node is added, it is always added to the front of the group, so
it will be talked with in the next gossip round.
This patch is a preparatory patch to remove the seed concept in gossip.
Refs: #6845
Tests: update_cluster_layout_tests.py
Let each submodule be responsible for its own dependencies, and
call the submodule's dependency installation script.
Reviewed-by: Piotr Jastrzebski <piotr@scylladb.com>
Reviewed-by: Takuya ASADA <syuu@scylladb.com>
tools/java and tools/jmx have their own relocatable packages (and rpm/deb),
so they should not be part of the main relocatable package.
Enforce this by enabling the filter parameter in reloc_add, and passing
a filter that excludes tools/java and tools/jmx.
There is one monitor per writer, so we new keep them together in the
compaction_writer struct.
This trivially guarantees that the monitor is always destroyed before
the writer.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The gossiper's and migration_manager's unregistration is done on
the services' stopm, for the rest we need to call the recently
introduced methods.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
5 services register handlers in messaging, but not all of them
have clear unregistration methods.
Summary:
migration_manager: everything is in place, no changes
gossiper: ditto
proxy: some verbs unregistration is missing
repair: no unregistration at all
streaming: ditto
This patch adds the needed unregistration methods.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Handlers for each verb have both -- register and unregister helpers, but unregistration ones
for some verbs are missing, so here they are.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The proxy's stop method is not called (and unlikely will be soon), but stopping
the message handlers is needed now, so prepare the existing method for this.'
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
If ring_delay == 0, something fishy is going on, e.g. single-node tests
are being performed. In this case we want the CDC generation to start
operating immediately. There is no need to wait until it propagates to
the cluster.
You should not use ring_delay == 0 in production.
Fixes https://github.com/scylladb/scylla/issues/6864.
"
This adds a '--io-setup N' command line option, which users can pass to
specify whether they want to run the "scylla_io_setup" script or not.
This is useful if users want to specify I/O settings themselves in
environments such as Kubernetes, where running "iotune" is problematic.
While at it, add the same option to "scylla_setup" to keep the interface
between that script and Docker consistent.
Fixes#6587
"
* penberg-penberg/docker-no-io-setup:
scylla_setup: Add '--io-setup ENABLE' command line option
dist/docker: Add '--io-setup ENABLE' command line option
seastar::make_lw_shared has a constructor taking a T&&. There is no
such constructor in std::make_shared:
https://en.cppreference.com/w/cpp/memory/shared_ptr/make_shared
This means that we have to move from
make_lw_shared(T(...)
to
make_lw_shared<T>(...)
If we don't want to depend on the idiosyncrasies of
seastar::make_lw_shared.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This replaces a lot of make_lw_shared(schema(...)) with
make_shared_schema(...).
This makes it easier to drop a dependency on the differences between
seastar::make_shared and std::make_shared.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
seastar::make_shared has a constructor taking a T&&. There is no such
constructor in std::make_shared:
https://en.cppreference.com/w/cpp/memory/shared_ptr/make_shared
This means that we have to move from
make_shared(T(...)
to
make_shared<T>(...)
If we don't want to depend on the idiosyncrasies of
seastar::make_shared.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This moves a few calls to make_shared to a single location.
This makes it easier to drop a dependency on the differences between
seastar::make_shared and std::make_shared.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This moves a few calls to make_shared to a single location.
This makes it easier to drop a dependency on the differences between
seastar::make_shared and std::make_shared.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
* seastar 4a99d56453...02ad74fa7d (5):
> TLS: Use "known" (precalculated) DH parameters if available
> tutorial: fix advanced service_loop examples
> tutorial: further fix service_loop example text
> linux-aio: make the RWF_NOWAIT support work again
> locking_test: Fix a use after return
To make the "scylla_setup" interface similar to Docker image, let's add
a "--io-setup ENABLE" command line option. The old "--no-io-setup"
option is retained for compatibility.
This adds a '--io-setup N' command line option, which users can pass to
specify whether they want to run the "scylla_io_setup" script or not.
This is useful if users want to specify I/O settings themselves in
environments such as Kubernetes, where running "iotune" is problematic.
Fixes#6587
A test case which reproduces the view update generator hang, where the
staging reader consumes all resources and leaves none for the pre-image
reader which blocks on the semaphore indefinitely.
We recently saw a weird log message:
WARN 2020-07-19 10:22:46,678 [shard 0] repair - repair id [id=4,
uuid=0b1092a1-061f-4691-b0ac-547b281ef09d] failed: std::runtime_error
({shard 0: fmt::v6::format_error (invalid type specifier), shard 1:
fmt::v6::format_error (invalid type specifier)})
It turned out we have:
throw std::runtime_error(format("repair id {:d} on shard {:d} failed to
repair {:d} sub ranges", id, shard, nr_failed_ranges));
in the code, but we changed the id from integer to repair_uniq_id class.
We do not really need to specify the format specifiers for numbers.
Fixes#6874
To allow tests to reliably calculate the amount of resources they need
to consume in order to effectively reduce the resources of the semaphore
to a desired amount. Using `available_resources()` is not reliable as it
doesn't factor in resources that are consumed at the moment but will be
returned later.
This will also benefit debugging coredumps where we will now be able to
tell how much resources the semaphore was created with and this
calculate the amount of memory and count currently used.
So that tests can test the `view_update_consumer` in isolation, without
having to set up the whole database machinery. In addition to less
infrastructure setup, this allows more direct checking of mutations
pushed for view generation.
The view update generation process creates two readers. One is used to
read the staging sstables, the data which needs view updates to be
generated for, and another reader for each processed mutation, which
reads the current value (pre-image) of each row in said mutation. The
staging reader is created first and is kept alive until all staging data
is processed. The pre-image reader is created separately for each
processed mutation. The staging reader is not restricted, meaning it
does not wait for admission on the relevant reader concurrency
semaphore, but it does register its resource usage on it. The pre-image
reader however *is* restricted. This creates a situation, where the
staging reader possibly consumes all resources from the semaphore,
leaving none for the later created pre-image reader, which will not be
able to start reading. This will block the view building process meaning
that the staging reader will not be destroyed, causing a deadlock.
This patch solves this by making the staging reader restricted and
making it evictable. To prevent thrashing -- evicting the staging reader
after reading only a really small partition -- we only make the staging
reader evictable after we have read at least 1MB worth of data from it.
A variant of `make_range_sstable_reader()` that wraps the reader in a
restricting reader, hence making it wait for admission on the read
concurrency semaphore, before starting to actually read.
Staging SSTables can be incorrectly added or removed from the backlog tracker,
after an ALTER TABLE or TRUNCATE, because the add and removal don't take
into account if the SSTable requires view building, so a Staging SSTable can
be added to the tracker after a ALTER table, or removed after a TRUNCATE,
even though not added previously, potentially causing the backlog to
become negative.
Fixes#6798.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200716180737.944269-1-raphaelsc@scylladb.com>
test.py passes the "--junit-xml" option to test/alternator/run, which passes
this option to pytest to get an xunit-format summary of the test results.
However, unfortunately until very recent versions (which aren't yet in Linux
distributions), pytest defaulted to a non-standard xunit format which tools
like Jenkins couldn't properly parse. The more standard format can be chosen
by passing the option "-o junit_family=xunit2", so this is what we do in
this patch.
Fixes#6767 (hopefully).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200719203414.985340-1-nyh@scylladb.com>
"
The set's goal is to reduce the indirect fanout of 3 headers only,
but likely affects more. The measured improvement rates are
flat_mutation_reader.hh: -80%
mutation.hh : -70%
mutation_partition.hh : -20%
tests: dev-build, 'checkheaders' for changed headers (the tree-wide
fails on master)
"
* 'br-debloat-mutation-headers' of https://github.com/xemul/scylla:
headers:: Remove flat_mutation_reader.hh from several other headers
migration_manager: Remove db/schema_tables.hh inclustion into header
storage_proxy: Remove frozen_mutation.hh inclustion
storage_proxy: Move paxos/*.hh inclusions from .hh to .cc
storage_proxy: Move hint_wrapper from .hh to .cc
headers: Remove mutation.hh from trace_state.hh
schema_mutations
When upgrading from a version that lacks some schema features,
during the transition, when we have a mixed cluster. Schema digests
are calculated without taking into account the mixed cluster supported
features. Every node calculate the digest as if the whole cluster supports
its supported features.
Scylla already has a mechanism of redaction to the lowest common
denominator, but it haven't been used in this context.
This commit is using the redaction mechanism when calculating the digest on
the newly added table so it will match the supported features of the
whole cluster.
Tests: Manual upgrading - upgraded to a version with an additional
feature and additional schema column and validated that the digest
of the tables schema is identical on every node on the mixed cluster.
All they can live with forward declaration of the f._m._r. plus a
seastar header in commitlog code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The schema_tables.hh -> migration_manager.hh couple seems to work as one
of "single header for everyhing" creating big blot for many seemingly
unrelated .hh's.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's only used there, but requires mutation_query.hh, which can thus be
removed from storage_proxy.hh
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In test_streams.py, we had the line:
assert desc['StreamDescription'].get('StreamLabel')
In Alternator, the 'StreamLabel' attribute is missing, which the author of
this test probably thought would cause this test to fail (which is expected,
the test is marked with "xfail"). However, my version of pytest actually
doesn't like that assert is given a value instead of a comparison, and we
get the warning message:
PytestAssertRewriteWarning: asserting the value None, please use "assert is None"
I think that the nicest replacement for this line is
assert 'StreamLabel' in desc['StreamDescription']
This is very readable, "pythonic", and checks the right thing - it checks
that the JSON must include the 'StreamLabel' item, as the get() assertion
was supposed to have been doing.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200716124621.906473-1-nyh@scylladb.com>
If external is true, _u.ptr is not null. An empty managed_bytes uses
the internal representation.
The current code looks scary, since it seems possible that backref
would still point to the old location, which would invite corruption
when the reclaimer runs.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200716233124.521796-1-espindola@scylladb.com>
* seastar 0fe32ec59...4a99d5645 (3):
> httpd: Don't warn on ECONNABORTED
> httpd: Avoid calling future::then twice on the same future
Fixes#6709.
> futures: Add a test for a broken promise in repeat
Ninja has a special pool called console that causes programs in that
pool to output directly to the console instead of being logged. By
putting test.py in it is now possible to run just
$ ninja dev-test
And see the test.py output while it is running.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200716204048.452082-1-espindola@scylladb.com>
Besdies being more robust than passing const descriptor&
to continuations, this helps simplify making allocate_segment_ex's
continuations nothrow_move_constructible, that is need for using
seastar::with_file_close_on_failure().
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
"
Fix#6825 by explicitly distinguishing single- from multi-column expressions in AST.
Tests: unit (dev), dtest secondary_indexes_test.py (dev)
"
* dekimir-single-multiple-ast:
cql3/restrictions: Separate AST for single column
cql3/restrictions: Single-column helper functions
Corresponding overload of `storage_proxy::mutate_locally`
was hardcoded to pass `db::commitlog::force_sync::no` to the
`database::apply`. Unhardcode it and substitute `force_sync::no`
to all existing call sites (as it were before).
`force_sync::yes` will be used later for paxos learn writes
when trying to apply mutations upgraded from an obsolete
schema version (similar to the current case when applying
locally a `frozen_mutation` stored in accepted proposal).
Tests: unit(dev)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200716124915.464789-1-pa.solodovnikov@scylladb.com>
"
We'd like to use the same uuid both for printing compaction log
messages and to update compaction_history.
Generate one when starting compaction and keep it in
compaction_info. Then use it by convention in all
compaction log messages, along with compaction type,
and keyspace.table information. Finally, use the
same uuid to update compaction_history.
Fixes#6840
"
* tag 'compaction-uuid-v1' of github.com:bhalevy/scylla:
compaction: print uuid in log messages
compaction: report_(start|finish): just return description
compaction: move compaction uuid generation to compaction_info
The targets {dev|debug|release}-test run all unit tests, including
alternator/run. But this test requires the Scylla executable, which
wasn't among the dependencies. Fix it by adding build/$mode/scylla to
the dependency list.
Fixes#6855.
Tests: `ninja dev-test` after removing build/dev/scylla
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
It is legal for a user to create a table with name that has a
_scylla_cdc_log suffix. In such case, the table won't be treated as a
cdc log table, and does not require a corresponding base table to exist.
During refactoring done as a part of initial implemetation of of
Alternator streams (#6694), `is_log_for_some_table` started throwing
when trying to check a name like `X_scylla_cdc_log` when there was no
table with name `X`. Previously, it just returned false.
The exception originates inside `get_base_table`, which tries to return
the base table schema, not checking for its existence - which may throw.
It makes more sense for this function to return nullptr in such case (it
already does when provided log table name does not have the cdc log
suffix), so this patch adds an explicit check and returns nullptr when
necessary.
A similar oversight happened before (see #5987), so this patch also adds
a comment which explains why existence of `X_scylla_cdc_log` does not
imply existence of `X`.
Fixes: #6852
Refs: #5724, #5987
By convention, print the following information
in all compaction log messages:
[{compaction.type} {keyspace}.{table} {compaction.uuid}]
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than logging the message in the virtual callee method
just return a string description and make the logger call in
the common caller.
1. There is no need to do the logger call in the callee,
it is simpler to format the log message in the the caller
and just retrieve the per-compaction-type description.
2. Prepare to centrally print the compaction uuid.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We'd like to use the same uuid both for printing compaction log
messages and to update compaction_history.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Existing AST assumes the single-column expression is a special case of
multi-column expressions, so it cannot distinguish `c=(0)` from
`(c)=(0)`. This leads to incorrect behaviour and dtest failures. Fix
it by separating the two cases explicitly in the AST representation.
Modify AST-creation code to create different AST for single- and
multi-column expressions.
Modify AST-consuming code to handle column_name separately from
vector<column_name>. Drop code relying on cardinality testing to
distinguisn single-column cases.
Add a new unit test for `c=(0)`.
Fixes#6825.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
This commit is separated out for ease of review. It introduces some
functions that subsequent commits will use, thereby reducing diff
complexity of those subsequent commits. Because the new functions
aren't invoked anywhere, they are guarded by `#if 0` to avoid
unused-function errors.
The new functions perform the same work as their existing namesakes,
but assuming single-column expressions. The old versions continue to
try handling single-column as a special case of multi-column,
remaining unable to distinguish between `c = (1)` and `(c) = (1)`.
This will change in the next commit, which will drop attempts to
handle single-column cases from existing functions.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Currently, repair uses an integer to identify a repair job. The repair
id starts from 1 since node restart. As a result, different repair jobs
will have same id across restart.
To make the id more unique across restart, we can use an uuid in
addition to the integer id. We can not drop the use of the integer id
completely since the http api and nodetool use it.
Fixes#6786
The markdown syntax in the contribution section is incorrect, which is
why the links appear on the same line.
Improve the contribution section by making it more explicit what the
links are about.
Message-Id: <20200714070716.143768-1-penberg@scylladb.com>
This test usually fails, with the following error. Marking it "xfail" until
we can get to the bottom of this.
dynamodb = dynamodb.ServiceResource()
dynamodbstreams = <botocore.client.DynamoDBStreams object at 0x7fa91e72de80>
def test_get_records(dynamodb, dynamodbstreams):
# TODO: add tests for storage/transactionable variations and global/local index
with create_stream_test_table(dynamodb, StreamViewType='NEW_AND_OLD_IMAGES') as table:
arn = wait_for_active_stream(dynamodbstreams, table)
p = 'piglet'
c = 'ninja'
val = 'lucifers'
val2 = 'flowers'
> table.put_item(Item={'p': p, 'c': c, 'a1': val, 'a2': val2})
test_streams.py:316:
...
E botocore.exceptions.ClientError: An error occurred (Internal Server Error) when calling the PutItem operation (reached max retries: 3): Internal server error: std::runtime_error (cdc::metadata::get_stream: could not find any CDC stream (current time: 2020/07/15 17:26:36). Are we in the middle of a cluster upgrade?)
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Merged pull request https://github.com/scylladb/scylla/pull/6694
by Calle Wilund:
Implementation of DynamoDB streams using Scylla CDC.
Fixes#5065
Initial, naive implementation insofar that it uses 1:1 mapping CDC stream to
DynamoDB shard. I.e. there are a lot of shards.
Includes tests verified against both local DynamoDB server and actual AWS
remote one.
Note:
Because of how data put is implemented in alternator, currently we do not
get "proper" INSERT labels for first write of data, because to CDC it looks
like an update. The test compensates for this, but actual users might not
like it.
In the script test/alternator/run, which runs Scylla for the Alternator
tests, add the "--experimental-features=cdc" option, to allow us testing
the streams API whose implementation requires the experimenal CDC feature.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
* tools/java 50dbf77123...3eca0e3511 (1):
> config: Avoid checking options and filtering scylla.yaml
* tools/jmx 5820992...aa94fe5 (3):
> dist: redhat: reduce log spam from unpacking sources when building rpm
> Merge 'gitignore: fix typo and add scylla-apiclient/target/' from Benny
> apiclient: Bump Jackson version to 2.10.4
On .deb package with new relocatable package format, all files moved to
under scylla/ directory.
So we need to call ./scylla/install.sh on debian/rules, but it does not work
correctly, since install.sh does not support calling from other directory.
To support this, we need to changedir to scylla top directory before
copying files.
Merge pull request https://github.com/scylladb/scylla/pull/6834 by
Juliusz Stasiewicz:
NULLs used to give false positives in GT, LT, GEQ and LEQ ops performed upon
ALLOW FILTERING. That was a consequence of not distinguishing NULL from an
empty buffer.
This patch excludes NULLs on high level, preventing them from entering LHS
of any comparison, i.e. it assumes that any binary operation should return
false whenever the LHS operand is NULL (note: at the moment filters with
RHS NULL, such as ...WHERE x=NULL ALLOW FILTERING, return empty sets anyway).
Fixes#6295
* '6295-do-not-compare-nulls-v2' of github.com:jul-stas/scylla:
filtering_test: check that NULLs do not compare to normal values
cql3/restrictions: exclude NULLs from comparison in filtering
The data model is now
bplus::tree<Key = int64_t, T = array<entry>>
where entry can be cache_entry or memtable_entry.
The whole thing is encapsulated into a collection called "double_decker"
from patch #3. The array<T> is an array of T-s with 0-bytes overhead used
to resolve hash conflicts (patch #2).
branch:
tests: unit(debug)
tests before v7:
unit(debug) for new collections, memtable and row_cache
unit(dev) for the rest
perf(dev)
* https://github.com/xemul/scylla/commits/row-cache-over-bptree-9:
test: Print more sizes in memory_footprint_test
memtable: Switch onto B+ rails
row_cache: Switch partition tree onto B+ rails
memtable: Count partitions separately
token: Introduce raw() helper and raw comparator
row-cache: Use ring_position_comparator in some places
dht: Detach ring_position_comparator_for_sstables
double-decker: A combination of B+tree with array
intrusive-array: Array with trusted bounds
utils: B+ tree implementation
test: Move perf measurement helpers into header
impl_count_function doesn't explicitly initialize _count, so its correctness
depends on default initialization. Let's explicitly initialize _count to
make the code future proof.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200714162604.64402-1-raphaelsc@scylladb.com>
Option to control the alternator streams CDC query/shard range time
confidence interval, i.e. the period we enforce as timestamp threshold
when reading. The default, 10s, should be sufficient on a normal
cluster, but across DCs:, or with client timestamps or whatever, one might
need a larger window.
Subroutines needed by (in this case) streams implementation
moved from being file-static to class-static (exported).
To make putting handler routines in separate sources possible.
Because executor.cc is large and slow to compile.
Separation is nice.
Unfortunately, not all methods can be kept class-private,
since unrelated types also use them.
Reviewer suggested to instead place there is a top-level
header for export, i.e. not class-private at all.
I am skipping that for now, mainly because I can't come up
with a good file name. Can be part of a generate refactor
of helper routine organization in executor.
Add types exception overloads for ValidationException, ResourceNotFoundException, etc,
to avoid writing explicit error type as string everywhere (with the potential for
spelling errors ever present).
Also allows intellisense etc to complete the exception when coded.
To allow immediate json value conversion for types we
have TypeHelper<...>:s for.
Typed opt-get to get both automatic type conversion, _and_
find functionality in one call.
Tested operators are: `<` and `>`. Tests all types of NULLs except
`duration` because durations are explicitly not comparable. Values
to compare against were chosen arbitrarily.
The row cache memory footprint changed after switch to B+
because we no longer have a sole cache_entry allocation, but
also the bplus::data and bplus::node. Knowing their sizes
helps analyzing the footprint changes.
Also print the size of memtable_entry that's now also stored
in B+'s data.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The change is the same as with row-cache -- use B+ with int64_t token
as key and array of memtable_entry-s inside it.
The changes are:
Similar to those for row_cache:
- compare() goes away, new collection uses ring_position_comparator
- insertion and removal happens with the help of double_decker, most
of the places are about slightly changed semantics of it
- flags are added to memtable_entry, this makes its size larger than
it could be, but still smaller than it was before
Memtable-specific:
- when the new entry is inserted into tree iterators _might_ get
invalidated by double-decker inner array. This is easy to check
when it happens, so the invalidation is avoided when possible
- the size_in_allocator_without_rows() is now not very precise. This
is because after the patch memtable_entries are not allocated
individually as they used to. They can be squashed together with
those having token conflict and asking allocator for the occupied
memory slot is not possible. As the closest (lower) estimate the
size of enclosing B+ data node is used
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The row_cache::partitions_type is replaced from boost::intrusive::set
to bplus::tree<Key = int64_t, T = array_trusted_bounds<cache_entry>>
Where token is used to quickly locate the partition by its token and
the internal array -- to resolve hashing conflicts.
Summary of changes in cache_entry:
- compare's goes away as the new collection needs tri-compare one which
is provided by ring_position_comparator
- when initialized the dummy entry is added with "after_all_keys" kind,
not "before_all_keys" as it was by default. This is to make tree
entries sorted by token
- insertion and removing of cache_entries happens inside double_decker,
most of the changes in row_cache.cc are about passing constructor args
from current_allocator.construct into double_decker.empace_before()
- the _flags is extended to keep array head/tail bits. There's a room
for it, sizeof(cache_entry) remains unchanged
The rest fits smothly into the double_decker API.
Also, as was told in the previous patch, insertion and removal _may_
invalidate iterators, but may leave them intact. However, currently
this doesn't seem to be a problem as the cache_tracker ::insert() and
::on_partition_erase do invalidate iterators unconditionally.
Later this can be otimized, as iterators are invalidated by double-decker
only in case of hash conflict, otherwise it doesn't change arrays and
B+ tree doesn't invalidate its.
tests: unit(dev), perf(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In next patches the entries having token on-board will be
moved onto B+-tree rails. For this the int64_t value of the
token will be used as B+ key, so prepare for this.
One corner case -- the after_all_keys tokens must be resolved
to int64::max value to appear at the "end" of the tree. This
is not the same as "before_all_keys" case, which maps to the
int64::min value which is not allowed for regular tokens. But
for the sake of B+ switch this is OK, the conflicts of token
raw values are explicitly resolved in next patches.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The row cache (and memtable) code uses own comparators built on top
of the ring_position_comparator for collections of partitions. These
collections will be switched from the key less-compare to the pair
of token less-compare + key tri-compare.
Prepare for the switch by generalizing the ring_partition_comparator
and by patching all the non-collections usage of less-compare to use
one.
The memtable code doesn't use it outside of collections, but patch it
anyway as a part of preparations.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Next patches will generalize ring_position_comparator with templates
to replace cache_entry's and memtable_entry's comparators. The overload
of operator() for sstables has its own implementation, that differs from
the "generic" one, for smoother generalization it's better to detach it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The collection is K:V store
bplus::tree<Key = K, Value = array_trusted_bounds<V>>
It will be used as partitions cache. The outer tree is used to
quickly map token to cache_entry, the inner array -- to resolve
(expected to be rare) hash collisions.
It also must be equipped with two comparators -- less one for
keys and full one for values. The latter is not kept on-board,
but it required on all calls.
The core API consists of just 2 calls
- Heterogenuous lower_bound(search_key) -> iterator : finds the
element that's greater or equal to the provided search key
Other than the iterator the call returns a "hint" object
that helps the next call.
- emplace_before(iterator, key, hint, ...) : the call construct
the element right before the given iterator. The key and hint
are needed for more optimal algo, but strictly speaking not
required.
Adding an entry to the double_decker may result in growing the
node's array. Here to B+ iterator's .reconstruct() method
comes into play. The new array is created, old elements are
moved onto it, then the fresh node replaces the old one.
// TODO: Ideally this should be turned into the
// template <typename OuterCollection, typename InnerCollection>
// but for now the double_decker still has some intimate knowledge
// about what outer and inner collections are.
Insertion into this collection _may_ invalidate iterators, but
may leave intact. Invalidation only happens in case of hashing
conflict, which can be clearly seen from the hint object, so
there's a good room for improvement.
The main usage by row_cache (the find_or_create_entry) looks like
cache_entry find_or_create_entry() {
bound_hint hint;
it = lower_bound(decorated_key, &hint);
if (!hint.found) {
it = emplace_before(it, decorated_key.token(), hint,
<constructor args>)
}
return *it;
}
Now the hint. It contains 3 booleans, that are
- match: set to true when the "greater or equal" condition
evaluated to "equal". This frees the caller from the need
to manually check whether the entry returned matches the
search key or the new one should be inserted.
This is the "!found" check from the above snippet.
To explain the next 2 bools, here's a small example. Consider
the tree containing two elements {token, partition key}:
{ 3, "a" }, { 5, "z" }
As the collection is sorted they go in the order shown. Next,
this is what the lower_bound would return for some cases:
{ 3, "z" } -> { 5, "z" }
{ 4, "a" } -> { 5, "z" }
{ 5, "a" } -> { 5, "z" }
Apparently, the lower bound for those 3 elements are the same,
but the code-flows of emplacing them before one differ drastically.
{ 3, "z" } : need to get previous element from the tree and
push the element to it's vector's back
{ 4, "a" } : need to create new element in the tree and populate
its empty vector with the single element
{ 5, "a" } : need to put the new element in the found tree
element right before the found vector position
To make one of the above decisions the .emplace_before would need
to perform another set of comparisons of keys and elements.
Fortunately, the needed information was already known inside the
lower_bound call and can be reported via the hint.
Said that,
- key_match: set to true if tree.lower_bound() found the element
for the Key (which is token). For above examples this will be
true for cases 3z and 5a.
- key_tail: set to true if the tree element was found, but when
comparing values from array the bounding element turned out
to belong to the next tree element and the iterator was ++-ed.
For above examples this would be true for case 3z only.
And the last, but not least -- the "erase self" feature. Which is
given only the cache_entry pointer at hands remove it from the
collection. To make this happen we need to make two steps:
1. get the array the entry sits in
2. get the b+ tree node the vectors sits in
Both methods are provided by array_trusted_bounds and bplus::tree.
So, when we need to get iterator from the given T pointer, the algo
looks like
- Walk back the T array untill hitting the head element
- Call array_trusted_bounds::from_element() getting the array
- Construct b+ iterator from obtained array
- Construct the double_decker iterator from b+ iterator and from
the number of "steps back" from above
- Call double_decker::iterator.erase()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
A plain array of elements that grows and shrinks by
constructing the new instance from an existing one and
moving the elements from it.
Behaves similarly to vector's external array, but has
0-bytes overhead. The array bounds (0-th and N-th
elemements) are determined by checking the flags on the
elements themselves. For this the type must support
getters and setters for the flags.
To remove an element from array there's also a nothrow
option that drops the requested element from array,
shifts the righter ones left and keeps the trailing
unused memory (so called "train") until reconstruction
or destruction.
Also comes with lower_bound() helper that helps keeping
the elements sotred and the from_element() one that
returns back reference to the array in which the element
sits.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
// The story is at
// https://groups.google.com/forum/#!msg/scylladb-dev/sxqTHM9rSDQ/WqwF1AQDAQAJ
This is the B+ version which satisfies several specific requirements
to be suitable for row-cache usage.
1. Insert/Remove doesn't invalidate iterators
2. Elements should be LSA-compactable
3. Low overhead of data nodes (1 pointer)
4. External less-only comparator
5. As little actions on insert/delete as possible
6. Iterator walks the sorted keys
The design, briefly is:
There are 3 types of nodes: inner, leaf and data, inner and leaf
keep build-in array of N keys and N(+1) nodes. Leaf nodes sit in
a doubly linked list. Data nodes live separately from the leaf ones
and keep pointers on them. Tree handler keeps pointers on root and
left-most and right-most leaves. Nodes do _not_ keep pointers or
references on the tree (except 3 of them, see below).
changes in v9:
- explicitly marked keys/kids indices with type aliases
- marked the whole erase/clear stuff noexcept
- disposers now accept object pointer instead of reference
- clear tree in destructor
- added more comments
- style/readability review comments fixed
Prior changes
**
- Add noexcepts where possible
- Restrict Less-comparator constraint -- it must be noexcept
- Generalized node_id
- Packed code for beging()/cbegin()
**
- Unsigned indices everywhere
- Cosmetics changes
**
- Const iterators
- C++20 concepts
**
- The index_for() implmenetation is templatized the other way
to make it possible for AVX key search specialization (further
patching)
**
- Insertion tries to push kids to siblings before split
Before this change insertion into full node resulted into this
node being split into two equal parts. This behaviour for random
keys stress gives a tree with ~2/3 of nodes half-filled.
With this change before splitting the full node try to push one
element to each of the siblings (if they exist and not full).
This slows the insertion a bit (but it's still way faster than
the std::set), but gives 15% less total number of nodes.
- Iterator method to reconstruct the data at the given position
The helper creates a new data node, emplaces data into it and
replaces the iterator's one with it. Needed to keep arrays of
data in tree.
- Milli-optimize erase()
- Return back an iterator that will likely be not re-validated
- Do not try to update ancestors separation key for leftmost kid
This caused the clear()-like workload work poorly as compared to
std:set. In particular the row_cache::invalidate() method does
exactly this and this change improves its timing.
- Perf test to measure drain speed
- Helper call to collect tree counters
**
- Fix corner case of iterator.emplace_before()
- Clean heterogenous lookup API
- Handle exceptions from nodes allocations
- Explicitly mark places where the key is copied (for future)
- Extend the tree.lower_bound() API to report back whether
the bound hit the key or not
- Addressed style/cleanness review comments
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
1. storage_service: Make handle_state_left more robust when tokens are empty
In case the tokens for the node to be removed from the cluster are
empty, log the application_state of the leaving node to help understand
why the tokens are empty and try to get the tokens from token_metadata.
2. token_metadata: Do not throw if empty tokens are passed to remove_bootstrap_tokens
Gossip on_change callback calls storage_service::excise which calls
remove_bootstrap_tokens to remove the tokens of the leaving node from
bootstrap tokens. If empty tokens, e.g., due to gossip propagation issue
as we saw in #6468, are passed
to remove_bootstrap_tokens, it will throw. Since the on_change callback
is marked as noexcept, such throw will cause the node to terminate which
is an overkill.
To avoid such error causing the whole cluster to down in worse cases,
just log the tokens are empty passed to remove_bootstrap_tokens.
Refs #6468
Currently, when update_normal_tokens is called, a warning logged.
Token X changing ownership from A to B
It is not correct to log so because we can call update_normal_tokens
against a temporary token_metadata object during topology calculation.
Refs: #6437
NULLs used to give false positives in GT, LT, GEQ and LEQ ops
performed upon `ALLOW FILTERING`. That was a consequence of
not distinguishing NULL from an empty buffer. This patch excludes
NULLs on high level, preventing them from entering any comparison,
i.e. it assumes that any binary operator should return `false`
whenever one of the operands is NULL (note: ATM filters such as
`...WHERE x=NULL ALLOW FILTERING` return empty sets anyway).
`restriction_test/regular_col_slice` had to be updated accordingly.
Fixes#6295
The default ninja build target now builds artifacts and packages. Let's
add a 'build' target that only builds the artifacts.
Message-Id: <20200714105042.416698-1-penberg@scylladb.com>
Consider a cluster with two nodes:
- n1 (dc1)
- n2 (dc2)
A third node is bootstrapped:
- n3 (dc2)
The n3 fails to bootstrap as follows:
[shard 0] init - Startup failed: std::runtime_error
(bootstrap_with_repair: keyspace=system_distributed,
range=(9183073555191895134, 9196226903124807343], no existing node in
local dc)
The system_distributed keyspace is using SimpleStrategy with RF 3. For
the keyspace that does not use NetworkTopologyStrategy, we should not
require the source node to be in the same DC.
Fixes: #6744
Backports: 4.0 4.1, 4.2
This adds a short project description to README to make the git
repository more discoverable. The text is an edited version of a Scylla
blurb provided by Peter Corless.
Message-Id: <20200714065726.143147-1-penberg@scylladb.com>
This patch changes the per table latencies histograms: read, write,
cas_prepare, cas_accept, and cas_learn.
Beside changing the definition type and the insertion method, the API
was changed to support the new metrics.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch changes the row locking latencies to use
time_estimated_histogram.
The change consist of changing the histogram definition and changing how
values are inserted to the histogram.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch moves the alternator latencies histograms to use the time_estimated_histogram.
The changes requires changing the defined type and use the simpler
insertion method.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
In case a row hash conflict, a hash in set_diff will get more than one
row from get_row_diff.
For example,
Node1 (Repair master):
row1 -> hash1
row2 -> hash2
row3 -> hash3
row3' -> hash3
Node2 (Repair follower):
row1 -> hash1
row2 -> hash2
We will have set_diff = {hash3} between node1 and node2, while
get_row_diff({hash3}) will return two rows: row3 and row3'. And the
error below was observed:
repair - Got error in row level repair: std::runtime_error
(row_diff.size() != set_diff.size())
In this case, node1 should send both row3 and row3' to peer node
instead of fail the whole repair. Because node2 does not have row3 or
row3', otherwise node1 won't send row with hash3 to node1 in the first
place.
Refs: #6252
The test/alternator/run script creates a temporary directory for the Scylla
database in /tmp. The assumption was that this is the fastest disk (usually
even a ramdisk) on the test machine, and we didn't need anything else from
it.
But it turns out that on some systems, /tmp is actually a slow disk, so
this patch adds a way to configure the temporary directory - if the TMPDIR
environment variable exists, it is used instead of /tmp. As before this
patch, a temporary subdirectry is created in $TMPDIR, and this subdirectory
is automatically deleted when the test ends.
The test.py script already passes an appropriate TMPDIR (testlog/$mode),
which after this patch the Alternator test will use instead of /tmp.
Fixes#6750
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200713193023.788634-1-nyh@scylladb.com>
Export TMPDIR environment variable pointing at a subdir of testlog.
This variable is used by seastar/scylla tests to create a
a subdirectory with temporary test data. Normally a test cleans
up the temporary directory, but if it crashes or is killed the
directory remains.
By resetting the default location from /tmp to testlog/{mode}
we allow test.py we consolidate all test artefacts in a single
place.
Fixes#6062, "test.py uses tmpfs"
* seastar 5632cf2146...0fe32ec596 (11):
> futures: Add a test for a broken promise in a parallel_for_each
> future: Simplify finally_body implementation
> futures_test: Extend nested_exception test
> Merge "make gate methods noexcept" from Benny
> tutorial: fix service_loop example
> sharded: fix doxygen \example clause for sharded_parameter
> Merge "future: Don't call need_preempt in 'then' and 'then_impl'" from Rafael
> future: Refactor a bit of duplicated code
> Merge "Add with_file helpers" from Benny
> Merge "Fix doxygen warnings" from Benny
> build: add doxygen to install-dependencies.sh
While fetching CDC generations, various exceptions can occur. They
are divided into "fatal" and "nonfatal", where "fatal" ones prevent
retrying of the fetch operation.
This patch makes `read_failure_exception` "non-fatal", because such
error may appear during restart. In general this type of error can
mean a few different things (e.g. an error code in a response from
replica, but also a broken connection) so retrying seems reasonable.
Fixes#6804
Consolidate the Docker image build instructions into the "Building Scylla"
section of the README instead of having it in a separate section in a different
place of the file.
Message-Id: <20200713132600.126360-1-penberg@scylladb.com>
Currently, if a user tries to CreateTable with a forbidden set of tags,
e.g., the Tags list is too long or contains an invalid value for
system:write_isolation, then the CreateTable request fails but the table
is still created. Without the tag of course.
This patch fixes this bug, and adds two test cases for it that fail
before this patch, and succeed with it. One of the test cases is
scylla_only because it checks the Scylla-specific system:write_isolation
tag, but the second test case works on DynamoDB as well.
What this patch does is to split the update_tags() function into two
parts - the first part just parses the Tags, validates them, and builds
a map. Only the second part actually writes the tags to the schema.
CreateTable now does the first part early, before creating the table,
so failure in parsing or validating the Tags will not leave a created
table behind.
Fixes#6809.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200713120611.767736-1-nyh@scylladb.com>
When tests are run in parallel, it is hard to tell how much time each test
ran. The time difference between consecutive printouts (indicating a test's
end) says nothing about the test's duration.
This patch adds in "--verbose" mode, at the end of each test result, the
duration in seconds (in wall-clock time) of the test. For example,
$ ./test.py --mode dev --verbose alternator
================================================================================
[N/TOTAL] TEST MODE RESULT
------------------------------------------------------------------------------
[1/2] boost/alternator_base64_test dev [ PASS ] 0.02s
[2/2] alternator/run dev [ PASS ] 26.57s
These durations are useful for recognizing tests which are especially slow,
or runs where all the tests are unusually slow (which might indicate some
sort of misconfiguration of the test machine).
Fixes#6759
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200706142109.438905-1-nyh@scylladb.com>
This adds a new "dist-<mode>" target, which builds the server package in
selected build mode together with the other packages, and wires it to
the "<mode>" target, which is built as part of default "ninja"
invocation.
This allows us to perform a full build, package, and test cycle across
all build modes with:
./configure.py && ninja && ./test.py
Message-Id: <20200713101918.117692-1-penberg@scylladb.com>
Improve the "pull_github_pr.sh" to detect the number of commits in a
pull request, and use "git cherry-pick" to merge single-commit pull
requests.
Message-Id: <20200713093044.96764-1-penberg@scylladb.com>
There are a bunch of renames that are done if PRODUCT is not the
default, but the Python code for them is incorrect. Path.glob()
is not a static method, and Path does not support .endswith().
Fix by constructing a Path object, and later casting to str.
Let's report each missing CPU feature individually, and improve the
error message a bit. For example, if the "clmul" instruction is missing,
the report looks as follows:
ERROR: You will not be able to run Scylla on this machine because its CPU lacks the following features: pclmulqdq
If this is a virtual machine, please update its CPU feature configuration or upgrade to a newer hypervisor.
Fixes#6528
Add links to the users and developers mailing lists, and the Slack
channel in README.md to make them more discoverable.
Message-Id: <20200713074654.90204-1-penberg@scylladb.com>
Simplify the build and run instructions by splitting the text in three
sections (prerequisites, building, and running) and streamlining the
steps a bit.
Message-Id: <20200713065910.84582-1-penberg@scylladb.com>
scylla fiber often fails to really unwind the entire fiber, stopping
sooner than expected. This is expected as scylla fiber only recognizes
the most standard continuations but can drop the ball as soon as there
is an unusual transmission.
This commits adds a message below the found tasks explaining that the
list might not be exhaustive and prints a command which can be used to
explain why the unwinding stopped at the last task.
While at it also rephrase an out-of-date comment.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200710120813.100009-1-bdenes@scylladb.com>
As requested in #5763 feedback, require that Fn be callable with
binary_operator in the functions mentioned above.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
The problem is that this option is defined in seastar testing wrapper,
while no unit tests use it, all just start themselves with app.run() and
would complain on unknown option.
"Would", because nowadays every single test in it declares its own options
in suite.yaml, that override test.py's defaults. Once an option-less unit
test is added (B+ tree ones) it will complain.
The proposal is to remove this option from defaults, if any unit test will
use the seastar testing wrappers and will need this option, it can add one
to the suite.yaml.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200709084602.8386-1-xemul@scylladb.com>
The goal is to make the lambdas, that are fed into partition cache's
clear_and_dispose() and erase_in_dispose(), to be noexcept.
This is to satisfy B+, which strictly requires those to be noexcept
(currently used collections don't care).
The set covers not only the strictly required minimum, but also some
other methods that happened to be nearby.
* https://github.com/xemul/scylla/tree/br-noexcepts-over-the-row-cache:
row_cache: Mark invalidation lambda as noexcept
cache_tracker: Mark methods noexcept
cache_entry: Mark methods noexcept
region: Mark trivial noexcept methods as such
allocation_strategy: Mark returning lambda as noexcept
allocation_strategy: Mark trivial noexcept methods as such
dht: Mark noexcept methods
The "NULL" operator in Expected (old-style conditional operations) doesn't
have any parameters, so we insisted that the AttributeValueList be empty.
However, we forgot to allow it to also be missing - a possibility which
DynamoDB allows.
This patch adds a test to reproduce this case (the test passes on DyanmoDB,
fails on Alternator before this patch, and succeeds after this patch), and
a fix.
Fixes#6816.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200709161254.618755-1-nyh@scylladb.com>
If any of the compared bytes_view's is empty
consider the empty prefix is same and proceed to compare
the size of the suffix.
A similar issue exists in legacy_compound_view::tri_comparator::operator().
It too must not pass nullptr to memcmp if any of the compared byte_view's
is empty.
Fixes#6797
Refs #6814
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Test: unit(dev)
Branches: all
Message-Id: <20200709123453.955569-1-bhalevy@scylladb.com>
Merged pull request https://github.com/scylladb/scylla/pull/6741
by Piotr Dulikowski:
This PR changes the algorithm used to generate preimages and postimages
in CDC log. While its behavior is the same for non-batch operations
(with one exception described later), it generates pre/postimages that
are organized more nicely, and account for multiple updates to the same
row in one CQL batch.
Fixes#6597, #6598
Tests:
- unit(dev), for each consecutive commit
- unit(debug), for the last commit
Previous method
The previous method worked on a per delta row basis. First, the base
table is queried for the current state of the rows being modified in
the processed mutation (this is called the "preimage query"). Then,
for each delta row (representing a modification of a row):
If preimage is enabled and the row was already present in the table,
a corresponding preimage row is inserted before the delta row.
The preimage row contains data taken directly from the preimage
query result. Only columns that are modified by the delta are
included in the preimage.
If postimage is enabled, then a postimage row is inserted after the
delta row. The postimage row contains data which was a result of
taking row data directly from the preimage query result and applying
the change the corresponding delta row represented. All columns
of the row are included in the postimage.
The above works well for simple cases such like singular CQL INSERT,
UPDATE, DELETE, or simple CQL BATCH-es. An example:
cqlsh:ks> BEGIN UNLOGGED BATCH
INSERT INTO tbl (pk, ck, v) VALUES (0, 1, 111);
INSERT INTO tbl (pk, ck, v) VALUES (0, 2, 222);
APPLY BATCH;
cqlsh:ks> SELECT "cdc$batch_seq_no", "cdc$operation", "cdc$ttl",
pk, ck, v from ks.tbl_scylla_cdc_log ;
cdc$batch_seq_no | cdc$operation | cdc$ttl | pk | ck | v
------------------+---------------+---------+----+----+-----
...snip...
0 | 0 | null | 0 | 1 | 100
1 | 2 | null | 0 | 1 | 111
2 | 9 | null | 0 | 1 | 111
3 | 0 | null | 0 | 2 | 200
4 | 2 | null | 0 | 2 | 222
5 | 9 | null | 0 | 2 | 222
Preimage rows are represented by cdc operation 0, and postimage by 9.
Please note that all rows presented above share the same value of
cdc$time column, which was not shown here for brevity.
Problems with previous approach
This simple algorithm has some conceptual and implementational problems
which arise when processing more complicated CQL BATCH-es. Consider
the following example:
cqlsh:ks> BEGIN UNLOGGED BATCH
INSERT INTO tbl (pk, ck, v1) VALUES (0, 0, 1) USING TTL 1000;
INSERT INTO tbl (pk, ck, v2) VALUES (0, 0, 2) USING TTL 2000;
APPLY BATCH;
cqlsh:ks> SELECT "cdc$batch_seq_no", "cdc$operation", "cdc$ttl",
pk, ck, v1, v2 FROM tbl_scylla_cdc_log;
cdc$batch_seq_no | cdc$operation | cdc$ttl | pk | ck | v1 | v2
------------------+---------------+---------+----+----+------+------
...snip...
0 | 0 | null | 0 | 0 | null | 0
1 | 2 | 2000 | 0 | 0 | null | 2
2 | 9 | null | 0 | 0 | 0 | 2
3 | 0 | null | 0 | 0 | 0 | null
4 | 1 | 1000 | 0 | 0 | 1 | null
5 | 9 | null | 0 | 0 | 1 | 0
A single cdc group (corresponding to rows sharing the same cdc$time)
might have more than one delta that modify the same row. For example,
this happens when modifying two columns of the same row with
different TTLs - due to our choice of CDC log schema, we must
represent such change with two delta rows.
It does not make sense to present a postimage after the first delta
and preimage before the second - both deltas are applied
simultaneously by the same CQL BATCH, so the middle "image" is purely
imaginary and does not appear at any point in the table.
Moreover, in this example, the last postimage is wrong - v1 is updated,
but v2 is not. None of the postimages presented above represent the
final state of the row.
New algorithm
The new algorithm works now on per cdc group basis, not delta row.
When starting processing a CQL BATCH:
Load preimage query results into a data structure representing
current state of the affected rows.
For each cdc group:
For each row modified within the group, a preimage is produced,
regardless if the row was present in the table. The preimage
is calculated based on the current state. Only include columns
that are modified for this row within the group.
For each delta, produce a delta row and update the current state
accordingly.
Produce postimages in the same way as preimages - but include all
columns for each row in the postimage.
The new algorithm produces postimage correctly when multiple deltas
affect one, because the state of the row is updated on the fly.
This algorithm moves preimage and postimage rows to the beginning and
the end of the cdc group, accordingly. This solves the problem of
imaginary preimages and postimages appearing inside a cdc group.
Unfortunately, it is possible for one CQL BATCH to contain changes that
use multiple timestamps. This will result in one CQL BATCH creating
multiple cdc groups, with different cdc$time. As it is impossible, with
our choice of schema, to tell that those cdc groups were created from
one CQL BATCH, instead we pretend as if those groups were separate CQL
operations. By tracking the state of the affected rows, we make sure
that preimage in later groups will reflect changes introduces in
previous groups.
One more thing - this algorithm should have the same results for
singular CQL operations and simple CQL BATCH-es, with one exception.
Previously, preimage not produced if a row was not present in the
table. Now, the preimage row will appear unconditionally - it will have
nulls in place of column values.
* 'cdc-pre-postimage-persistence' of github.com:piodul/scylla:
cdc: fix indentation
cdc: don't update partition state when not needed
cdc: implement pre/postimage persistence
cdc: add interface for producing pre/postimages
cdc: load preimage query result into partition state fields
cdc: introduce fields for keeping partition state
cdc: rename set_pk_columns -> allocate_new_log_row
cdc: track batch_no inside transformer
cdc: move cdc$time generation to transformer
cdc: move find_timestamp to split.cc
cdc: introduce change_processor interface
cdc: remove redundant schema arguments from cdc functions
cdc: move management of generated mutations inside transformer
cdc: move preimage result set into a field of transformer
cdc: keep ts and tuuid inside transformer
cdc: track touched parts of mutations inside transformer
cdc: always include preimage for affected rows
All but few are trivially such.
The clear_continuity() calls cache_entry::set_continuous() that had become noexcept
a patch ago.
The allocator() calls region.allocator() which had been marked noexcept few patches
back.
The on_partition_erase() calls allocator().invalidate_references(), both had
been marked noexcept few patches back.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All but one are trivially such, the position() one calls is_dummy_entry()
which has become noexcept right now.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
cquery_nofail returns the query result, not a future. Invoking .get()
on its result is unnecessary. This just happened to compile because
shared_ptr has a get() method with the same signature as future::get.
Tests: cql_query_test unit test (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
When "trace"-level logging is enabled for Alternator, we log every request,
but currently only the request's body. For debugging, it is sometimes useful
to also see the headers - which are important to debug authentication,
for example. So let's print the headers as well.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200709103414.599883-1-nyh@scylladb.com>
In one of the longevity tests, we observed 1.3s reactor stall which came from
repair_meta::get_full_row_hashes_source_op. It traced back to a call to
std::unordered_set::insert() which triggered big memory allocation and
reclaim.
I measured std::unordered_set, absl::flat_hash_set, absl::node_hash_set
and absl::btree_set. The absl::btree_set was the only one that seastar
oversized allocation checker did not warn in my tests where around 300K
repair hashes were inserted into the container.
- unordered_set:
hash_sets=295634, time=333029199 ns
- flat_hash_set:
hash_sets=295634, time=312484711 ns
- node_hash_set:
hash_sets=295634, time=346195835 ns
- btree_set:
hash_sets=295634, time=341379801 ns
The btree_set is a bit slower than unordered_set but it does not have
huge memory allocation. I do not measure real difference of total time
to finish repair of the same dataset with unordered_set and btree_set.
To fix, switch to absl btree_set container.
Fixes#6190
We had some tests for the number type in Alternator and how it can be
stored, retrieved, calculated and sorted, but only had rudementary tests
for the allowed magnitude and precision of numbers.
This patch creates a new test file, test_number.py, with tests aiming to
check exactly the supported magnitudes and precision of numbers.
These tests verify two things:
1. That Alternator's number type supports the full precision and magnitude
that DynamoDB's number type supports. We don't want to see precision
or magnitude lost when storing and retrieving numbers, or when doing
calculations on them.
2. That Alternator's number type does not have *better* precision or
magnitude than DynamoDB does. If it did, users may be tempted to rely
on that implementation detail.
The three tests of the first type pass; But all four tests of the second
type xfail: Alternator currently stores numbers using big_decimal which
has unlimited precision and almost-unlimited magnitude, and is not yet
limited by the precision and magnitude allowed by DynamoDB.
This is a known issue - Refs #6794 - and these four new xfailing tests
will can be used to reproduce that issue.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200707204824.504877-1-nyh@scylladb.com>
While building unified-deb we first use scylla/reloc/build_deb.sh to create the scylla core package, and after that scylla/reloc/python3/build_deb.sh to create python3.
On 058da69#diff-4a42abbd0ed654a1257c623716804c82 a new rm -rf command was added.
It causes python3 process to erase Scylla-core process.
Set python3 to erase its own dir scylla-python3-package only.
In some cases, tracking the state of processed rows inside `transformer`
is not needd at all. We don't need to do it if either:
- Preimage and postimage are disabled for the table,
- Only preimage is enabled and we are processing the last timestamp.
This commit disables updating the state in the cases listed above.
Moves responsibility for generating pre/postimage rows from the
"process_change" method to "produce_preimage" and "produce_postimage".
This commit actually affects the contents of generated CDC log
mutations.
Added a unit test that verifies more complicated cases with CQL BATCH.
Introduces new methods to the change_processor interface that will cause
it to produce pre/postimage rows for requested clustering key, or for
static row.
Introduces logic in split.cc responsible for calling pre/postimage
methods of the change_processor interface. This does not have any effect
on generated CDC log mutations yet, because the transformer class has
empty implementations in place of those methods.
Instead of looking up preimage data directly from the raw preimage query
results, use the raw results to populate current partition state data,
and read directly from the current partition state.
The function is no longer used in log.cc, so instead it is moved to
split.cc.
Removed declaration of the function from the log.hh header, because it
is not used elsewhere - apart from testing code, but it already
declared find_timestamp in the cdc_test.cc file.
This allows for a more refined use of the transformer by the
for_each_change function (now named "process_changes_with_splitting).
The change_processor interface exposes two methods so far:
begin_timestamp, and process_change (previously named "transform").
By separating those two and exposing them, process_changes_with\
_splitting can cause the transformer to generate less CDC log mutations
- only one for each timestamp in the batch.
Adds a `begin_timestamp` method which tells the `transformer` to start
using the following timestamp and timeuuid when generating new log row
mutations.
Moves tracking of the "touched parts" statistics inside the transformer
class.
This commit is the first of multiple commits in this series which move
parts of the state used in CDC log row generation inside the
`transformer` class. There is a lot of state being passed to
`transformer` each time its methods are called, which could be as well
tracked by the `transformer` itself. This will result in a nicer
interface and will allow us to generate less CDC log mutations which
give the same result.
"
For collections and UDTs the `MIN()` and `MAX()` functions are
generated on the fly. Until now they worked by comparing just the
byte representations of their arguments.
This patch employs specific per-type comparators to provide semantically
sensible, dynamically created aggregates.
Fixes#6768
"
* jul-stas-6768-use-type-comparators-for-minmax:
tests: Test min/max on set
aggregate_fcts: Use per-type comparators for dynamic types
Expected behavior is the lexicographical comparison of sets
(element by element), so this test was failing when raw byte
representations were compared.
For collections and UDTs the `MIN()` and `MAX()` functions are
generated on the fly. Until now they worked by comparing just the
byte representations of arguments.
This patch uses specific per-type comparators to provide semantically
sensible, dynamically created aggregates.
Fixes#6768
* seastar dbecfff5a4...cbf88f59f2 (14):
> future: mark detach_promise noexcept
> net/tls: wait for flush() in shutdown
> httpd: Use handle_exception instead of then_wrapped
> httpd: Use std::unique_ptr instead of a raw pointer
> with_lock: Handle mutable lambdas
> Merge "Make the coroutine implementation a bit more like seastar::thread" from Rafael
> tests: Fix perf fair queue stuck
> util/backtrace: don't get hash from moved simple_backtrace in tasktrace ctor
> scheduling: Allow scheduling_group_get_specific(key) to access elements with queue_is_initialized = false
> prometheus: be compatible with protobuf < 3.4.0
> Merge "fix with_lock error handling" from Benny
> Merge "Simplify the logic for detecting broken promises" from Rafael
> Merge "make scheduling_group functions noexcept" from Benny
> Merge "io_queue: Fixes and reworks for shared fair-queue" from Pavel E
"
This is the first stage of replacing the existing restrictions code with a new representation. It adds a new class `expression` to replace the existing class `restriction`. Lots of the old code is deleted, though not all -- that will come in subsequent stages.
Tests: unit (dev, debug restrictions_test), dtest (next-gating)
"
* dekimir-restrictions-rewrite:
cql3/restrictions: Drop dead code
cql3/restrictions: Use free functions instead of methods
cql3/restrictions: Create expression objects
cql3/restrictions: Add free functions over new classes
cql3/restrictions: Add new representation
Delete unused parts of the old restrictions representation:
- drop all methods, members, and types from class restriction, but
keep the class itself: it's the return type of
relation::to_restriction, which we're keeping intact for now
- drop all subclasses of single_column_restriction and
token_restriction, but keep multi_column_restriction subclasses for
their bounds_ranges method
Keep the restrictions (plural) class, because statement_restrictions
still keeps partition/clustering/other columns in separate
collections.
Move the restriction::merge_with method to primary_key_restrictions,
where it's still being used.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Instead of `restriction` class methods, use the new free functions.
Specific replacement actions are listed below.
Note that class `restrictions` (plural) remains intact -- both its
methods and its type hierarchy remain intact for now.
Ensure full test coverage of the replacement code with new file
test/boost/restrictions_test.cc and some extra testcases in
test/cql/*.
Drop some existing tests because they codify buggy behaviour
(reference #6369, #6382). Drop others because they forbid relation
combinations that are now allowed (eg, mixing equality and
inequality, comparing to NULL, etc.).
Here are some specific categories of what was replaced:
- restriction::is_foo predicates are replaced by using the free
function find_if; sometimes it is used transitively (see, eg,
has_slice)
- restriction::is_multi_column is replaced by dynamic casts (recall
that the `restrictions` class hierarchy still exists)
- utility methods is_satisfied_by, is_supported_by, to_string, and
uses_function are replaced by eponymous free functions; note that
restrictions::uses_function still exists
- restriction::apply_to is replaced by free function
replace_column_def
- when checking infinite_bound_range_deletions, the has_bound is
replaced by local free function bounded_ck
- restriction::bounds and restriction::value are replaced by the more
general free function possible_lhs_values
- using free functions allows us to simplify the
multi_column_restriction and token_restriction hierarchies; their
methods merge_with and uses_function became identical in all
subclasses, so they were moved to the base class
- single_column_primary_key_restrictions<clustering_key>::needs_filtering
was changed to reuse num_prefix_columns_that_need_not_be_filtered,
which uses free functions
Fixes#5799.
Fixes#6369.
Fixes#6371.
Fixes#6372.
Fixes#6382.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
"
When commitlog is recreated in hints manager, only shutdown() method is
called, but not release(). Because of that, some internal commitlog
objects (`segment_manager` and `segment`s) may be left pointing to each
other through shared_ptr reference cycles, which may result in memory
leak when the parent commitlog object is destroyed.
This PR prevents memory leaks that may happen this way by calling
release() after shutdown() from the hints manager.
Fixes: #6409, Fixes#6776
"
* piodul-fix-commitlog-memory-leak-in-hinted-handoff:
hinted handoff: disable warnings about segments left on disk
hinted handoff: release memory on commitlog termination
When a mutation is written to the commitlog, a rp_handle object is
returned which keeps a reference to commitlog segment. A segment is
"dirty" when its reference count is not zero, otherwise it is "clean".
When commitlog object is being destroyed, a warning is being printed
for every dirty segment. On the other hand, clean segments are deleted.
In case of standard mutation writing path, the rp_handle moves
responsibility for releasing the reference to the memtable to which the
mutation is written. When the memtable is flushed to disk, all
references accumulated in the memtable are released. In this context, it
makes sense to warn about dirty segments, because such segments contain
mutations that are not written to sstables, and need to be replayed.
However, hinted handoff uses a different workflow - it recreates a
commitlog object periodically. When a hint is written to commitlog, the
rp_handle reference is not released, so that segments with hints are not
deleted when destroying the commitlog. When commitlog is created again,
we get a list of saved segments with hints that we can try to send at a
later time.
Although this is intended behavior, now that releasing the hints
commitlog is done properly, it causes the mentioned warning to
periodically appear in the logs.
This patch adds a parameter for the commitlog that allows to disable
this warning. It is only used when creating hinted handoff commitlogs.
When commitlog is recreated in hints manager, only shutdown() method is
called, but not release(). Because of that, some internal commitlog
objects (`segment_manager` and `segment`s) may be left pointing to each
other through shared_ptr reference cycles, which may result in memory
leak when the parent commitlog object is destroyed.
This commit prevents memory leaks that may happen this way by calling
release() after shutdown() from the hints manager.
Fixes: #6409, #6776
Merged patch set from Piotr Sarna:
This series addresses issue #6700 again (it was reopened),
by forbidding all non-local schema changes to be performed
from within the database via CQL interface. These changes
are dangerous since they are not directly propagated to other
nodes.
Tests: unit(dev)
Fixes#6700
Piotr Sarna (4):
test: make schema changes in query_processor_test global
cql3: refuse to change schema internally for distributed tables
test: expand testing internal schema changes
cql3: add explanatory comments to execute_internal
cql3/query_processor.hh | 13 ++++++++++++-
cql3/statements/alter_table_statement.cc | 6 ------
cql3/statements/schema_altering_statement.cc | 15 +++++++++++++++
test/boost/cql_query_test.cc | 8 ++++++--
test/boost/query_processor_test.cc | 16 ++++++++--------
5 files changed, 41 insertions(+), 17 deletions(-)
"coredumpctl info" behavior had been changed since systemd-v232, we need to
support both version.
Before systemd-v232, it was simple.
It print 'Coredump' field only when the coredump exists on filesystem.
Otherwise print nothing.
After the change made on systemd-v232, it become more complex.
It always print 'Storage' field even the coredump does not exists.
Not just available/unavailable, it describe more:
- Storage: none
- Storage: journal
- Storage: /path/to/file (inacessible)
- Storage: /path/to/file
To support both of them, we need to detect message version first, then
try to detect coredump path.
Fixes: #6789
reference: 47f5064207
Instead of running shutil.copy() for each *.{service,default},
create symlink for these files.
Python will copy original file when copying debian directory.
Executing internal CQL queries needs to be done with caution,
since they were designed to be used mainly for local tables
and have very specific semantics wrt. propagating
schema changes. A short comment is added in order to prevent
future misuse of this interface.
Changing the schemas via internal calls to CQL is dangerous,
since the changes are not propagated to other nodes. Thus, it should
never be used for regular distributed tables.
The guarding code was already added for ALTER TABLE statement
and it's now expanded to cover all schema altering statements.
Tests: unit(dev)
Fixes#6700
Modified log message in view_builder::calculate_shard_build_step to make it distinct from the one in view_builder::execute, changed their logging level to warning, since we're continuing even if we handle an exception.
Fixes#4600
WHERE clauses with start point above the end point were handled
incorrectly. When the slice bounds are transformed to interval
bounds, the resulting interval is interpreted as wrap-around (because
start > end), so it contains all values above 0 and all values below
0. This is clearly incorrect, as the user's intent was to filter out
all possible values of a.
Fix it by explicitly short-circuiting to false when start > end. Add
a test case.
Fixes#5799.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
from Botond.
Recently it was observed (#6603) that since 4e6400293ea, the staging
reader is reading from a lot of sstables (200+). This consumes a lot of
memory, and after this reaches a certain threshold -- the entire memory
amount of the streaming reader concurrency semaphore -- it can cause a
deadlock within the view update generation. To reduce this memory usage,
we exploit the fact that the staging sstables are usually disjoint, and
use the partitioned sstable set to create the staging reader. This
should ensure that only the minimum number of sstable readers will be
opened at any time.
Refs: #6603Fixes: #6707
Tests: unit(dev)
* 'view-update-generator-use-partitioned-set/v1' of https://github.com/denesb/scylla:
db/view: view_update_generator: use partitioned sstable set
sstables: make_partitioned_sstable_set(): return an sstable_set
And pass it to `make_range_sstable_reader()` when creating the reader,
thus allowing the incremental selector created therein to exploit the
fact that staging sstables are disjoint (in the case of repair and
streaming at least). This should reduce the memory consumption of the
staging reader considerably when reading from a lot of sstables.
Instead of an `std::unique_ptr<sstable_set_impl>`. The latter doesn't
have a publicly available destructor, so it can only be called from
withing `sstables/compaction_strategy.cc` where its definition resides.
Thus it is not really usable as a public function in its current form,
which shows as it has no users either.
This patch makes it usable by returning an `sstable_set`. That is what
potential callers would want anyway. In fact this patch prepares the
ground for the next one, which wishes to use this function for just
that but can't in its current form.
It is not used any more after 75cf1d18b5
(storage_service: Unify handling of replaced node removal from gossip)
in the "Make replacing node take writes" series.
Refs #5482
In case the tokens for the node to be removed from the cluster are
empty, log the application_state of the leaving node to help understand
why the tokens are empty and try to get the tokens from token_metadata.
Refs #6468
"
This mini-series adds protection against reference loops between tasks,
preventing infinite recursion in this case.
It also contains some other improvements, like updating the task
whitelist as well as the task identification mechanism w.r.t. recent
changes in seastar.
It also improves verbose logging, which was found to not work well while
investigating the other issues fixed herein.
"
* 'scylla-gdb.py-scylla-fiber-update/v1' of https://github.com/denesb/scylla:
scylla-gdb.py: scylla_fiber: add protection against reference loops
scylla-gdb.py: scylla_fiber: relax requirement w.r.t. what object qualifies as task
scylla-gdb.py: scylla_fiber: update whitelist
scylla-gdb.py: scylla_fiber: improve verbose log output
Our JSON legacy helper functions for parsing documents to/from
string maps are indirectly tested by several unit tests, e.g.
caching_options_test.cc. They however lacked one corner case
detected only by dtest - parsing an empty map from a null JSON document.
This case is hereby added in order to prevent future regressions.
Message-Id: <df8243bd083b2ba198df665aeb944c8710834736.1594020411.git.sarna@scylladb.com>
"
Rather than relying on a gate to serialize seek's
background work with close(), change seek() to return a
future<> and wait on it.
Also, now random_access_reader read_exactly(), seek(), and close()
are made noexcept. This will be followed up by making
sstable parse methods noexcept.
Test: unit(dev)
"
* tag 'random_access_reader-v4' of github.com:bhalevy/scylla:
sstables: random_access_reader: make methods noexcept
sstables: random_access_reader: futurize seek
sstables: random_access_reader: unify input stream close code
sstables: random_access_reader: let file_random_access_reader set the input stream
sstables: random_access_reader: move functions out of line
Gossip on_change callback calls storage_service::excise which calls
remove_bootstrap_tokens to remove the tokens of the leaving node from
bootstrap tokens. If empty tokens, e.g., due to gossip propagation issue
as we saw in https://github.com/scylladb/scylla/issues/6468, are passed
to remove_bootstrap_tokens, it will throw. Since the on_change callback
is marked as noexcept, such throw will cause the node to terminate which
is an overkill.
To avoid such error causing the whole cluster to down in worse cases,
just log the tokens are empty passed to remove_bootstrap_tokens.
Refs #6468
Instead, label the mapped volume by passing `:z` options to `-v`
argument, like we do for other mapped volumes in the `dbuild` script.
Passing the `--privileged` flag doesn't work after the most recent
Fedora update and anyway, using `:z` is the proper way to make sure the
mounted volume is accessible. Historically it was needed to be able to
open cores as well, but since 5b08e91bd this is not necessary as the
container is created with SYS_PTRACE capability.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200703072703.10355-1-bdenes@scylladb.com>
handle all exceptions in read_exactly, seek, and close
and specify them as noexcept.
Also, specify eof() as noexcept as it trivially is.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And adjust its callers to wait on the returned future.
With this, there is no need for a gate to serialize close()
with the background work seek() used to leave behind.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Define a close_if_needed() helper function, to be called
from seek() and close().
A future patch will call it with a possibly disengaged
`_in` so it will close it only if it was engaged.
close_if_needed() captures the input stream unique ptr
so it will remain valid throughout close.
This was missing from close().
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Allow file_random_access_reader constructor to set the
input stream to prepare for futurizing seek() by adding
a protected set() method.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
"
Follow scylla-server package changes, simplified .rpm/.deb build process which merge build scripts into single script.
"
* syuu1228-python3_simplified_pkg_scripts:
python3: simplified .deb build process
python3: simplified .rpm build process
"
This series converts an API to use std::string_view and then converts
a few sstring variables to be constexpr std::string_view. This has the
advantage that a constexpr variables cannot be part of any
initialization order problem.
"
* 'espindola/convert-to-constexpr' of https://github.com/espindola/scylla:
auth: Convert sstring variables in common.hh to constexpr std::string_view
auth: Convert sstring variables in default_authorizer to constexpr std::string_view
cql_test_env: Make ks_name a constexpr std::string_view
class_registry: Use std::string_view in (un)?qualified_name
We could hit "cannot serialize '_io.BufferedReader' object" when request get 404 error from the server
Now you will get legit error message in the case.
Fixes#6690
Print error message and exit with non-zero status by following condition:
- coredumpctl says the coredump file is inaccessible
- failed to detect coredump file path from 'coredumpctl info <pid>'
- deleting coredump file failed because the file is missing
Fixes#6654
Sync version number fixup from main package, contains #6546 and #6752 fixes.
Note that scylla-python3 likely does not affect this versioning issue,
since it uses python3 version, which normally does not contain 'rcX'.
This converts the following variables:
DEFAULT_SUPERUSER_NAME AUTH_KS USERS_CF AUTH_PACKAGE_NAME
Since they are now constexpr they will not be part of any
initialization order problems.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This converts the following variables:
ROLE_NAME RESOURCE_NAME PERMISSIONS_NAME PERMISSIONS_CF
Since they are now constexpr they will not be part of any
initialization order problems.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Merged patch series by Piotr Sarna:
The alternator project was in need of a more optimized
JSON library, which resulted in creating "rjson" helper functions.
Scylla generally used libjsoncpp for its JSON handling, but in order
to reduce the dependency hell, the usage is now migrated
to rjson, which is faster and offers the same functionality.
The original plan was to be able to drop the dependency
on libjsoncpp-lib altogether and remove it from install-dependencies.sh,
but one last usage of it remains in our test suite,
namely cql_repl. The tool compares its output JSON textually,
so it depends on how a library presents JSON - what are the delimeters,
indentation, etc. It's possible to provide a layer of translation
to force rjson to print in an identical format, but the other issue
is that libjsoncpp keeps subobjects sorted by their name,
while rjson uses an unordered structure.
There are two possible solutions for the last remaining usage
of libjsoncpp:
1. change our test suite to compare JSON documents with a JSON parser,
so that we don't rely on internal library details
2. provide a layer of translation which forces rjson to print
its objects in a format idential to libjsoncpp.
(1.) would be preferred, since now we're also vulnerable for changes
inside libjsoncpp itself - if they change anything in their output
format, tests would start failing. The issue is not critical however,
so it's left for later.
Tests: unit(dev), manual(json_test),
dtest(partitioner_tests.TestPartitioner.murmur3_partitioner_test)
Piotr Sarna (8):
alternator,utils: move rjson.hh to utils/
alternator: remove ambiguous string overloads in rjson
rjson: add parse_to_map helper function
rjson: add from_string_map function
rjson: add non-throwing parsing
rjson: move quote_json_string to rjson
treewide: replace libjsoncpp usage with rjson
configure: drop json.cc and json.hh helpers
alternator/base64.hh | 2 +-
alternator/conditions.cc | 2 +-
alternator/executor.hh | 2 +-
alternator/expressions.hh | 2 +-
alternator/expressions_types.hh | 2 +-
alternator/rmw_operation.hh | 2 +-
alternator/serialization.cc | 2 +-
alternator/serialization.hh | 2 +-
alternator/server.cc | 2 +-
caching_options.hh | 9 +-
cdc/log.cc | 4 +-
column_computation.hh | 5 +-
configure.py | 3 +-
cql3/functions/functions.cc | 4 +-
cql3/statements/update_statement.cc | 24 ++--
cql3/type_json.cc | 212 ++++++++++++++++++----------
cql3/type_json.hh | 7 +-
db/legacy_schema_migrator.cc | 12 +-
db/schema_tables.cc | 1 -
flat_mutation_reader.cc | 1 +
index/secondary_index.cc | 80 +++++------
json.cc | 80 -----------
json.hh | 113 ---------------
schema.cc | 25 ++--
test/boost/cql_query_test.cc | 9 +-
test/manual/json_test.cc | 4 +-
test/tools/cql_repl.cc | 1 +
{alternator => utils}/rjson.cc | 75 +++++++++-
{alternator => utils}/rjson.hh | 40 +++++-
29 files changed, 344 insertions(+), 383 deletions(-)
delete mode 100644 json.cc
delete mode 100644 json.hh
rename {alternator => utils}/rjson.cc (86%)
rename {alternator => utils}/rjson.hh (81%)
In order to eventually switch to a single JSON library,
most of the libjsoncpp usage is dropped in favor of rjson.
Unfortunately, one usage still remains:
test/utils/test_repl utility heavily depends on the *exact textual*
format of its output JSON files, so replacing a library results
in all tests failing because of differences in formatting.
It is possible to force rjson to print its documents in the exact
matching format, but that's left for later, since the issue is not
critical. It would be nice though if our test suite compared
JSON documents with a real JSON parser, since there are more
differences - e.g. libjsoncpp keeps children of the object
sorted, while rapidjson uses an unordered data structure.
This change should cause no change in semantics, it strives
just to replace all usage of libjsoncpp with rjson.
Existing infrastructure relies on being able to parse a JSON string
straight into a map of strings. In order to make rjson a drop-in
replacement(tm) for libjsoncpp, a similar helper function is provided.
It's redundant to provide function overloads for both string_view
and const string&, since both of them can be implicitly created from
const char*. Thus, only string_view overloads are kept.
Example code which was ambiguous before the patch, but compiles fine
after it:
rjson::from_string("hello");
Without the patch, one had to explicitly state the type, e.g.:
rjson::from_string(std::string_view("hello"));
which is excessive.
We currently does not able to apply version number fixup for .orig.tar.gz file,
even we applied correct fixup on debian/changelog, becuase it just reading
SCYLLA-VERSION-FILE.
We should parse debian/{changelog,control} instead.
Fixes#6736
Thrift 0.12 includes a change [1] that avoids writing the generated output
if it has not changed. As a result, if you touch cassandra.thrift
(but not change it), the generated files will not update, and as
a result ninja will try to rebuild them every time. The compilation
of thrift files will be fast due to ccache, but still we will re-link
everything.
This touching of cassandra.thrift can happen naturally when switching
to a different git branch and then switching back. The net result
is that cassandra.thrift's contents has not changed, but its timestamp
has.
Fix by adding the "restat" option to the thrift rule. This instructs
ninja to check of the output has changed as expected or not, and to
avoid unneeded rebuilds if it has not.
[1] https://issues.apache.org/jira/browse/THRIFT-4532
It looks like an order version of my patch series was merged. The only
difference is that the new one had more tests. This patch adds the
missing ones.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200630141150.1286893-1-espindola@scylladb.com>
Remember all previously visited tasks and stop if one of them is seen
again. The walk algorithm is converted from recursive to iterative to
facilitate this.
Don't require that the object is located at the start of the allocation
block. Some tasks, like `seastar::internal::when_all_state_component`
might not.
Gdb doesn't seem to handle multiple calls to `gdb.write()` writing to
the same line well, the content of some of these calls just disappears.
So make sure each log message is a separate line and use indentation
instead to illustrate references between messages.
In commit 12d929a5ae (repair: Add table_id
to row_level_repair), a call to find_column_family() was added in
repair_range. In case of the table is dropped, it will fail the
repair_range which in turn fails the bootstrap operation.
Tests: update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_add_new_node_while_schema_changes_test
Fixes: #5942
"
compaction_manager: Avoid stall in perform_cleanup
The following stall was seen during a cleanup operation:
scylla: Reactor stalled for 16262 ms on shard 4.
| std::_MakeUniq<locator::tokens_iterator_impl>::__single_object std::make_unique<locator::tokens_iterator_impl, locator::tokens_iterator_impl&>(locator::tokens_iterator_impl&) at /usr/include/fmt/format.h:1158
| (inlined by) locator::token_metadata::tokens_iterator::tokens_iterator(locator::token_metadata::tokens_iterator const&) at ./locator/token_metadata.cc:1602
| locator::simple_strategy::calculate_natural_endpoints(dht::token const&, locator::token_metadata&) const at simple_strategy.cc:?
| (inlined by) locator::simple_strategy::calculate_natural_endpoints(dht::token const&, locator::token_metadata&) const at ./locator/simple_strategy.cc:56
| locator::abstract_replication_strategy::get_ranges(gms::inet_address, locator::token_metadata&) const at /usr/include/fmt/format.h:1158
| locator::abstract_replication_strategy::get_ranges(gms::inet_address) const at /usr/include/fmt/format.h:1158
| service::storage_service::get_ranges_for_endpoint(seastar::basic_sstring<char, unsigned int, 15u, true> const&, gms::inet_address const&) const at /usr/include/fmt/format.h:1158
| service::storage_service::get_local_ranges(seastar::basic_sstring<char, unsigned int, 15u, true> const&) const at /usr/include/fmt/format.h:1158
| (inlined by) operator() at ./sstables/compaction_manager.cc:691
| (inlined by) _M_invoke at /usr/include/c++/9/bits/std_function.h:286
| std::function<std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable> > > (table const&)>::operator()(table const&) const at /usr/include/fmt/format.h:1158
| (inlined by) compaction_manager::rewrite_sstables(table*, sstables::compaction_options, std::function<std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable> > > (table const&)>) at ./sstables/compaction_manager.cc:604
| compaction_manager::perform_cleanup(table*) at /usr/include/fmt/format.h:1158
To fix, we furturize the function to get sstables. If get_local_ranges()
is called inside a thread, get_local_ranges will yield automatically.
Fixes#6662
"
* asias-compaction_fix_stall_in_perform_cleanup:
compaction_manager: Avoid stall in perform_cleanup
compaction_manager: Return exception future in perform_cleanup
abstract_replication_strategy: Add get_ranges_in_thread
"
Currently, the fast forwarding implementation of the shard reader is
broken in some read-ahead related corner cases, namely:
* If the reader was not created yet, but there is an ongoing read-ahead
(which is going to create it), the function bails out. This will
result in this shard reader not being fast-forwarded to the new range
at all.
* If the reader was already created and there is an ongoing read-ahead,
the function will wait for this to complete, then fast-forward the
reader, as it should. However, the buffer is cleared *before* the
read-ahead is waited for. So if the read-ahead brings in new data,
this will land in the buffer. This data will be outside of the
fast-forwarded-to range and worse, as we just cleared the buffer, it
might violate mutation fragment stream monotonicity requirements.
This series fixes these two bugs and adds a unit test which reproduces
both of them.
There are no known field issues related to these bugs. Only row-level
repair ever fast-forwards the multishard reader, but it only uses it in
heterogenous clusters. Even so, in theory none of these bugs affect
repair as it doesn't ever fast-forward the multishard reader before all
shards arrive at EOS.
The bugs were found while auditing the code, looking for the possible
cause of #6613.
Fixes: #6715
Tests: unit(dev)
"
* 'multishard-combining-reader-fast-forward-to-fixes/v1.1' of https://github.com/denesb/scylla:
test/boost/mutation_reader_test: multishard reader: add tests for fast-forwarding with read-ahead
test/boost/mutation_reader_test: extract multishard read-ahead test setup
test/boost/mutation_reader_test: puppet_reader: fast-forward-support
mutation_reader_test: puppet_reader: make interface more predictable
dht::sharder: add virtual destructor
mutation_reader: shard_reader: fix fast-forwarding with read-ahead
Testing the multishard reader's various read-ahead related corner cases
requires a non-trivial setup. Currently there is just one such test,
but we plan to add more so in this patch we extract this setup code to a
free function to allow reuse across multiple tests.
A fast-forwarded puppet reader goes immediately to EOS. A counter is
added to the remote control to allow tests to check which readers were
actually fast forwarded.
Currently the puppet reader will do an automatic (half) buffer-fill in
the constructor. This makes it very hard to reason about when and how
the action that was passed to it will be executed. Refactor it to take a
list of actions and only execute those, no hidden buffer-fill anymore.
No better proof is needed for this than the fact that the test which is
supposed to test the multishard reader being destroyed with a pending
read-ahead was silently broken (not testing what it should).
This patch fixes this test too.
Also fixed in this patch is the `pending` and `destroyed` fields of the
remote control, tests can now rely on these to be correct and add
additional checkpoints to ensure the test is indeed doing what it was
intended to do.
The following stall was seen during a cleanup operation:
scylla: Reactor stalled for 16262 ms on shard 4.
| std::_MakeUniq<locator::tokens_iterator_impl>::__single_object std::make_unique<locator::tokens_iterator_impl, locator::tokens_iterator_impl&>(locator::tokens_iterator_impl&) at /usr/include/fmt/format.h:1158
| (inlined by) locator::token_metadata::tokens_iterator::tokens_iterator(locator::token_metadata::tokens_iterator const&) at ./locator/token_metadata.cc:1602
| locator::simple_strategy::calculate_natural_endpoints(dht::token const&, locator::token_metadata&) const at simple_strategy.cc:?
| (inlined by) locator::simple_strategy::calculate_natural_endpoints(dht::token const&, locator::token_metadata&) const at ./locator/simple_strategy.cc:56
| locator::abstract_replication_strategy::get_ranges(gms::inet_address, locator::token_metadata&) const at /usr/include/fmt/format.h:1158
| locator::abstract_replication_strategy::get_ranges(gms::inet_address) const at /usr/include/fmt/format.h:1158
| service::storage_service::get_ranges_for_endpoint(seastar::basic_sstring<char, unsigned int, 15u, true> const&, gms::inet_address const&) const at /usr/include/fmt/format.h:1158
| service::storage_service::get_local_ranges(seastar::basic_sstring<char, unsigned int, 15u, true> const&) const at /usr/include/fmt/format.h:1158
| (inlined by) operator() at ./sstables/compaction_manager.cc:691
| (inlined by) _M_invoke at /usr/include/c++/9/bits/std_function.h:286
| std::function<std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable> > > (table const&)>::operator()(table const&) const at /usr/include/fmt/format.h:1158
| (inlined by) compaction_manager::rewrite_sstables(table*, sstables::compaction_options, std::function<std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable> > > (table const&)>) at ./sstables/compaction_manager.cc:604
| compaction_manager::perform_cleanup(table*) at /usr/include/fmt/format.h:1158
To fix, we furturize the function to get local ranges and sstables.
In addition, this patch removes the dependency to global storage_service object.
Fixes#6662
The current `fast_forward_to(const dht::partition_range&)`
implementation has two problems:
* If the reader was not created yet, but there is an ongoing read-ahead
(which is going to create it), the function bails out. This will
result in this shard reader not being fast-forwarded to the new range
at all.
* If the reader was already created and there is an ongoing read-ahead,
the function will wait for this to complete, then fast-forward the
reader, as it should. However, the buffer is cleared *before* the
read-ahead is waited for. So if the read-ahead brings in new data,
this will land in the buffer. This data will be outside of the
fast-forwarded-to range and worse, as we just cleared the buffer, it
might violate mutation fragment stream monotonicity requirements.
This patch fixes both of these bugs. Targeted reproducer unit tests are
coming in the next patches.
When we started to porting bash script to python script, we are not able to use
subprocess.run() since EPEL only provides python 3.4, but now we have
relocatable python, so we can switch to it.
* seastar 5db34ea8d3...dbecfff5a4 (3):
> sharded: Do not hang on never set freed promise
Fixes#6606.
> foreign_ptr: make constructors and methods conditionally noexcept
> foreign_ptr: specify methods as noexcept
Currently the section on "Debugging coredumps" only briefly mentions
relocatable binaries, then starts with an extensive subsection on how to
open cores generated by non-relocatable binaries. There is a subsection
about relocatable binaries, but it just contains some out-of-date
workaround without any context.
In this patch we completely replace this outdated and not very useful
subsection on relocatable binaries, with a much more extensive one,
documenting step-by-step a procedure that is known to work. Also, this
subsection is moved above the non-relocatable one. All our current
releases except for 2019.1 use relocatable binaries, so the subsection
about these should be the more prominent one.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200630145655.159926-1-bdenes@scylladb.com>
* seastar 11e86172ba...5db34ea8d3 (7):
> scollectd: Avoid a deprecated warning
> prometheus: Avoid protobuf deprecated warning
> Merge "Avoid abandoned futures in tests" from Rafael
> Merge "make circular_buffer methods noexcept" from Benny
> futures_test: Wait for future in test
> iotune: Report disk IOPS instead of kernel IOPS
> io_tester: Ability to add dsync option to open_file_dma
needs_cleanup() returns true if a sstable needs cleanup.
Turns out it's very slow because it iterates through all the local
ranges for all sstables in the set, making its complexity:
O(num_sstables * local_ranges)
We can optimize it by taking into account that abstract_replication_strategy
documents that get_ranges() will return a list of ranges that is sorted
and non-overlapping. Compaction for cleanup already takes advantage of that
when checking if a given partition can be actually purged.
So needs_cleanup() can be optimized into O(num_sstables * log(local_ranges)).
With num_sstables=1000, RF=3, then local_ranges=256(num_tokens)*3, it means
the max # of checks performed will go from 768000 to ~9584.
Fixes#6730.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200629171355.45118-2-raphaelsc@scylladb.com>
"
Recent branch introduced uncaught by regular snapshot tests issue
with snapshots listing via API, this set fixes it (patch #1),
cleans the indentation after the fix (#2) and polishes the way
stuff is captured nearby (#3)
Tests: dtest(nodetool_additional_tests)
"
* 'br-fix-snap-get-details' of https://github.com/xemul/scylla:
api: Remove excessive capture
api: Fix indentation after previous patch
api: Fix wrongly captured map of snapshots
Currently fmt::print is used to print an error message
if (broadcast_address != listen && seeds.count(listen))
and the logger should be used instead.
While at it, the information printed in this message is valueable
also in the error-free case, so this change logs it at `info`
level and then logs an error without repeating the said info.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Test: bootstrap_test.py:TestBootstrap.start_stop_test_node(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200630083826.153326-1-bhalevy@scylladb.com>
"
These patches removes the last few uses of variadic futures in scylla.
"
* 'espindola/no-variadic-future' of https://github.com/espindola/scylla:
row level repair: Don't return a variadic future from get_sink_source
row level repair: Don't return a variadic future from read_rows_from_disk
messaging_service: Don't return variadic futures from make_sink_and_source_for_*
cql3: Don't use variadic futures in select_statement
"
Offstrategy work, on boot and refresh, guarantees that a shared SSTable
will not reach the table whatsoever. We have lots of extra code in
table to make it able to live with those shared SSTables.
Now we can fortunately get rid of all that code.
tests: mode(dev).
also manually tested it by triggering resharding both on boot/refresh.
"
* 'cleanup_post_offstrategy_v2' of https://github.com/raphaelsc/scylla:
distributed_loader: kill unused invoke_shards_with_ptr()
sstables:: kill unused sstables::sstable_open_info
sstables: kill unused sstable::load_shared_components()
distributed_loader: remove declaration of inexistent do_populate_column_family()
table: simplify table::discard_sstables()
table: simplify add_sstable()
table: simplify update_stats_for_new_sstable()
table: remove unused open_sstable function
distributed_loader: remove unused code
table: no longer keep track of sstables that need resharding
table: Remove unused functions no longer used by resharding
table: remove sstable::shared() condition from backlog tracker add/remove functions
table: No longer accept a shared SSTable
This patch set adds a few new features in order to fix issue
The list of changes is briefly as follows:
- Add a new `LWT` flag to `cql3::prepared_metadata`,
which allows clients to clearly distinguish betwen lwt and
non-lwt statements without need to execute some custom parsing
logic (e.g. parsing the prepared query with regular expressions),
which is obviously quite fragile.
- Introduce the negotiation procedure for cql protocol extensions.
This is done via `cql_protocol_extension` enum and is expected
to have an appropriate mirroring implementation on the client
driver side in order to work properly.
- Implmenent a `LWT_ADD_METADATA_MARK` cql feature on top of the
aforementioned algorithm to make the feature negotiable and use
it conditionally (iff both server and client agrees with each
other on the set of cql extensions).
The feature is meant to be further utilized by client drivers
to use primary replicas consistently when dealing with conditional
statements.
* git@github.com:ManManson/scylla feature/lwt_prepared_meta_flag_2:
lwt: introduce "LWT" flag in prepared statement metadata
transport: introduce `cql_protocol_extension` enum and cql protocol extensions negotiation
Currently we we mistakenly made two different way to detect distribution,
directly reading /etc/os-release and use distro package.
distro package provides well abstracted APIs and still have full access to
os-release informations, we should switch to it.
Fixes#6691
After commit 7d86a3b208 (storage_service:
Make replacing node take writes), during replace operation, tokens in
_token_metadata for node being replaced are updated only after the replace
operation is finished. As a result, in range_streamer::add_ranges, the
node being replaced will be considered as a source to stream data from.
Before commit 7d86a3b208, the node being
replaced will not be considered as a source node because it is already
replaced by the replacing node before the replace operation is finished.
This is the reason why it works in the past.
To fix, filter out the node being replaced as a source node explicitly.
Tests: replace_first_boot_test and replace_stopped_node_test
Backports: 4.1
Fixes: #6728
get_shards_for_this_sstable() can be called inside table::add_sstable()
because the shards for a sstable is precomputed and so completely
exception safe. We want a central point for checking that table will
no longer added shared SSTables to its sstable set.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
no longer need to conditionally track the SSTable metadata,
as table will no longer accept shared SSTables.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Now that table will no longer accept shared SSTables, it no longer
needs to keep track of them.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Now that table no longer accept shared SSTables, those two functions can
be simplified by removing the shared condition.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
With off-strategy work on reshard on boot and refresh, table no
longer needs to work with Shared SSTables. That will unlock
a host of cleanups.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The results of get_snapshot_details() is saved in do_with, then is
captured on the json callback by reference, then the do_with's
future returns, so by the time callback is called the map is already
free and empty.
Fix by capturing the result directly on the callback.
Fixes recently merged b6086526.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This patch adds a new `LWT` flag to `cql3::prepared_metadata`.
That allows clients to clearly distinguish betwen lwt and
non-lwt statements without need to execute some custom parsing
logic (e.g. parsing the prepared query with regular expressions),
which is obviously quite fragile.
The feature is meant to be further utilized by client drivers
to use primary replicas consistently when dealing with conditional
statements.
Whether to use lwt optimization flag or not is handled by negotiation
procedure between scylla server and client library via SUPPORTED/STARTUP
messages (`LWT_ADD_METADATA_MARK` extension).
Tests: unit(dev, debug), manual testing with modified scylla/gocql driver
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
The tests in test_projection_expression.py test that ProjectionExpression
works - including attribute paths - for the GetItem, Query and Scan
operations.
There is a fourth read operation - BatchGetItem, and it supports
ProjectionExpression too. We tested BatchGetItem + ProjectionExpression in
test_batch.py, but this only tests the basic feature, with top-level
attributes, and we were missing a test for nested document paths.
This patch adds such a test. It is still xfailing on Alternator (and passing
on DynamoDB), because attribute paths are still not supported (this is
issue #5024).
Refs #5024.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200629063244.287571-1-nyh@scylladb.com>
This patch adds three more tests for the ProjectionExpression parameter
of GetItem. They are tests for nested document paths like a.b[2].c.
We don't support nested paths in Alternator yet (this is issue #5024),
so the new tests all xfail (and pass on DynamoDB).
We already had similar tests for UpdateExpression, which also needs to
support document paths, but the tests were missing for ProjectionExpression.
I am planning to start the implementation of document paths with
ProjectionExpression (which is the simplest use of document paths), so I
want the tests for this expression to be as complete as possible.
Refs #5024.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200628213208.275050-1-nyh@scylladb.com>
"
The snapshotting code is already well isolated from the rest of
the storage_service, so it's relatively easy to move it into
independent component, thus de-bloating the storage_service.
As a side effect this allows painless removal of calls to global
get_storage_service() from schema::describe code.
Test: unit(debug), dtest.snapshot_test(dev), manual start-stop
"
* 'br-snapshot-controller-4' of https://github.com/xemul/scylla:
snap: Get rid of storage_service reference in schema.cc
main: Stop http server
snapshot: Make check_snapshot_not_exist a method
snapshots: Move ops gate from storage_service
snapshot: Move lock from storage_service
snapshot: Move all code into db::snapshot_ctl class
storage_service: Move all snapshot code into snapshot-ctl.cc
snapshots: Initial skeleton
snapshots: Properly shutdown API endpoints
api: Rewrap set_server_snapshot lambda
We are currently investigating a segmentation fault, which is suspected
to be caused by a leaked pause handle. Although according to the latest
theory the handle leak is not the root cause of the issue, just a
symptom, its better to catch any bugs that would cause a handle leaking
at the act, and not later when some side-effect causes a segfault.
Refs: #6613
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200625153729.522811-1-bdenes@scylladb.com>
"
Before this series scylla would effectively infinite loop when, for
example, casting a decimal with a negative scale to float.
Fixes#6720
"
* 'espindola/fix-decimal-issue' of https://github.com/espindola/scylla:
big_decimal: Add a test for a corner case
big_decimal: Correctly handle negative scales
big_decimal: Add a as_rational member function
big_decimal: Move constructors out of line
As HACKING.md suggests, we now require gcc version >= 10. Set the
minimum at 10.1.1, as that is the first official 10 release:
https://gcc.gnu.org/releases.html
Tests: manually run configure.py and ensure it passes/fails
appropriately.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Now when the snapshot stopping is correctly handled, we may pull the database
reference all the way down to the schema::describe().
One tricky place is in table::napshot() -- the local db reference is pulled
through an smp::submit_to call, but thanks to the shard checks in the place
where it is needed the db is still "local"
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently it's not stopped at all, so calling a REST request shutdown-time
may crash things at random places.
Fixes: #5702
But it's not the end of the story. Since the server stays up while we are
shutting things down, each subsystem should carefully handle the cases when
it's half-down, but a request comes. A better solution is to unregister
rest verbs eventually, but httpd's rules cannot do it now.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
For this de-static run_snapshot_*_operation (because we no longer have
the static global to get the lock from) and make the snapshot_ctl be
peering_sharded_service to call invoke_on.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This includes
- rename namespace in snapshot-ctl.[cc|hh]
- move methods from storage_service to snapshot_ctl
- move snapshot_details struct
- temporarily make storage_service._snapshot_lock and ._snapshot_ops public
- replace two get_local_storage_service() occurrences with this._db
The latter is not 100% clear as the code that does this references "this"
from another shard, but the _db in question is the distributed object, so
they are all the same on all instances.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is plain move, no other modifications are made, even the
"service" namespace is kept, only few broken indentation fixes.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
A placeholder for snapshotting code that will be moved into it
from the storage_service.
Also -- pass it through the API for future use.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now with the seastar httpd routes unset() at hands we
can shut down individual API endpoints. Do this for
snapshot calls, this will make snapshot controller stop
safe.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The lambda calls the core snapshot method deep inside the
json marshalling callback. This will bring problems with
stopping the snapshot controller in the next patches.
To prepare for this -- call the .get_snapshot_details()
first, then keep the result in do_with() context. This
change doesn't affect the issue the lambde in question is
about to solve as the whole result set is anyway kept in
memory while being streamed outside.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Add expression as a member of restriction. Create or update
expression everywhere restrictions are created or updated.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
These new classes will replace the existing restrictions hierarchy.
Instead of having member functions, they expose their data publicly,
for future free functions to operate on.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
This behavior is different from cassandra, but without arithmetic
operations it doesn't seem possible to notice the difference from
CQL. Using avg produces the same results, since we use an initial
value of 0 (scale = 0).
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
A negative scale was being passed an a positive value to
boost::multiprecision::pow, which would never finish.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Before Scylla 3.0, we used to send streaming mutations using
individual RPC requests and flush them together using dedicated
streaming memtables. This mechanism is no longer in use and all
versions that use it have long reached end-of-life.
Remove this code.
This makes the code a bit easier to read as there are no discarded
futures and no references to having to keep a subscription alive,
which we don't with current seastar.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200527013120.179763-1-espindola@scylladb.com>
"
Row level repair, when using a local reader, is prone to deadlocking on
the streaming reader concurrency semaphore. This has been observed to
happen with at least two participating nodes, running more concurrent
repairs than the maximum allowed amount of reads by the concurrency
semaphore. In this situation, it is possible that two repair instances,
competing for the last available permits on both nodes, get a permit on
one of the nodes and get queued on the other one respectively. As
neither will let go of the permit it already acquired, nor give up
waiting on the failed-to-acquired permit, a deadlock happens.
To prevent this, we make the local repair reader evictable. For this we
reuse the already existing evictable reader mechanism of the multishard
combining reader. This patchset refactors this evictable reader
mechanism into a standalone flat mutation reader, then exposes it to the
outside world.
The repair reader is paused after the repair buffer is filled, which is
currently 32MB, so the cost of a possible reader recreation is amortized
over 32MB read.
The repair reader is said to be local, when it can use the shard-local
partitioner. This is the case if the participating nodes are homogenous
(their shard configuration is identical), that is the repair instance
has to read just from one shard. A non-local reader uses the multishard
reader, which already makes its shard readers evictable and hence is not
prone to the deadlock described here.
Fixes: #6272
Tests: unit(dev, release, debug)
"
* 'repair-row-level-evictable-local-reader/v3' of https://github.com/denesb/scylla:
repair: row_level: destroy reader on EOS or error
repair: row_level: use evictable_reader for local reads
mutation_reader: expose evictable_reader
mutation_reader: evictable_reader: add auto_pause flag
mutation_reader: make evictable_reader a flat_mutation_reader
mutation_reader: s/inactive_shard_read/inactive_evictable_reader/
mutation_reader: move inactive_shard_reader code up
mutation_reader: fix indentation
mutation_reader: shard_reader: extract remote_reader as evictable_reader
mutation_reader: reader_lifecycle_policy: make semaphore() available early
The database has a mechanism of performing internal CQL queries,
mainly to edit its own local tables. Unfortunately, it's easy
to use the interface incorrectly - e.g. issuing an `ALTER TABLE`
statement on a non-local table will result in not propagating
the schema change to other nodes, which in turn leads to
inconsistencies. In order to avoid such mistakes (one of them
was a root cause of #6513), when an attempt to alter a distributed
table via a local interface is performed, it results in an error.
Tests: unit(dev)
Fixes#6700
Message-Id: <61be3defb57be79f486e6067ceff4f4c965e34cb.1592990796.git.sarna@scylladb.com>
The function that determines if a level L, where L > 0, is disjoint,
is returning false if level is disjoint.
That's because it incorrectly accounts an overlapping SSTable in
the level as a disjoint SSTable. So we need to inverse the logic.
The side effect is that boot will always try to reshape levels
greater than 0 because reshape procedure incorrectly thinks that
levels are overlapping when they're actually disjoint.
Fixes#6695.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200623180221.229695-1-raphaelsc@scylladb.com>
Currently the message only mentions the endpoint and the error message
returned from the replica. Add the keyspace and table to this message to
provide more context. This should help investigations of such errors
greatly, as in the case of tests where there is usually a single table,
we can already guess what exactly is timing out based on this.
We should add even more context, like the kind of the query (single
partition or range scan) but this information is not readily available
in the surrounding scope so this patch defers it.
Refs: #6548
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200624054647.413256-1-bdenes@scylladb.com>
Manual translation from JSON to string_view is replaced
with rjson::to_string_view helper function. In one place,
a redundant string_view intermediary is removed
in favor of creating the string straight from JSON.
Message-Id: <2aa9d9fedd73f14b7640870d14db4f2f0bd7bd8a.1592936139.git.sarna@scylladb.com>
Updating tags was erroneously done locally, which means that
the schema change was not propagated to other nodes.
The new code announces new schema globally.
Fixes#6513
Branches: 4.0,4.1
Tests: unit(dev)
dtest(alternator_tests.AlternatorTest.test_update_condition_expression_and_write_isolation)
Message-Id: <3a816c4ecc33c03af4f36e51b11f195c231e7ce1.1592935039.git.sarna@scylladb.com>
To avoid having to make it an optional with all the additional checks,
we just replace it with an empty reader instead, this also also achieves
the desired effect of releasing the read permit and all the associated
resources early.
Row level repair, when using a local reader, is prone to deadlocking on
the streaming reader concurrency semaphore. This has been observed to
happen with at least two participating nodes, running more concurrent
repairs than the maximum allowed amount of reads by the concurrency
semaphore. In this situation, it is possible that two repair instances,
competing for the last available permits on both nodes, get a permit on
one of the nodes and get queued on the other one respectively. As
neither will let go of the permit it already acquired, nor give up
waiting on the failed-to-acquired permit, a deadlock happens.
To prevent this, we make the local repair reader evictable. For this we
reuse the newly exposed evictable reader.
The repair reader is paused after the repair buffer is filled, which is
currently 32MB, so the cost of a possible reader recreation is amortized
over 32MB read.
The repair reader is said to be local, when it can use the shard-local
partitioner. This is the case if the participating nodes are homogenous
(their shard configuration is identical), that is the repair instance
has to read just from one shard. A non-local reader uses the multishard
reader, which already makes its shard readers evictable and hence is not
prone to the deadlock described here.
Expose functions for the outside world to create evictable readers. We
expose two functions, which create an evictable reader with
`auto_pause::yes` and `auto_pause::no` respectively. The function
creating the latter also returns a handle in addition to the reader,
which can be used to pause the reader.
Currently the evictable reader unconditionally pauses the underlying
reader after each use (`fill_buffer()` or `fast_forward_to()` call).
This is fine for current users (the multishard reader), but the future
user we are doing all this refactoring for -- repair -- will want to
control when the underlying reader is paused "manually". Both these
behaviours can easily be supported in a single implementation, so we
add an `auto_pause` flag to allow the creator of the evictable reader
to control this.
The `evictable_reader` class is almost a proper flat mutation reader
already, it roughly offers the same interface. This patch makes this
formal: changing the class to inherit from `flat_mutation_reader::impl`,
and implement all virtual methods. This also entails a departure from
using the lifecycle policy to pause/resume and create readers, instead
using more general building blocks like the reader concurrency semaphore
and a mutation source.
Unlike refresh on upload dir, column family population shouldn't mutate
level of SSTables to level 0. Otherwise, LCS will have to regenerate all
levels by rewriting the data multiple times, hurting a lot the write
amplification and consequently the node performance. That's also affecting
the time for a node to boot because reshape may be triggered as a result
of this.
Refs #6695.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200622192502.187532-2-raphaelsc@scylladb.com>
The thrift compiler (since 0.13 at least) complains that
the csharp target is deprecated and recommend replacing it
with netstd. Since we don't use either, humor it.
I suspect that this warning caused some spurious rebuilds,
but have not proven it.
Pager belongs to a different layer than CQL and thus should not be
coupled with CQL stats - if any different frontends want to use paging,
they shouldn't be forced to instantiate CQL stats at all.
Same goes with CQL restrictions, but that will require much bigger
refactoring, so is left for later.
Message-Id: <5585eb470949e3457334ffd6dba80742abf3a631.1592902295.git.sarna@scylladb.com>
In the section explaining how to build a docker image for a self-built
Scylla executable, we have a warning that even if you already built
Scylla, build_reloc.sh will re-run configure.py and rebuild the executable
with slightly different options.
The re-run of configure.py and ninja still happens (see issue #6547) but
we no longer pass *different* options to configure.py, so the rebuild
usually doesn't do anything and finishes in seconds, and the paragraph
warning about the rebuild is no longer relevant.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200621093049.975044-1-nyh@scylladb.com>
* seastar a6c8105443...7664f991b9 (13):
> gate: add try_enter and try_with_gate
> Merge "Manage reference counts in the file API" from Rafael
> cmake: Refactor a bit of duplicated code
> stream: Delete _sub
> future: Add a rethrow_exception to future_state_base
> future: Use a new seastar::nested_exception in finally
> cmake: only apply C++ compile options to C++ language
> testing: Enable fail-on-abandoned-failed-futures by default
> future: Correct a few hypercorrect uses of std::forward
> futures_test: Test using future::then with functions
> Merge "io-queue: A set of cleanups collected so far" from Pavel E
> tmp_file: Replace futurize_apply with futurize_invoke
> future: Replace promise::set_coroutine with forward_state_and_schedule
Contains update to tests from Rafael:
tests: Update for fail-on-abandoned-failed-futures's new default
This depends on the corresponding change in seastar.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Rename `inactive_shard_read` to `inactive_evictable_reader` to reflect
that the fact that the evictable reader is going to be of general use,
not specific to the multishard reader.
We want to make the evictable reader mechanism used in the multishard
reader pipeline available for general (re)use, as a standalone
flat mutation reader implementation. The first step is extracting
`shard_reader::remote_reader` the class implementing this logic into a
top-level class, also renamed to `evictable_reader`.
Currently all reader lifecycle policy implementations assume that
`semaphore()` will only be called after at least one call to
`make_reader()`. This assumption will soon not hold, so make sure
`semaphore()` can be called at any time, including before any calls are
made to `make_reader()`.
On Ubuntu 18.04 and ealier & Deiban 10 and ealier, /usr merge is not done, so
/usr/bin/systemd-escape and /bin/systemd-escape is different place, and we call
/usr/bin but Debian variants tries to install the command in /bin.
Drop full path, just call command name and resolve by default PATH.
Fixes: #6650
LCS reshape job may pick a wrong level because we iterate through
levels from index 1 and stop the iteration as soon as the current
level is NOT disjoint, so it happens that we never reach the upper
levels, meaning the level of the first NOT disjoint level is used,
and not the actual maximum filled level. That's fixed by doing
the iteration in the inverse order.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200618154112.8335-1-raphaelsc@scylladb.com>
Streaming is handled by just once group for CPU scheduling, so
separating it into read and write classes for I/O is artificial, and
inflates the resources we allow for streaming if both reads and writes
happen at the same time.
Merge both classes into one class ("streaming") and adjust callers. The
merged class has 200 shares, so it reduces streaming bandwidth if both
directions are active at the same time (which is rare; I think it only
happens in view building).
Retrying the operation of fetching generations not always makes
sense. In this patch only the lightest exceptions (timeout and
unavailable) trigger retrying, while the heavy, unrecoverable ones
abort the operation and get logged on ERROR level.
Fixes#6557
SSTable upgrade is requiring 2x the space of input SSTables because
we aren't releasing references of the SSTables that were already
upgraded. So if we're upgrading 1TB, it means that up to 2TB may be
required for the upgrade operation to succeed.
That can be fixed by moving all input SSTables when rewrite_sstables()
asks for the set of SSTables to be compacted, so allowing their space
to be released as soon as there is no longer any ref to them.
Spotted while auditting code.
Fixes#6682.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200619205701.92891-1-raphaelsc@scylladb.com>
Now every tests starts by deferring a call to
await_background_jobs. That can be verified with:
$ git grep -B 1 await_background test/boost/sstable_3_x_test.cc | grep THREAD | wc -l
90
$ git grep -A 1 SEASTAR_THREAD_TEST_CASE test/boost/sstable_3_x_test.cc | grep await_background | wc -l
90
Thanks to Raphael Carvalho for noticing it.
Refs #6624
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Reviewed-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200619220048.1091630-1-espindola@scylladb.com>
after e40aa042a7, auto compaction is explicitly disabled on all
tables being populated and only enabled later on in the boot
process. we forgot to update cql_test_env to also reenable
auto compaction, so unit tests based on cql_test_env were not
compacting at all.
database_test, for example, was running out of file descriptors
because the number kept growing unboundly due to lack of compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200618225621.15937-1-raphaelsc@scylladb.com>
The call to `verify_owner_and_mode` from `flush_upload_dir`
fell between the cracks in b34c0c2ff6
(distributed_loader: rework uploading of SSTables).
It causes https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release/528/testReport/nodetool_additional_test/TestNodetool/nodetool_refresh_with_wrong_upload_modes_test/
to fail like this:
```
/Directory cannot be accessed .* write/ not found in 'Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/7351db7cab7bbf907172940d0bbf8b90afde90ba/scylla-tools-java/bin/nodetool -h 127.0.87.1 -p 7187 refresh -- keyspace1 standard1' failed; exit status: 1; stdout: nodetool: Scylla API server HTTP POST to URL '/storage_service/sstables/keyspace1' failed: Failed to load new sstables: std::filesystem::__cxx11::filesystem_error (error system:13, filesystem error: remove failed: Permission denied [/jenkins/workspace/scylla-master/dtest-release/scylla/.dtest/dtest-rqzo7km7/test/node1/data/keyspace1/standard1-8a57a660b29611eabf0c000000000000/upload/mc-3-big-TOC.txt])
```
Reenable it in this patch makes the dtest pass again.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200621140439.85843-1-bhalevy@scylladb.com>
We already have a docker image option to enable alternator on an unencrypted
port, "--alternator-port", but we forgot to also allow the similar option
for enabling alternator on an encrypted (HTTPS) port: "--alternator-https-port"
so this patch adds the missing option, and documents how to use it.
Note that using this option is not enough. When this option is used,
Alternator also requires two files, /etc/scylla/scylla.crt and
/etc/scylla/scylla.key, to be inserted into the image. These files should
contain the SSL certificate, and key, respectively. If these files are
missing, you will get an error in the log about the missing file.
Fixes#6583.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200621125219.12274-1-nyh@scylladb.com>
"
This patchset adds a reshape operation to each compaction strategy;
that is a strategy-specific way of detecting if SSTables are in-strategy
or off-strategy, and in case they are offstrategy moving them to in-strategy.
Often times the number of SSTables in a particular slice of the sstable set
matters for that decision (number of SSTables in the same time window for TWCS,
number of SSTables per tier for STCS, number of L0 SSTables for LCS). We want
to be more lenient for operations that keep the node offline, like reshape at
boot, but more forgiving for operations like upload, which run in maintenance
mode. To accomodate for that the threshold for considering a slice of the SSTable
set offstrategy is passed as a parameter
Once this patchset is applied, the upload directory will reshape the SSTables
before moving them to the main directory (if needed). One side effect of it
is that it is no longer necessary to take locks for the refresh operation nor
disable writes in the table.
With the infrastructure that we have built in the upload directory, we can
apply the same set of steps to populate_column_family. Using the sstable_directory
to scan the files we can reshard and reshape (usually if we resharded a reshape
will be necessary) with the node still offline. This has the benefit of never
adding shared SSTables to the table.
Applying this patchset will unlock a host of cleanups:
- we can get rid of all testing for shared sstables, sstable_need_rewrite, etc.
- we can remove the resharding backlog tracker.
and many others. Most cleanups are deferred for a later patchset, though.
"
* 'reshard-reshape-v4' of github.com:glommer/scylla:
distributed_loader: reshard before the node is made online
distributed_loader: rework uploading of SSTables
sstable_directory: add helper to reshape existing unshared sstables
compaction_strategy: add method to reshape SSTables
compaction: add a new compaction type, Reshape
compaction: add a size and throught pretty printer.
compaction: add default implementation for some pure functions
tests: fix fragile database tests
distributed_loader.cc: add a helper function to extract the highest SSTable version found
distributed_loader.cc : extract highest_generation_seen code
compaction_manager: rename run_resharding_job
distributed_loader: assume populate_column_families is run in shard 0
api: do not allow user to meddle with auto compaction too early
upload: use custom error handler for upload directory
sstable_directory: fix debug message
This patch moves the resharding process to use the new
directory_with_sstables_handler infrastructure. There is no longer
a clear reshard step, and that just becomes a natural part of
populate_column_family.
In main.cc, a couple of changes are necessary to make that happen.
The first one obviously is to stop calling reshard. We also need to
make sure that:
- The compaction manager is started much earlier, so we can register
resharding jobs with it.
- auto compactions are disabled in the populate method, so resharding
doesn't have to fight for bandwidth with auto compactions.
Now that we are resharding through the sstable_directory, the old
resharding code can be deleted. There is also no need to deal with
the resharding backlog either, because the SSTables are not yet
added to the sstable set at this point.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Uploading of SSTables is problematic: for historical reasons it takes a
lock that may have to wait for ongoing compactions to finish, then it
disables writes in the table, and then it goes loading SSTables as if it
knew nothing about them.
With the sstable_directory infrastructure we can do much better:
* we can reshard and reshape the SSTables in place, keeping the number
of SSTables in check. Because this is an background process we can be
fairly aggressive and set the reshape mode to strict.
* we can then move the SSTables directly into the main directory.
Because we know they are few in number we can call the more elegant
add_sstable_and_invalidate_cache instead of the open coding currently
done by load_new_sstables
* we know they are not shared (if they were, we resharded them),
simplifying the load process even further.
The major changes after this patch is applied is that all compactions
(resharding and reshape) needed to make the SSTables in-strategy are
done in the streaming class, which reduces the impact of this operation
on the node. When the SSTables are loaded, subsequent reads will not
suffer as we will not be adding shared SSTables in potential high
numbers, nor will we reshard in the compaction class.
There is also no more need for a lock in the upload process so in the
fast path where users are uploading a set of SSTables from a backup this
should essentially be instantaneous. The lock, as well as the code to
disable and enable table writes is removed.
A future improvement is to bypass the staging directory too, in which
case the reshaping compaction would already generate the view updates.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Before moving SSTables to the main directory, we may need to reshape them
into in-strategy. This patch provides helper code that reshapes the SSTables
that are known to be unshared local in the sstable directory, and updates the
sstable directory with the result.
Rehaping can be made more or less aggressive by passing a reshape mode
(relaxed or strict), which will influence the amount of SSTables reshape
can tolerate to consider a particular slice of the SSTable set
offstrategy.
Because the compaction expects an std::vector everywhere, we changed
our chunked vector for the unshared sstables to a std::vector so we
can more easily pass it around without conversions.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Some SSTable sets are considered to be off-strategy: they are in a shape
that is at best not optimal and at worst adversarial to the current
compaction strategy.
This patch introduces the compaction strategy-specific method
get_reshaping_job(). Given an SSTable set, it returns one compaction
that can be done to bring the table closer to being in-strategy. The
caller can then call this repeatedly until the table is fully
in-strategy.
As an example of how this is supposed to work, consider TWCS: some
SSTables will belong to a single window -> in which case they are
already in-strategy and don't need to be compacted, and others span
multiple windows in which case they are considered off-strategy and
have to be compacted.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
From the point of view of selecting SSTables and its expected output,
Reshaping really is just a normal compaction. However, there are some
key differences that we would like to uphold:
- Reshaping is done separately from the main SSTable set. It can be
done with the node offline, or it can be done in a separate priority
class. Either way, we don't want those SSTables to count towards
backlog. For reads, because the SSTables are not yet registered in
the backlog tracker (if offline or coming from upload), if we were
to deduct compaction charges from it we would go negative. For writes,
we don't want to deal with backlog management here because we will add
the SSTable at once when reshaping is finished.
- We don't need to do early replacements.
- We would like to clearly mark the Reshaping compactions as such in the
logs
For the reasons above, it is nicer to add a new Reshape compaction type,
a subclass of compaction, that upholds such properties.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
This is so we don't always use MB. Sometimes it is best
to report GB, TB, and their equivalent throughput metrics.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
* seastar b515d63735...a6c8105443 (15):
> Merge "Move thread_wake_task out of line" from Rafael
> future: Fix result_of_apply instantiation
> future: Move the function in then/then_wrapped only once
> io-queue: Dont leak desc
> fair-queue: Keep request queues self-consistent
> app: Do not coredump on missing options
> future: promise: mark set_value as noexcept
> future: future_state: mark set as noexcept
> fair_queue_perf: Remove unused captures
> file_io_test: Add missing override
> Merge "tmp_dir: handle remove failure in do_with_thread" from Benny
> api-level: Add missing api_v4 namespace
> future: Fix CanApplyTuple
> http: use logger instead of stderr for erro reporting
> sstring: Generalize make_sstring a bit
There are some functions that are today pure that have an obvious
implementation (for example on_new_partition, do nothing). We'll add
default implementations to the compaction class, which reduces the
boilerplate needed to add a new compaction type.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
This test wants to make sure that an SSTable with generation number 4,
which is incomplete, gets deleted.
While that works today, the way the test verifies that is fragile
because new SSTables can and will be created, especially in the local
directory that sees a lot of activity on startup.
It works if generations don't go that far, but with SMP, even a single
SSTable in the right shard can end up having generation 4. In practice
this isn't an issue today because the code calls
cf.update_sstables_known_generation() as soon as it sees a file, before
deciding whether or not the file has to be deleted. However this
behavior is not guaranteed and is changing.
The best way to fix this would be to check if the file is the same,
including its inode. But given that this is just a unit test (which
is almost always if not always single node), I am just moving to use
the peers table instead. Again, we could have created a user table,
but it's just not worth the hassle.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Using a map reduce in a shared sstable directory, finds the highest
version seen across all shards.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
It will be used to run any custom job where the caller provides a
function. One such example is indeed resharding, but reshaping SSTables
can also fall here.
The semaphore is also renamed, and we'll allow only one custom job at a
time (across all possible types).
We also remove the assumption of the scheduling group. The caller has to
have already placed the code in the correct CPU scheduling group. The
I/O priority class comes from the descriptor.
To make sure that we don't regress, we wrap the entire reshard-at-boot
code in the compaction class. Currently the setup would be done in the
main group, and the actual resharding in the compaction group. Note that
this is temporary, as this code is about to change.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
This is already the case, since main.cc calls it from shard 0 and
relies on it to spread the information to the other shards. We will
turn this branch - which is always taken - into an assert for the
sake of future-proofing and soon add even more code that relies on this
being executed in shard 0.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We are about to use the auto compaction property during the
populate/reshard process. If the user toggles it, the database can be
left in a bad state.
There should be no reason why a user would want to set that up this
early. So we'll disallow it.
To do that property, it is better if the check of whether or not
the storage service is ready to accomodate this request is local
to the storage service itself. We then move the logic of set_tables_autocompaction
from api to the storage service. The API layer now merely translates
the table names and pass it along.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The seastar api v4 changes the return type of when_all_succeed. This
patch adds discard_result when that is best solution to handle the
change.
This doesn't do the actual update to v4 since there are still a few
issues left to fix in seastar. A patch doing just the update will
follow.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200617233150.918110-1-espindola@scylladb.com>
This patch aim to make the implementation and usage of the
approx_exponential_histogram clearer.
The approx_exponential_histogram Uses a combination of Min, Max,
Precision and number of buckets where the user needs to pick 3.
Most of the changes in the patch are about documenting the class and
method, but following the review there are two functionality changes:
1. The user would pick: Min, Max and Precision and the number of buckets
will be calculated from these values.
2. The template restrictions are now state in a requires so voiolation
will be stop at compile time.
When debugging this for first time c412a7a, I thought the problem,
which causes backlog to be negative, was a bug in the implementation of the
formula, but it turns out that the bug is actually in the formula itself.
Not limiting the scope of this bug to STCS because its tracker is inherited
by the trackers of other strategies, meaning they're also affected by this.
The backlog for a SSTable is known to be
Bi = Ei * log(T / Si)
Where T = total Size minus compacted bytes for a table,
Ci = Compacted Bytes for a SSTable,
Si = Size of a SStable
Ei = Ci - Si
The problem was that we were assuming T > Si, but it can happen that T
is lower than Si if the table in question is decreasing in size.
If we rewrite SSTable backlog as
Bi = Ei * log (T) - Ei * log(Si)
It becomes even clearer why T cannot be lower than Si whatsoever,
or the backlog calculation can go wrong because first term becomes
lower than the second.
Fixing the formula consists of changing it to
Bi = Ei * log (T / Ei)
Bi = Ei * log (T) - Ei * log (Si - Ci)
After this change, the backlog still behave in a very similar way
as before, which can be confirmed via this graph:
https://user-images.githubusercontent.com/1409139/79627762-71afdf80-8111-11ea-9ebc-0831c4e3d9c6.pngFixes#6021.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200616174712.16505-1-raphaelsc@scylladb.com>
"
This patch series attempts to decouple package build and release
infrastructure, which is internal to Scylla (the company). The goal of
this series is to make it easy for humans and machines to build the full
Scylla distribution package artifacts, and make it easy to quickly
verify them.
The improvements to build system are done in the following steps.
1. Make scylla.git a super-module, which has git submodules for
scylla-jmx and scylla-tools. A clone of scylla.git is now all that
is needed to access all source code of all the different components
that make up a Scylla distribution, which is a preparational step to
adding "dist" ninja build target. A scripts/sync-submodules.sh helper
script is included, which allows easy updating of the submodules to the
latest head of the respective git repositories.
2. Make builds reproducible by moving the remaining relocatable package
specific build options from reloc/build_reloc.sh to the build system.
After this step, you can build the exact same binaries from the git
repository by using the dbuild version from scylla.git.
3. Add a "dist" target to ninja build, which builds all .rpm and .deb
packages with one command. To build a release, run:
$ ./tools/toolchain/dbuild ./configure.py --mode release
$ ./tools/toolchain/dbuild ninja-build dist
and you will now have .rpm and .deb packages to all the components of
a Scylla distribution.
4. Add a "dist-check" target to ninja build for verification of .rpm and
.deb packages in one command. To verify all the built packages, run:
$ ninja-build dist-check
Please note that you must run this step on the host, because the
target uses Docker under the hood to verify packages by installing
them on different Linux distributions.
Currently only CentOS 7 verification is supported.
All these improvements are done so that backward compatibility is
retained. That is, any existing release infrastructure or other build
scripts are completely unaffacted.
Future improvements to consider:
- Package repository generation: add a "ninja repo" command to generate
a .rpm and .deb repositories, which can be uploaded to a web site.
This makes it possible to build a downloadable Scylla distribution
from scylla.git. The target requires some configuration, which user
has to provide. For example, download URL locations and package
signing keys.
- Amazon Machine Image (AMI) support: add a "ninja ami" command to
simplify the steps needed to generate a Scylla distribution AMI.
- Docker image support: add a "ninja docker" command to simplify the
steps needed to generate a Scylla distribution Docker image.
- Simplify and unify package build: simplify and unify the various shell
scripts needed to build packages in different git repositories. This
step will break backward compatiblity and can be done only after
relevant build scripts and release infrastructure is updated.
"
* 'penberg/packaging/v5' of github.com:penberg/scylla:
docs: Update packaging documentation
build: Add "dist-check" target
scripts/testing: Add "dist-check" for package verification
build: Add "dist" target
reloc: Add '--builddir' option to build_deb.sh
build: Add "-ffile-prefix-map" to cxxflags
docs: Document sync-submodules.sh script in maintainer.md
sync-submodules.sh: Add script for syncing submodules
Add scylla-tools submodule
Add scylla-jmx submodule
Intersection was previously not tested for singular ranges. This
ensures it will always work for singular ranges, too.
Tests: unit(dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
"
The "promoted index" is how the sstable format calls the clustering key index within a given partition.
Large partitions with many rows have it. It's embedded in the partition index entry.
Currently, lookups in the promoted index are done by scanning the index linearly so the lookup
is O(N). For large partitions that's inefficient. It consumes both a lot of CPU and I/O.
We could do better and use binary search in the index. This patch series switches the mc-format
index reader to do that. Other formats use the old way.
The "mc" format promoted index has an extra structure at the end of the index called "offset map".
It's a vector of offsets of consecutive promoted index entries. This allows us to access random
entries in the index without reading the whole index.
The location of the offset entry for a given promoted index entry can be derived by knowing where
the offset vector ends in the index file, so the offset map also doesn't have to be read completely
into the memory.
The most tricky part is caching. We need to cache blocks read from the index file to amortize the
cost of binary search:
- if the promoted index fits in the 32 KiB which was read from the index when looking for
the partition entry, we don't want to issue any additional I/O to search the promoted index.
- with large promoted indexes, the last few bisections will fall into the same I/O block and we
want to reuse that block.
- we don't want the cache to grow too big, we don't want to cache the whole promoted index
as the read progresses over the index. Scanning reads may skip multiple times.
This series implements a rather simple approach which meets all the
above requirements and is not worse than the current state of affairs:
- Each index cursor has its own cache of the index file area which corresponds to promoted index
This is managed by the cached_file class.
- Each index cursor has its own cache of parsed blocks. This allows the upper bound estimation to
reuse information obtained during lower bound lookup. This estimation is used to limit
read-aheads in the data file.
- Each cursor drops entries that it walked past so that memory footprint stays O(log N)
- Cached buffers are accounted to read's reader_permit.
Later, we could have a single cache shared by many readers. For that, we need to come up with eviction
policy.
Fixes#4007.
TESTING RESULTS
* Point reads, large promoted index:
Config: rows: 10000000, value size: 2000
Partition size: 20 GB
Index size: 7 MB
Notes:
- Slicing read into the middle of partition (offset=5000000, read=1) is a clear win for the binary search:
time: 1.9ms vs 22.9ms
CPU utilization: 8.9% vs 92.3%
I/O: 21 reqs / 172 KiB vs 29 reqs / 3'520 KiB
It's 12x faster, CPU utilization is 10x times smaller, disk utilization is 20x smaller.
- Slicing at the front (offset=0) is a mixed bag.
time is similar: 1.8ms
CPU utilization is 6.7x smaller for bsearch: 8.5% vs 57.7%
disk bandwidth utilization is smaller for bsearch but uses more IOs: 4 reqs / 320 KiB (scan) vs 17 reqs / 188 KiB (bsearch)
bsearch uses less bandwidth because the series reduces buffer size used for index file I/O.
scan is issuing:
2 * 128 KB (index page)
2 * 32 KB (data file)
bsearch is issuing:
1 * 64 KB (index page)
15 * 4 KB (promoted index)
1 * 64 KB (data file)
The 1 * 64 KB is chosen dynamically by seastar. Sometimes it chooses 2 * 32 KB (with read-ahead).
32 KB is the minimum I/O currently.
Disk utilization could be further improved by changing the way seastar's dynamic I/O adjustments work
so that it uses 1 * 4 KB when it suffices. This is left for the follow-up.
Command:
perf_fast_forward --datasets=large-part-ds1 \
--run-tests=large-partition-slicing-clustering-keys -c1 --test-case-duration=1
Before:
offset read time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu mem
0 1 0.001836 172 1 545 9 563 175 4.0 4 320 2 2 0 1 1 0 0 0 57.7% 0
0 32 0.001858 502 32 17220 126 17776 11526 3.2 3 324 2 1 0 1 1 0 0 0 56.4% 0
0 256 0.002833 339 256 90374 427 91757 85931 7.0 7 776 3 1 0 1 1 0 0 0 41.1% 0
0 4096 0.017211 58 4096 237984 2011 241802 233870 66.1 66 8376 59 2 0 1 1 0 0 0 21.4% 0
5000000 1 0.022952 42 1 44 1 45 41 29.2 29 3520 22 2 0 1 1 0 0 0 92.3% 0
5000000 32 0.023052 43 32 1388 14 1414 1331 31.1 32 3588 26 2 0 1 1 0 0 0 91.7% 0
5000000 256 0.024795 41 256 10325 129 10721 9993 43.1 39 4544 29 2 0 1 1 0 0 0 86.4% 0
5000000 4096 0.038856 27 4096 105414 398 106918 103162 95.2 95 12160 78 5 0 1 1 0 0 0 61.4% 0
After (v2):
offset read time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu mem
0 1 0.001831 248 1 546 21 581 252 17.6 17 188 2 0 0 1 1 0 0 0 8.5% 0
0 32 0.001910 535 32 16751 626 17770 13896 17.9 19 160 3 0 0 1 1 0 0 0 8.8% 0
0 256 0.003545 266 256 72207 2333 89076 62852 26.9 24 764 7 0 0 1 1 0 0 0 9.7% 0
0 4096 0.016800 56 4096 243812 524 245430 239736 83.6 83 8700 64 0 0 1 1 0 0 0 16.6% 0
5000000 1 0.001968 351 1 508 19 538 380 21.3 21 172 2 0 0 1 1 0 0 0 8.9% 0
5000000 32 0.002273 431 32 14077 436 15503 11551 22.7 22 268 3 0 0 1 1 0 0 0 8.9% 0
5000000 256 0.003889 257 256 65824 2197 81833 57813 34.0 37 652 18 0 0 1 1 0 0 0 11.2% 0
5000000 4096 0.017115 54 4096 239324 834 241310 231993 88.3 88 8844 65 0 0 1 1 0 0 0 16.8% 0
After (v1):
offset read time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu mem
0 1 0.001886 259 1 530 4 545 261 18.0 18 376 2 2 0 1 1 0 0 0 9.1% 0
0 32 0.001954 513 32 16381 93 16844 15618 19.0 19 408 3 2 0 1 1 0 0 0 9.3% 0
0 256 0.003266 318 256 78393 1820 81567 61663 30.8 26 1272 7 2 0 1 1 0 0 0 10.4% 0
0 4096 0.017991 57 4096 227666 855 231915 225781 83.1 83 8888 55 5 0 1 1 0 0 0 15.5% 0
5000000 1 0.002353 232 1 425 2 432 232 23.0 23 396 2 2 0 1 1 0 0 0 8.7% 0
5000000 32 0.002573 384 32 12437 47 12571 429 25.0 25 460 4 2 0 1 1 0 0 0 8.5% 0
5000000 256 0.003994 259 256 64101 2904 67924 51427 37.0 35 1484 11 2 0 1 1 0 0 0 10.6% 0
5000000 4096 0.018567 56 4096 220609 448 227395 219029 89.8 89 9036 59 5 0 1 1 0 0 0 15.1% 0
* Point reads, small promoted index (two blocks):
Config: rows: 400, value size: 200
Partition size: 84 KiB
Index size: 65 B
Notes:
- No significant difference in time
- the same disk utilization
- similar CPU utilization
Command:
perf_fast_forward --datasets=large-part-ds1 \
--run-tests=large-partition-slicing-clustering-keys -c1 --test-case-duration=1
Before:
offset read time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu mem
0 1 0.000279 470 1 3587 31 3829 478 3.0 3 68 2 1 0 1 1 0 0 0 21.1% 0
0 32 0.000276 3498 32 116038 811 122756 104033 3.0 3 68 2 1 0 1 1 0 0 0 24.0% 0
0 256 0.000412 2554 256 621044 1778 732150 559221 2.0 2 72 2 0 0 1 1 0 0 0 32.6% 0
0 4096 0.000510 1901 400 783883 4078 819058 665616 2.0 2 88 2 0 0 1 1 0 0 0 36.4% 0
200 1 0.000339 2712 1 2951 8 3001 2569 2.0 2 72 2 0 0 1 1 0 0 0 17.8% 0
200 32 0.000352 2586 32 91019 266 92427 83411 2.0 2 72 2 0 0 1 1 0 0 0 20.8% 0
200 256 0.000458 2073 200 436503 1618 453945 385501 2.0 2 88 2 0 0 1 1 0 0 0 29.4% 0
200 4096 0.000458 2097 200 436475 1676 458349 381558 2.0 2 88 2 0 0 1 1 0 0 0 29.0% 0
After (v1):
Testing slicing of large partition using clustering keys:
offset read time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu mem
0 1 0.000278 492 1 3598 30 3831 500 3.0 3 68 2 1 0 1 1 0 0 0 19.4% 0
0 32 0.000275 3433 32 116153 753 122915 92559 3.0 3 68 2 1 0 1 1 0 0 0 22.5% 0
0 256 0.000458 2576 256 559437 2978 728075 504375 2.1 2 88 2 0 0 1 1 0 0 0 29.0% 0
0 4096 0.000506 1888 400 790064 3306 822360 623109 2.0 2 88 2 0 0 1 1 0 0 0 36.6% 0
200 1 0.000382 2493 1 2619 10 2675 2268 2.0 2 88 2 0 0 1 1 0 0 0 16.3% 0
200 32 0.000398 2393 32 80422 333 84759 22281 2.0 2 88 2 0 0 1 1 0 0 0 19.0% 0
200 256 0.000459 2096 200 435943 1608 453989 380749 2.0 2 88 2 0 0 1 1 0 0 0 30.5% 0
200 4096 0.000458 2097 200 436410 1651 455779 382485 2.0 2 88 2 0 0 1 1 0 0 0 29.2% 0
* Scan with skips, large index:
Config: rows: 10000000, value size: 2000
Partition size: 20 GB
Index size: 7 MB
Notes:
- Similar time, slightly worse for binary search: 36.1 s (scan) vs 36.4 (bsearch)
- Slightly more I/O for bsearch: 153'932 reqs / 19'703'260 KiB (scan) vs 155'651 reqs / 19'704'088 KiB (bsearch)
Binary search reads more by 828 KB and by 1719 IOs.
It does more I/O to read the the promoted index offset map.
- similar (low) memory footprint. The danger here is that by caching index blocks which we touch as we scan
we would end up caching the whole index. But this is protected against by eviction as demonstrated by the
last "mem" column.
Command:
perf_fast_forward --datasets=large-part-ds1 \
--run-tests=large-partition-skips -c1 --test-case-duration=1
Before:
read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu mem
1 1 36.103451 4 5000000 138491 38 138601 138453 153932.0 153932 19703260 153561 1 0 1 1 0 0 0 31.5% 502690
After (v2):
read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu mem
1 1 37.000145 4 5000000 135135 6 135146 135128 155651.0 155651 19704088 138968 0 0 1 1 0 0 0 34.2% 0
After (v1):
read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu mem
1 1 36.965520 4 5000000 135261 30 135311 135231 155628.0 155628 19704216 139133 1 0 1 1 0 0 0 33.9% 248738
Also in:
git@github.com:tgrabiec/scylla.git sstable-use-index-offset-map-v2
Tests:
- unit (all modes)
- manual using perf_fast_forward
"
* tag 'sstable-use-index-offset-map-v2' of github.com:tgrabiec/scylla:
sstables: Add promoted index cache metrics
position_in_partition: Introduce external_memory_usage()
cached_file, sstables: Add tracing to index binary search and page cache
sstables: Dynamically adjust I/O size for index reads
sstables, tests: Allow disabling binary search in promoted index from perf tests
sstables: mc: Use binary search over the promoted index
utils: Introduce cached_file
sstables: clustered_index: Relax scope of validity of entry_info
sstables: index_entry: Introduce owning promoted_index_block_position
compound_compat: Allow constructing composite from a view
sstables: index_entry: Rename promoted_index_block_position to promoted_index_block_position_view
sstables: mc: Extract parser for promoted index block
sstables: mc: Extract parser for clustering out of the promoted index block parser
sstables: consumer: Extract primitive_consumer
sstables: Abstract the clustering index cursor behavior
sstables: index_reader: Rearrange to reduce branching and optionals
This patch adds "-ffile-prefix-map" to cxxflags for all build modes.
This has two benefits:
1, Relocatable packages no longer have any special build flags, which
makes deeper integration with the build system possible (e.g.
targets for packages).
2 Builds are now reproducible, which makes debugging easier in case you
only have a backtrace, but no artifacts. Rafael explains:
"BTW, I think I found another argument for why we should always build
with -ffile-prefix-map=.
There was user after free test failure on next promotion. I am unable
to reproduce it locally, so it would be super nice to be able to
decode the backtrace.
I was able to do it, but I had to create a
/jenkins/workspace/scylla-master/next/ directory and build from there
to get the same results as the bot."
Acked-by: Botond Dénes <bdenes@scylladb.com>
Acked-by: Nadav Har'El <nyh@scylladb.com>
Acked-by: Rafael Avila de Espindola <espindola@scylladb.com>
This reverts commit ac7237f991. The logic
is wrong and always picks "podman" if it's installed on the system even
if user asks for "docker" with the DBUILD_TOOL environment variable.
This wreaks havoc on machines that have both docker and podman packages
installed, but podman is not configured correctly.
When a token is calculated for stream_id, we check that the key is
exactly 16 bytes long. If it's not - `minimum_token` is returned
and client receives empty result.
This used to be the expected behavior for empty keys; now it's
extended to keys of any incorrect length.
Fixes#6570
All tests that write some data and then read it back need to use
ConsistentRead=True, otherwise the test may sporadically fail on a multi-
node cluster.
In the previous patch we fixed the full_query()/full_scan() convenience
functions. In this patch, I audited the calls to the boto3 read methods -
get_item(), batch_get_item(), query(), scan(), and although most of them
did use ConsistentRead=True as needed, I found some missing and this patch
fixes them.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200616080334.825893-1-nyh@scylladb.com>
Many of the Alternator tests use the convenience functions full_query()/
full_scan() to read from the table. Almost all these tests need to be able
to read their own writes, i.e., want ConsistentRead=True, but none of them
explicitly specified this parameter. Such tests may sporadically fail when
running on cluster with multiple nodes.
So this patch follows a TODO in the code, and makes ConsistentRead=True
the default for the full_*() functions. The caller can still override it
with ConsistentRead=False - and this is necessary in the GSI tests, because
ConsistentRead=True is not allowed in GSIs.
Note that while ConsistentRead=True is now the default for the full_*()
convenience functions, but it is still not the default for the lower level
boto3 functions scan(), query() and get_item() - so usages of those should
be evaluated as well and missing ConsistentRead=True, if any, should be
added.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200616073821.824784-1-nyh@scylladb.com>
SSTables created for the upload directory should be using its custom error
handler.
There is one user of the custom error handler in tree, which is the current
upload directory function. As we will use a free function instead of a lambda
in our implementation we also use the opportunity to fix it for consistency.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
I just noticed while working on the reshape patches that there
is an extra format bracket in two of the debug message. As they
are debug I've seen them less often than the others and that slipped.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Merged patch series by Rafael Ávila de Espíndola:
The main advantage is that callers now don't have to construct
sstrings. It is also a 0.09% win in text size (from 41804308 to
41766484 bytes) and the tps reported by
perf_simple_query --duration 16 --smp 1 -m4G >> log 2>err
in 500 randomized runs goes up by 0.16% (from 162259 to 162517).
Rafael Ávila de Espíndola (3):
service: Pass a std::string_view to client_state::set_keyspace
cql3: Use a flat_hash_map in untyped_result_set_row
cql3: Pass std::string_view to various untyped_result_set member
functions
cql3/untyped_result_set.hh | 30 ++++++++++++++++--------------
service/client_state.hh | 2 +-
cql3/untyped_result_set.cc | 6 +++---
service/client_state.cc | 4 ++--
4 files changed, 22 insertions(+), 20 deletions(-)
Debian package builds provide a root environment for the installation
scripts, since that's what typical installation scripts expect. To
avoid providing actual root, a "fakeroot" system is used where syscalls
are intercepted and any effect that requires root (like chown) is emulated.
However, fakeroot sporadically fails for us, aborting the package build.
Since our install scripts don't really require root (when operating in
the --packaging mode), we can just tell dpkg-buildpackage that we don't
need fakeroot. This ought to fix the sporadic failures.
As a side effect, package builds are faster.
Fixes#6655.
Currently, index reader uses 128 KiB I/O size with read-ahead. That is
a waste of bandwidth if index entries contain large promoted index and
binary search will be used within the promoted index, which may not
need to access as much.
The read-ahead is wasted both when using binary search and when using
the scanning cursor.
On the other hand, large I/O is optimal if there is no promoted index
and we're going to parse the whole page.
There is no way to predict which case it is up front before reading
the index.
Attaching dynamic adjustments (per-sstable) lets the system auto adjust
to the workload from past history.
The large promoted index workload will settle on reading 32 KiB (with
read-ahead). This is still not optimal, we should lower the buffer
size even more. But that requires a seastar change, so is deferred.
Currently, lookups in the promoted index are done by scanning the index linearly so the lookup
is O(N). For large partitions that's inefficient. It consumes both a lot of CPU and I/O.
We could do better and use binary search in the index. This patch series switches the mc-format
index reader to do that. Other formats use the old way.
The "mc" format promoted index has an extra structure at the end of the index called "offset map".
It's a vector of offsets of consecutive promoted index entries. This allows us to access random
entries in the index without reading the whole index.
The location of the offset entry for a given promoted index entry can be derived by knowing where
the offset vector ends in the index file, so the offset map also doesn't have to be read completely
into the memory.
The most tricky part is caching. We need to cache blocks read from the index file to amortize the
cost of binary search:
- if the promoted index fits in the 32 KiB which was read from the index when looking for
the partition entry, we don't want to issue any additional I/O to search the promoted index.
- with large promoted indexes, the last few bisections will fall into the same I/O block and we
want to reuse that block.
- we don't want the cache to grow too big, we don't want to cache the whole promoted index
as the read progresses over the index. Scanning reads may skip multiple times.
This patch implements a rather simple approach which meets all the
above requirements and is not worse than the current state of affairs:
- Each index cursor has its own cache of the index file area which corresponds to promoted index
This is managed by the cached_file class.
- Each index cursor has its own cache of parsed blocks. This allows the upper bound estimation to
reuse information obtained during lower bound lookup. This estimation is used to limit
read-aheads in the data file.
- Each cursor drops entries that it walked past so that memory footprint stays O(log N)
- Cached buffers are accounted to read's reader_permit.
It is a read-through cache of a file.
Will be used to cache contents of the promoted index area from the
index file.
Currently, cached pages are evicted manually using the invalidate_*()
method family, or when the object is destroyed.
The cached_file represents a subset of the file. The reason for this
is to satisfy two requirements. One is that we have a page-aligned
caching, where pages are aligned relative to the start of the
underlying file. This matches requirements of the seastar I/O engine
on I/O requests. Another requirement is to have an effective way to
populate the cache using an unaligned buffer which starts in the
middle of the file when we know that we won't need to access bytes
located before the buffer's position. See populate_front(). If we
couldn't assume that, we wouldn't be able to insert an unaligned
buffer into the cache.
entry_info holds views, which may get invalidated when the containing
index blocks are removed. Current implementations of next_entry() keeps
the blocks in memory as long as the cursor is alive but that will
change in new implementations of the cursor.
Adjust the assumption of tests accordingly.
In preparation for supporting more than one algorithm for lookups in
the promoted index, extract relevant logic out of the index_reader
(which is a partition index cursor).
The clustered index cursor implementation is now hidden behind
abstract interface called clustered_index_cursor.
The current implementation is put into the
scanning_clustered_index_cursor. It's mostly code movement with minor
adjustments.
In order to encapsulate iteration over promoted index entries,
clustered_index_cursor::next_entry() was introduced.
No change in behavior intended in this patch.
This adds support for configuring whether to run dbuild with 'docker' or
'podman' via a new environment variable, DBUILD_TOOL. While at it, check
if 'podman' exists, and prefer that by default as the tool for dbuild.
In this patch I rewrote the explanations in both README.md and HACKING.md
about Scylla's dependencies, and about dbuild.
README.md used to mention only dbuild. It now explains better (I think)
why dbuild is needed in the first place, and that the alternative is
explained in HACKING.md.
HACKING.md used to explain *only* install-dependencies.sh - and now explains
why it is needed, what install-dependencies.sh and that it ONLY works on
very recent distributions (e.g., Fedora older than 32 are not supported),
and now also mentions the alternative - dbuild.
Mentions of incorrect requirements (like "gcc > 8.1") were fixed or dropped.
Mention of the archaic 'scripts/scylla_current_repo' script, which we used
to need to install additional packages on non-Fedora systems, was dropped.
The script itself is also removed.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200616100253.830139-1-nyh@scylladb.com>
nonwrapping_range<T> and related templates represent mathematical
intervals, and are different from C++ ranges. This causes confusion,
especially when C++ ranges and the range templates are used together.
As the first step to disentable this, introduce a new interval.hh
header with the contents of the old range.hh header, renaming as
follows:
range_bound -> interval_bound
nonwrapping_range -> nonwrapping_interval
wrapping_range -> wrapping_interval
Range -> Interval (concepts)
The range alias, which previously aliased wrapping_range, did
not get renamed - instead the interval alias now aliases
nonwrapping_interval, which is the natural interval type. I plan
to follow up making interval the template, and nonwrapping_interval
the alias (or perhaps even remove it).
To avoid churn, a new range.hh header is provided with the old names
as aliases (range, nonwrapping_range, wrapping_range, range_bound,
and Range) with the same meaning as their former selves.
Tests: unit (dev)
This series contains two improvements to hint file replay logic
in hints manager:
- During replay of a hint file, keeping track of the first hint that fails
to be sent is now done via a simple std::optional variable instead of an
unordered_set. This slightly reduces complexity of next replay position
calculation.
- A corner case is handled: if reading commitlog fails, but there won't be an
error related to sending hints, starting position wouldn't be updated. This
could cause us to replay more hints than necessary.
Tests:
- unit(dev)
- dtest(hintedhandoff_additional_test, dev)
* piodul-hints-manager-handle-commitlog-failure-in-replay-position-calculation:
hinted handoff: use bool instead of send_state_set
hinted handoff: update replay position on commitlog failure
hinted handoff: remove rps_set, use first_failed_rp instead
* seastar 81242ccc3f...8f0858cfd7 (18):
> Merge 'future, future-utils: stop returning a variadic future from when_all_succeed'
> file: introduce layered_file_impl, a helper for layered files
> net: packet: mark move assignment operator as noexcept
> core: weak_ptr, weakly_referencable: implement empty default constructor
> circular_buffer: Fix build with gcc 11 (avoid template parameters in d'tor declaration)
> test: weak_ptr_test: fix static asserts about nothrow constructibility
> coroutines: Fix clang build
> cmake: Delete SEASTAR_COROUTINES_TS
> Merge "future-util: Mark a few more functions as noexcept" from Rafael
> tests: add a perf test to measure the fair_queue performance
> Merge "iostream: make iostream stack nothrow move constructible" from Benny
> future: Move most of rethrow_with_nested out of line.
> future_test: Add test for nested exceptions in finally
> core: Add noexcept to unaligned members functions
> Merge "core: make weak_ptr and checked_ptr default and move nothrow constructible" from Benny
> core: file: Fix typo in a comment
> byteorder: Mark functions as noexcept
> future: replace CanInvoke concepts with std::invocable
Avi says:
"The backlog is a large number that changes slowly, so float
might not have enough resolution to track small changes.
For example, if the backlog is 800GB and changes less than 100kB, then
we won't see a change (float resolution is 2^23 ~ 1:8,000,000).
This is outside the normal range of values (usually the backlog changes
a lot more than 100kB per 15-second period), so it will work, but better
to be more careful."
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200615150621.17543-1-raphaelsc@scylladb.com>
The patch introduces two new features to aid with negotiating
protocol extensions for the CQL protocol:
- `cql_protocol_extensions` enum, which holds all supported
extensions for the CQL protocol (currently contains only
`LWT_ADD_METADATA_MARK` extension, which will be mentioned
below).
- An additional mechainsm of negotiating cql protocol extensions
to be used in a client connection between a scylla server
and a client driver.
These extensions are propagated in SUPPORTED message sent from the
server side with "SCYLLA_" prefix and received back as a response
from the client driver in order to determine intersection between
the cql extensions that are both supported by the server and
acknowledged by a client driver.
This intersection of features is later determined to be a working
set of cql protocol extensions in use for the current `client_state`,
which is associated with a particular client connection.
This way we can easily settle on the used extensions set on
both sides of the connection.
Currently there is only one value: `LWT_ADD_METADATA_MARK`, which
regulates whether to set a designated bit in prepared statement
metadata indicating if the statement at hand is an lwt statement
or not (actual implementation for the feature will be in a later
patch).
Each extension can also propagate some custom parameters to the
corresponding key. CQL protocol specification allows to send
a list of values with each key in the SUPPORTED message, we use
that to pass parameters to extensions as `PARAM=VALUE` strings.
In case of `LWT_ADD_METADATA_MARK` it's
`SCYLLA_LWT_OPTIMIZATION_META_BIT_MASK` which designates the
bitmask for LWT flag in prepared statement metadata in order to be
used for lookup in a client library. The associated bits of code in
`cql3::prepared_metadata` are adjusted to accomodate the feature.
The value for the flag is chosen on purpose to be the last bit
in the flags bitset since we don't want to possibly clash with
C* implementation in case they add more possible flag values to
prepared metadata (though there is an issue regarding that:
https://issues.apache.org/jira/browse/CASSANDRA-15746).
If it's fixed in upstream Cassandra, then we could synchronize
the value for the flag with them.
Also extend the underlying type of `flag` enum in
`cql3::prepared_metadata` to be `uint32_t` instead of `uint8_t`
because in either case flags mask is serialized as 32-bit integer.
In theory, shard-awareness extension support also should be
reworked in terms of provided minimal infrastructure, but for the
sake of simplicity, this is left to be done in a follow-up some
time later.
This solution eliminates the need to assume that all the client
drivers follow the CQL spec carefully because scylla-specific
features and protocol extensions could be enabled only in case both
server and client driver negotiate the supported feature set.
Tests: unit(dev, debug)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
No functionality changed. This just makes it possible to use
heterogeneous lookups, which the next patch will add.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
No change in the implementation since it was already copying the
string. Taking a std::string_view is just a bit more flexible.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
But not compaction.
When reclaiming segments to seastar non-empty segments are copied
as-is to some other place. Instead of doing this reclaimer can copy
only allocated objects and leave the freed holes behing, i.e. -- do
the regular compaction. This would be the same or better from the
timing perspective, and will help to avoid yet another compaction
pass over the same set of objects in the future.
Current migration code checks for the free segments reserve to be
above minimum to proceed with migration, so does the code after this
patch, thus the segment compaction is called with non-empty free
segments set and thus it's guaranteed not to fail the new segment
allocation (if it will be required at all).
Plus some bikeshedding patches for the run-up.
tests: unit(dev)
* https://github.com/xemul/scylla/tree/br-logalloc-compact-on-reclaim-2:
logalloc: Compact segments on reclaim instead of migration
logallog: Introduce RAII allocation lock
logalloc: Shuffle code around region::impl::compact
logalloc: Do not lock reclaimer twice
logalloc: Do not calculate object size twice
logalloc: Do not convert obj_desc to migrator back and forth
SSTable_set is now an optional, and if we don't want to expire data
it will be empty. We need to check that it is not empty before dereferencing
it.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20200610170647.142817-1-glauber@scylladb.com>
This is relevant only when using partition or clustering keys which
have a representation in memory which is larger than 12.8 KB (10% of
LSA segment size).
There are several places in code (cache, background garbage
collection) which may need to linearize keys because of performing key
comparison, but it's not done safely:
1) the code does not run with the LSA region locked, so pointers may
get invalidated on linearization if it needs to reclaim memory. This
is fixed by running the code inside an allocating section.
2) LSA region is locked, but the scope of
with_linearized_managed_bytes() encloses the allocating section. If
allocating section needs to reclaim, linearization context will
contain invalidated pointers. The fix is to reorder the scopes so
that linearization context lives within an allocating section.
Example of 1 can be found in
range_populating_reader::handle_end_of_stream() where it performs a
lookup:
auto prev = std::prev(it);
if (prev->key().equal(*_cache._schema, *_last_key->_key)) {
it->set_continuous(true);
but handle_end_of_stream() is not invoked under allocating section.
Example of 2 can be found in mutation_cleaner_impl::merge_some() where
it does:
return with_linearized_managed_bytes([&] {
...
return _worker_state->alloc_section(region, [&] {
Fixes#6637.
Refs #6108.
Tests:
- unit (all)
Message-Id: <1592218544-9435-1-git-send-email-tgrabiec@scylladb.com>
Merged pull request https://github.com/scylladb/scylla/pull/6551
from Juliusz Stasiewicz:
The command regenerates streams when:
generations corresponding to a gossiped timestamp cannot be
fetched from system_distributed table,
or when generation token ranges do not align with token metadata.
In such case the streams are regenerated and new timestamp is
gossiped around. The returned JSON is always empty, regardless of
whether streams needed regeneration or not.
Fixes#6498
Accompanied by: scylladb/scylla-jmx#109, scylladb/scylla-tools-java#172
Since scylla-cpupower.service isn't installed by .rpm package, but created
in the setup script, it's better to not use /usr/lib directory, use /etc.
We already doing same way for scylla-server.service.d/*.conf, *.mount, and
*.swap created by setup scripts.
In decommission operation, current code requires a node in local dc to
sync data with. This requirement is too strong for a non network topology
strategy. For example, consider:
n1 dc1
n2 dc1
n3 dc2
n2 runs decommission operation. For a keyspace with simple strategy and
RF = 2, it is possible n3 is the new owner but n3 is not in the same dc
as n2.
To fix, perform the dc check only for the network topology strategy.
Fixes#6564
"
This series Adds a pseudo-floating-point histogram implementation.
The histogram is used for time_estimated_histogram a histogram for latency tracking and then used in storage_proxy as a more efficient with a higher resolution histogram.
Follow up series would use the new histogram in other places in the system and will add an implementation that supports lower values.
Fixes#5815Fixes#4746
"
* amnonh-quicker_estimated_histogram:
storage_proxy: use time_estimated_histogram for latencies
test/boost/estimated_histogram_test
utils/histogram_metrics_helper Adding histogram converter
utils/estimated_histogram: Adding approx_exponential_histogram
5ceb20c439 switched --enable-dpdk
to a tristate switch, but forgot that add_tristate() prepends
--enable and --disable itself; so now the switch looks like
--enable-enable-dpdk and --disable-enable-dpdk.
Fix by removing the "enable-" prefix.
This patch adds a helper converter function to convert from a
approx_exponential_histogram histogram to a seastar::metrics::histogram
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch adds an efficient histogram implementation.
The implementation chooses efficiency over flexibility.
That is why templates are used.
How the approx_exponential_histogram pseudo floating point histogram
works: It split the range [MIN, MAX] into log2(MAX/MIN) ranges it then
split each of that ranges linearly according to a given resolution.
For example, using resolution of 4, would be similar to using an
exponentially growing histogram with a coefficient of 1.2.
All values are uint64. To prevent handling of corner cases, it is not
allowed to set the MIN to be lower than the resolution.
The approx_exponential_histogram will probably not be used directly,
the first used is by time_estimated_histogram. A histogram for durations.
It should be compared to the estimated_histogram.
Performance comparison:
Comparison was done by inserting 2^20 values into
time_estimated_histogram and estimated_histogram.
In debug mode on a local machine insert operation took an average of
26.0 nanoseconds vs 342.2 nanoseconds.
In release mode insert operation took an average of 1.90 vs 8.28 nanoseconds
Fixes#5815
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
The main goal of this series is to implement FilterExpression - the
newer syntax for filtering results of Query and Scan requests.
This feature itself is just one simple patch - it just needs to have the
already-existing filtering code call the already-existing expression
evaluation code. However, before we can do this, we need a patch to
refactor the expression-evaluation interface (this patch also fixes
pre-existing bugs). Then we need three additional patches to fix pre-
existing bugs in the various corner cases of expressions (this bugs
already existed in ConditionExpression but now became visible in
tests for FilerExpression). Finally, in the end of the series, we also
do a bit of code cleanup.
After this series, the FilterExpression feature is complete, and all
tests for this feature pass.
Tests: unit(dev)
* 'alternator-filterexpression' of git://github.com/nyh/scylla:
alternator: avoid unnecessary conversion to string
alternator: move some code out of executor.cc
alternator: implement FilterExpression
alternator: improve error path of attribute_type() function
alternator: fix begins_with() error path
alternator: fix corner case of contains() function in conditions
alternator: refactor resolving of references in expressions
"
This is part of the work for replacing global sstring variables with
constexpr std::string_view ones.
To have std::string_view values we have to convert a few APIs to take
std::string_view instead of sstring references.
The API conversions are complicated by the fact that
std::unordered_map doesn't support heterogeneous lookup, so we need
another hash map.
The one provided by abseil seems like a natural choice since it has an
API that looks like what is being proposed for c++
(http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2019/p1690r0.html)
but is also much faster.
A nice side effect is that this series is a 0.46% win in
perf_simple_query --duration 16 --smp 1 -m4G
Over 500 runs with randomized section layout and environment on each
run.
"
* 'espindola/absl-v10' of https://github.com/espindola/scylla:
database: Use a flat_hash_map for _ks_cf_to_uuid
database: Use flat_hash_map for _keyspaces
Add absl wrapper headers
build: Link with abseil
cofigure: Don't overwrite seastar_cflags
Add abseil as a submodule
Given that the key is a std::pair, we have to explicitly mark the hash
and eq types as transparent for heterogeneous lookup to work.
With that, pass std::string_view to a few functions that just check if
a value is in the map.
This increases the .text section by 11 KiB (0.03%).
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This changes the hash map used for _keyspaces. Using a flat_hash_map
allows using std::string_view in has_keyspace thanks to the
heterogeneous lookup support.
This add 200 KiB to .text, since this is the first use of absl and
brings in files from the .a.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Using these instead of using the absl headers directly adds support
for heterogeneous lookup with sstring as key.
The is no gain from having the hash function inline, so this
implements it in a .cc file.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
It is a pity we have to list so many libraries, but abseil doesn't
provide a .pc file.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The variable seastar_cflags was being used for flags passed to seastar
and for flags extracted from the seastar.pc file.
This introduces a new variable for the flags extracted from the
seastar.pc file.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This adds the https://abseil.io library as a submodule. The patch
series that follows needs a hash table that supports heterogeneous
lookup, and abseil has a really good hash table that supports that
(https://abseil.io/blog/20180927-swisstables).
The library is still not available in Fedora, but it is fairly easy to
use it directly from a submodule.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
"
Use table_id instead of table_name in row level repair to find a table. It
guarantees we repair the same table even if a table is dropped and a new
table is created with the same name.
Refs: #5942
"
* asias-repair_use_table_id_instead_of_table_name:
repair: Do not pass table names to repair_info
repair: Add table_id to row_level_repair
repair: Use table id to find a table in get_sharder_for_tables
repair: Add table_ids to repair_info
repair: Make func in tracker::run run inside a thread
This backlog metric holds the sum of backlog for all the tables
in the system. This is very useful for understanding the behavior
of the backlog trackers. That's how we managed to fix most of
backlog bugs like #6054, #6021, etc.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200612194908.39909-1-raphaelsc@scylladb.com>
"
The cql_server and thrift are "owned" by storage_service for
the sake of managing those, i.e. starting and stopping. Since
other services (still) need the storage_service this creates
dependencies loops.
This set makes the client services independent from the storage
service. As a consequence of it the auth service is also removed
from storage_service and put standalone. This, in turn, sets
some tests free from the need to start and stop auth and makes
one step towards NOT join_cluster()-ing in unit tests.
Also the set fixes few wierd races on scylla start and stop
that can trigger local_is_initialized() asserts, and one case of
unclear aborted shutdown when client services remain running
till the scylla process exit.
Yet another benefit is localization of "isolating" functionality
that sits deeper in storage_service than it should.
One thing that's not completely clean after it is the need for cql
server to continue referencing the service_memory_limiter semaphore
from the storage_service, but this will go away with one of the
next sets.
tests: unit(debug), manual start-stop,
nodetool check of cql/thrift start/stop
"
* 'br-split-transport-1' of https://github.com/xemul/scylla:
storage_service: Isolate isolator
auth: Move away from storage_service
auth: Move start-stop code into main
main: Don't forget to stop cql/thrift when start is aborted
thrift_controller: Switch on standalone
thrift_controller: Pass one through management API
thrift_controller: Move the code into thrift/
thrift_controller: Introduce own lock for management
thrift: Wrap start/stop/is_running code into a class
cql_controller: Switch on standalone
cql_controller: Pass one through management API
cql_controller: Move the code into transport/
cql_controller: Introduce own lock for management
cql: Wrap start/stop/is_running code into a class
api: Tune reg/unreg of client services control endpoints
A lot is going on when calculating effective ownership.
For each node in the cluster, we need to go over all the ranges belong
to that node and see if that node is the owner or not.
This patch uses futurized loops with do_for_each so it would preempt if
needed.
The patch replaces the current for-loops with do_for_each and do_with
but keeps the logic.
Fixes#6380
In a couple of places, where we already have a std::string_view, there
is no need to convert to to a std::string (which requires allocation).
One cool observation (by Piotr Sarna) is that map over std::string_view
is fine, when the strings in the map are always string constants.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The source file alternator/executor.cc has grown too much, reaching almost
4,000 lines. In this patch I move about 400 lines out of executor.cc:
1. Some functions related to serialization of sets and lists were moved to
serialization.cc,
2. Functions related to evaluating parsed expressions were moved to
expressions.cc.
The header file expressions_eval.hh was also removed - the calculate_value()
functions now live in expressions.cc, so we can just define them in
expressions.hh, no need for a separate header files.
This patch just moves code around. It doesn't make any functional changes.
Refs #5783.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch provides a complete implementation for the FilterExpression
parameter - the newer syntax for filtering the results of the Query or
Scan operations.
The implementation is pretty straightforward - we already added earlier
a result-filtering framework to Alternator, and used it for the older
filtering syntax - QuryFilter and ScanFilter. All we had to do now was
to run the FilterExpression (which has the same syntax as a
ConditionExpression) on each individual items. The previous cleanup
patches were important to reduce the friction of running these expressions
on the items.
After the previous patches fixing small esoteric bugs in a few expression
functions, with this patch *all* the tests in test_filter_expression.py
now pass, and so do the two FilterExpression tests in test_query.py and
test_scan.py. As far as I know (and of course minus any bugs we'll discover
later), this marks the FilterExpression feature complete.
Fixes#5038.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The attribute_type() function, which can be used in expressions like
ConditionExpression and FilterExpression, is supposed to generate an
error if its second parameter is not one of the known types. What we
did until now was to just report a failed check in this case.
We already had a reproducing test with FilterExpression, but in this patch
we also add a test with ConditionExpression - which fails before this
patch and passes afterwards (and of course, passes with DynamoDB).
Fixes#6641.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The begins_with() function should report an error if a constant is
passed to it which isn't one of the supported types - string or bytes
(e.g., a number).
The code we had to check this had wrong logic, though. If the item
attribute was also a number, we silently returned false, and didn't
go on to detect that the second parameter - a constant - was a number
too and should generate an error - not be silent.
Fixed and added a reproducing test case and another test to validate
my understanding of the type of parameters that begins_with() accepts.
Fixes#6640.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
It turns out that the contains() functions in the new syntax of
conditions (ConditionExpression, FilterExpression) is not identical
to the CONTAINS operator in the old-syntax conditions (Expected).
In the new syntax, one can check whether *any* constant object is contained
in a list. In the old syntax, the constant object must be of specific
types.
So we need to move the testing out of the check_CONTAINS() functions
that both implementations used, and into just the implementation of
the old syntax (in conditions.cc).
This bug broke one of the FilterExpression tests, but this patch also
adds new tests for the different behaviour of ConditionExpression and
Expected - tests which also reproduce this issue and verify its fix.
Fixes#6639.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In the DynamoDB API, expressions (e.g., ConditionExpression and many more)
may contain references to column names ("#name") or to values (":val")
given in a separate part of the request - ExpressionAttributeNames and
ExpressionAttributeValues respectively.
Before this patch, we resolved these references as part of the expression's
evaluation. This approach had two downsides:
1. It often misdiagnosed (both false negatives and false positives) cases
of unused names and values in expressions. We already had two xfailing
tests with examples - which pass after this patch. This patch also
adds two additional tests, which failed before this patch and pass
with it.
2. In one of the following patches we will add support for FilterExpression,
where the same expression is used repeatedly on many items. It is a waste
(as well as makes the code uglier) to resolve the same references again
and again each time the expression is evaluated. We should be able
to do it just once.
So this patch introduces an intermediate step between parsing and evaluating
an expression - "resolving" the expression. The new resolve_*() functions
modify the already parsed expression, replacing references to attribute
names and constant values by the actual names and values taken from the
request. The resolve_*() functions also keep track which references were
used, making it very easy to check (as DynamoDB does) if there are any
unused names or values, before starting the evaluation.
The interface of evaluate() functions become much simpler - they no longer
need to know the original request (which was previously needed for
ExpressionAttributeNames/Values), the table's schema (which was previously
needed only for some error checking), keep track of which references were
used. This simplification is helpful for using the expressions in contexts
where these things (request and schema) are no longer conveniently available,
namely in FilterExpression.
A small side-benefit of this patch is that it moves a bit of code, which
handled resolving of references in expressions, from executor.cc to
expressions.cc. This is just the first step in a bigger effort to
reduce the size of executor.cc by moving code to smaller source files.
There is no attempt in this patch to move as much code as we can.
We will move more code in a separate patch in this series.
Fixes#6572.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
LCS and SCTS already have their own files, reducing the clutter in
compaction_strategy.cc. Do the same for TWCS. I am doing this in
preparation to add more functions.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20200611230906.409023-6-glauber@scylladb.com>
There is a code that isolates a node on disk error. After all the previous
changes this code can be collected in one place (better to move it away from
storage_service at all, but still).
This simplifies the stop_transport(): now it can avoid rescheduling itself
on shard 0 for the 2nd time.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now after the auth start/stop is standalone, we can remove
reference from storage service to it. This frees some tests
from the need to carry the auth service around for nothing.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The auth service management is currently sitting in storage
service, but it was needed there just for cql/thrift start
code. After the latters has been moved away there are no
other reasons for the auth to be integrated with the storage
service, so move it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The defer action for stopping the storage_service is registered
very late, after the cql and thrift started. If an error happens
in between, these client-shutdown hooks will not be called.
This is a problem with the hooks, but fixing it in hooks place
is a big rework, so for now put fuses for cql and thrift
individually -- both their stopping codes are re-entrable.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Remove the on-storage_service instance and make everybody use
th standalone one.
Stopping the thrift is done by registering the controller in
client service shutdown hooks. This automatically wires the
stopping into drain, decommission and isolation codes.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The goal is to make the relevant endpoints work on standalone
thrift controller instead of the storage_service's one, so
prepare this controller (dummy for now) and pass it all the
way down the API code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently start/stop of thrift is protected with storage_service's
run_with_api_lock, but this protection is purely needed to
guard start and stop against each other, not from anything else.
For the sake of thrift management isolation it's worth having its own
start-stop lock. This also decouples thrift code from storage_service's
"isolated" thing.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The plan is to decouple thrift management code from
storage_service and move into thrift/ directory, so
prepare for that by introducing a controller class.
This leaves some unclean indentation in start/stop helpers
to reduce the churn, it will be fixed two patches ahead.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Remove the on-storage_service instance and make everybody use
th standalone one.
Stopping the server is done by registering the controller in
client service shutdown hooks. This automatically wires the
stopping into drain, decommission and isolation codes.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The goal is to make the relevant endpoints work on standalone
cql controller instead of the storage_service's one, so
prepare this controller (dummy for now) and pass it all the
way down the API code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently start/stop of cql is protected with storage_service's
run_with_api_lock, but this protection is purely needed to
guard start and stop against each other, not from anything else.
For the sake of cql server isolation it's worth having its own
start-stop lock. This also decouples cql code from storage_service's
"isolated" thing.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The plan is to decouple cql server management code from
storage_service and move into transport/ directory, so
prepare for that by introducing a controller class.
This leaves some unclean indentation in start/stop helpers
to reduce the churn, it will be fixed two patches ahead.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currntly API endpoints to start and stop cql_server and thrift
are registered right after the storage service is started, but
much earlier than those services are. In between these two
points a lot of other stuff gets initialized. This opens a small
window during which cql_server and thrift can be started by
hand too early.
The most obvious problem is -- the storage_service::join_cluster()
may not yet be called, the auth service is thus not started, but
starting cql/thrift needs auth.
Another problem is those endpoints are not unregistered on stop,
thus creating another way to start cql/thrif at wrong time.
Also the endpoints registration change helps further patching.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
After restart_segment was removed from send_state enum, send_state_set
now has only one possible element: segment_replay_failed.
This patch removes send_state_set and uses bool in its place instead.
Hints manager uses commitlog framework to store and replay hints.
The commitlog::read_log_file function is used for replaying hints. It
reads commitlog entries and passes them to a callback. In case of hints
manager, the callback calls manager::send_one_hint function.
In case something goes wrong during this process, sending of that file
is attempted again later. If the error was caused by hints that failed
to be sent (e.g. due to network error), then we also advance
_last_not_complete_rp field to the position of the first hint that
failed. In the next retry, we will start reading from the commitlog from
that position.
However, current logic does not account for the case when an error
occurs in the commitlog::read_log_file function itself. If,
coincidentally, all hints sent by send_one_hint succeed, then we won't
advance the _last_not_complete_rp field and we may unnecessarily repeat
sending some of the hints that succeeded.
This patch adds the send_one_file_ctx::last_sent_rp field, which keeps
track of the last commitlog position for which a hint was attempted to
be sent. In case read_log_file throws an error but all send_one_hint
calls succeed, then it will be used to update _last_not_complete_rp.
This will reduce the amount of hints that are resent in this case to
only one.
Tests:
- unit(dev)
- dtest(hintedhandoff_additional_test, dev)
When sending hints from one file, rps_set is used to keep track of
positions of hints that are currently sent. If sending of a hint fails,
its position is not removed from rps_set. If some hints fail to be sent
while handling a hints file, the lowest position from rps_set is used
to calculate the position from where to start when sending of the file
is retried.
Keeping track of commitlog positions this way isn't necessary to
calculate this position. This patch removes rps_set and replaces it
with first_failed_rp - which is just a single
std::optional<db::replay_position>. This value is updated when a hint
send failure is detected.
This simplifies calculation of starting position for the next retry, and
allowed to remove some error handling logic related to an edge case when
inserting to rps_set fails.
- unit(dev)
- dtest(hintedhandoff_additional_test, dev)
tracked_file_impl is a wrapper around another file, that tracks
memory allocated for buffers in order to control memory consumption.
However, it neglects to inherit the disk and memory alignment settings
from the wrapped file, which can cause unnecessarily-large buffers
to be read from disk, reducing throughput.
Fix by copying the alignment parameters.
Fixes#6290.
To reduce special cases for the build bots, default dpdk to enabled
in release mode, keeping it disabled for debug and dev.
To allow release modes without dpdk to be build, the --enable-dpdk
switch is converted to a tri-state. When disabled, dpdk is disabled
across all modes. Similarly when enabled the effect is global. When
unspecified, dpdk is enabled for release mode only.
After this change, reloc/build_reloc.sh no longer needs to specify
--enable-dpdk, so remove it.
The messaging service constructor's body does two main things in this
order:
1. it registers the CLIENT_ID verb with rpc.
2. it initializes the scheduling mechanism in charge of locating the
right scheduling group for each verb.
The registration function uses the scheduling mechanism to determine
the scheduling group for the verb.
This commit simply reverses the order of execution.
Fixes#6628
When compaction A completes, a request is issued so that all parallel compactions
will replace compaction A's input sstables by respective output sstables, in the
SSTable set snapshot used for expiration purposes.
That's done to allow space of input SSTables to be released as soon as possible,
helping a lot incremental compaction, but also the non-incremental approach.
Recently I came to realization that we're copying the SSTable set, when doing the
replacement, to make the code exception safe, but it turns out that if an exception
is triggered, the compaction will fail anyway. So this copy is very useless and a
potential source of reactor stalls if strategies like LCS is used.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200608192614.9354-1-raphaelsc@scylladb.com>
"
After "Make replacing node take writes" series, with repair based node
operations disabled, we saw the replace operation fail like:
```
[shard 0] init - Startup failed: std::runtime_error (unable to find
sufficient sources for streaming range (9203926935651910749, +inf) in
keyspace system_auth)
```
The reason is the system_auth keyspace has default RF of 1. It is
impossible to find a source node to stream from for the ranges owned by
the replaced node.
In the past, the replace operation with keyspace of RF 1 passes, because
the replacing node calls token_metadata.update_normal_tokens(tokens,
ip_of_replacing_node) before streaming. We saw:
```
[shard 0] range_streamer - Bootstrap : keyspace system_auth range
(-9021954492552185543, -9016289150131785593] exists on {127.0.0.6}
```
Node 127.0.0.6 is the replacing node 127.0.0.5. The source node check in
range_streamer::get_range_fetch_map will pass if the source is the node
itself. However, it will not stream from the node itself. As a result,
the system_auth keyspace will not get any data.
After the "Make replacing node take writes" series, the replacing node
calls token_metadata.update_normal_tokens(tokens, ip_of_replacing_node)
after the streaming finishes. We saw:
```
[shard 0] range_streamer - Bootstrap : keyspace system_auth range
(-9049647518073030406, -9048297455405660225] exists on {127.0.0.5}
```
Since 127.0.0.5 was dead, the source node check failed, so the bootstrap
operation.
Ta fix, we ignore the table of RF 1 when it is unable to find a source
node to stream.
Fixes#6351
"
* asias-fix_bootstrap_with_rf_one_in_range_streamer:
range_streamer: Handle table of RF 1 in get_range_fetch_map
streaming: Use separate streaming reason for replace operation
We were not consistent about using '#include "foo.hh"' instead of
'#include <foo.hh>' for scylla's own headers. This patch fixes that
inconsistency and, to enforce it, changes the build to use -iquote
instead of -I to find those headers.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200608214208.110216-1-espindola@scylladb.com>
There is no point to hold prepared_metadata in result_message::prepared
as a shared_ptr since their lifetime match.
Message-Id: <20200610113217.GF335449@scylladb.com>
The test test_key_condition_expression_multi() had a small typo, which
was hidden by the fact that the request was expected to fail for other
reasons, but nevertheless should be fixed.
Moreover, it appears that the Amazon DynamoDB changed their error message
for this case, so running the test with "--aws" failed. So this patch
makes it work again by being more forgiving on the exact error message.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200609205628.562351-1-nyh@scylladb.com>
In the existing Alternator code, we used std::unique_ptr<rjson::value> for
passing the optional old value of an item read for a RMW operation.
The benfit of this type over the simpler "const rjson::value*" is that it
gives the callee ownership of the item, and thus the ability to move parts
of it into the response without copying them. We only used this ability in a
handful of obscure cases involving ReturnedValues, but I am not going to
break this dubious feature in this patch.
Nevertheless, a lot of internal code, like condition checks, just needs
read-only access to that previous item, so we passed a reference to the
unique_ptr, i.e., "const std::unique_ptr<rjson::value>&". This is ugly,
and also forces new code that wants to use the same condition checks (i.e.,
filtering code), to artificially allocate a unique_ptr just because that
is what these functions expect.
So in this patch, we change the utility functions such as
verify_condition_expression() and everything they use, to pass around a
"const rjson::value*" instead of a "const std::unique_ptr<rjson::value>&.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200604131352.436506-1-nyh@scylladb.com>
Since we don't support Ubuntu 14.04 anymore, we can drop Upstart related code
from supervisor.[cc|hh].
Also, "#ifdef HAVE_LIBSYSTEMD" was for compiling Scylla on older distribution
which does not provide libsystemd, we nolonger need this since we always build
Scylla on latest Fedora.
Dropping HAVE_LIBSYSTEMD also means removing libsystemd from optional_packages
in configure.py, make it required library.
Note that we still may run Scylla without systemd such as our Docker image,
but sd_notify() does nothing when systemd does not detected, so we can ignore
such case.
Reference: https://www.freedesktop.org/software/systemd/man/sd_notify.html
Reference: https://github.com/systemd/systemd/blob/master/src/libsystemd/sd-daemon/sd-daemon.c
Amazon Linux 2 has /usr/bin/cpupower, but does not have cpupower.service
unlike CentOS7.
We need to provide the .service file when distribution is Amazon Linux 2.
Fixes#5977
Current sender sends stream_mutation_fragments_cmd::end_of_stream to
receiver when an error is received from a peer node. To be safe, send
stream_mutation_fragments_cmd::error instead of
stream_mutation_fragments_cmd::end_of_stream to prevent end_of_stream to
be written into the sstable when a partition is not closed yet.
In addition, use mutation_fragment_stream_validator to valid the
mutation fragments emitted from the reader, e.g., check if
partition_start and partition_end are paired when the reader is done. If
not, fail the stream session and send
stream_mutation_fragments_cmd::error instead of
stream_mutation_fragments_cmd::end_of_stream to isolate the problematic
sstables on the sender node.
Refs: #6478
"
This series allows for resharding SSTables (if needed) before SSTables are
moved from the upload directory, instead of after.
The infrastructure is supposed to be used soon to also load SSTables at boot.
That, however, will take a bit longer as we need to reshape resharded SSTables
for maximum benefit. That should benefit the upload directory as well, however
the current series already presents high incremental value for upload directory
and could be merged sooner (so I can focus on reshaping).
For now, this series still keep the actual moving from upload directory
to the main directory untouched. Once reshaping is ready, it will take
care of this too.
A new file with tests is introduced that tests the process of reading
SSTables from an existing directory.
dtests executed: migration_test.py (--smp 4), which previously failed
"
* 'upload-reshard-v8.1' of github.com:glommer/scylla:
load_new_sstables: reshard before scanning the upload directory
distributed_load: initial handling of off-strategy SSTables
remove manifest_file filter from table.
sstables: move open-related structures to their own file.
sstables: store data size in foreign_sstable_open_info
compaction: split compaction.hh header
In a later patch we will be able move files directly from upload
into the main directory. However for now, for the benefit of doing
this incrementally, we will first reshard in place with our new
reshard infrastructure.
load_new_sstables can then move the SSTables directly, without having
to worry about resharding. This has the immediate benefit that the
resharding happens:
- in the streaming group, without affecting compaction work
- without waiting for the current locks (which are held by compactions)
in load_new_sstables to release.
We could, at this point, just move the SSTables to the main directory
right away.
I am not doing this in this patch, and opting to keep the rest of upload
process unchanged. This will be fixed later when we enable offstrategy
compactions: we'll then compact the SSTables generated into the main
directory.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Fixes#6561
Pre-image generation in row deletion case only checked if we had a pre-image
result set row. But that can be from post-image. Also check actual existance
of the pre-image CK.
Message-Id: <20200608132804.23541-1-calle@scylladb.com>
Off-strategy SSTables are SSTables that do not conform to the invariants
that the compaction strategies define. Examples of offstrategy SSTables
are SSTables acquired over bootstrap, resharding when the cpu count
changes or imported from other databases through our upload directory.
This patch introduces a new class, sstable_directory, that will
handle SSTables that are present in a directory that is not one of the
directories where the table expects its SSTables.
There is much to be done to support off-strategy compactions fully. To
make sure we make incremental progress, this patch implements enough
code to handle resharding of SSTables in the upload directory. SSTables
are resharded in place, before we start accessing the files.
Later, we will take other steps before we finally move the SSTables into
the main directory. But for now, starting with resharding will not only
allow us to start small, but it will also allow us to start unleashing
much needed cleanups in many places. For instance, once we start
resharding on boot before making the SSTables available, we will be able
to expurge all places in Scylla where, during normal operations, we have
extra handler code for the fact that SSTables could be shared.
Tests: a new test is added and it passes in debug mode.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
When we are scanning an sstable directory, we want to filter out the
manifest file in most situations. The table class has a filter for that,
but it is a static filter that doesn't depend on table for anything. We
are better off removing it and putting in another independent location.
While it seems wasteful to use a new header just for that, this header
will soon be populated with the sstable_directory class.
Tests: unit (dev)
Signed-off-by: Glauber Costa <glauber@scylladb.com>
sstables/sstables.hh is one of our heaviest headers and it's better that we don't
include it if possible. For some users, like distributed_loader, we are mostly
interested in knowing the shape of structures used to open an SSTable.
They are:
- the entry_descriptor, representing an SSTable that we are scanning on-disk
- the sstable_open_info, representing information about a local, opened SSTable
- the foreign_sstable_open_info, representing information about an opened SSTable
that can cross shard boundaries.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
In the new version of resharding we'll want to spread SSTables around
the many shards based on their total size. This means we also need to
know the size of each SSTable individually.
We could wrap the foreign_sstable_info around another structure that
keeps track of that, but because this structure exists mostly for
resharding purposes anyway we will just add the data_size to it.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
compaction.hh is one of our heavy headers, but some users just want to
use information on it about how to describe a compaction, not how to
perform one.
For that reason this patch splits the compaction_descriptor into a new
header.
The compaction_descriptor has, as a member type, compaction_options.
That is moved too, and brings with it the compaction_type. Both of those
structures would make sense in a separate header anyway.
The compaction_descriptor also wants the creator_fn and replacer_fn
functions. We also take this opportunity to rename them into something
more descriptive
Signed-off-by: Glauber Costa <glauber@scylladb.com>
same as redhat, makeself script changes current umask, scylla_setup causes
"scylla does not work with current umask setting (0077)" error.
To fix that we need use latest version of makeself, and specfiy --keep-umask
option.
See #6243
Unlike redhat version, debian version already supported cross build since
it uses debootstrap, but the shellscript rejecting to continue build on
non-debian distribution, so drop these lines to build on Fedora.
[avi: regenerate toolchain]
This is scylla-python3 version of #6611, but we also need to rename
.deb build directory for scylla-python3, since we may lose .deb when
building both scylla and scylla-python3 .deb package, since we currently
sharing build directory.
So renamed it to build/python3/debian.
On 287d6e5, we stopped to rm -rf debian/ on build_deb.sh, since now we have
prebuilt debian/ directory.
However, it might cause .deb build error when we modified debian package source,
since it never cleanup.
To prevent build error, we need to cleanup build/debian on reloc/build_deb.sh,
before extracting contents from relocatable package.
When reclaiming segments to the seastar the code tries to free the segments
sequentially. For this it walks the segments from left to right and frees
them, but every time a non-empty segment is met it gets migrated to another
segment, that's allocated from the right end of the list.
This is waste of cycles sometimes. The destination segment inherits the
holes from the source one, and thus it will be compacted some time in the
future. Why not compact it right at the reclamation time? It will take the
same time or less, but will result in better compaction.
To acheive this, the segment to be reclaimed is compacted with the existing
compact_segment_locked() code with some special care around it.
1. The allocation of new segments from seastar is locked
2. The reclaiming of segments with evict-and-compact is locked as well
3. The emergency pool is opened (the compaction is called with non-empty
reserve to avoid bad_alloc exception throw in the middle of compaction)
4. The segment is forcibly removed from the histogram and the closed_occupancy
is updated just like it is with general compaction
The segments-migration auxiliary code can be removed after this.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The lock disables the segment_pool to call for more segments from
the underlying allocator.
To be used in next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This includes 3 small changes to facilitate next patching:
- rename region::impl::compact into compact_segment_locked
- merging former compact with compact_single_segment_locked
- moving log print and stats update into compact_segment_locked
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
storage_proxy is never deinitialized, so it may have still used cdc_service
after its destructor was called.
This fixes the problem by cdc_service inheriting from
async_sharded_service and storage_proxy calling shared_from_this on
the service whenever it uses it.
cdc_service inherits from async_sharded_service and not simply from
enable_shared_from_this, because there might be other services that
cdc_service depends on. Assuming that these services are
deinitialized after cdc_service (as they should), i.e. after stop() is
called on cdc_service, making cdc_service async_sharded_service will
keep their deinitialization code from being called until all references
to cdc_service disappear (async_sharded_service keeps stop() from
returning until this happens).
Some more improvements should be possible through some refactoring:
1. Make augment_mutation_call a free function, not a member of
cdc_service: it doesn't need any state that cdc_service has.
db_context can be passed down from storage_proxy when it calls the
function.
2. Remove the storage_proxy -> cdc_service reference. storage_proxy
only needs augment_mutation_call, which would not be a part of the
service. This would also get rid of the proxy -> cdc -> proxy
reference cycle that we have now, and would allow storage_proxy to be
safely deinitialized after cdc_service.
3. Maybe we could even remove the cdc_service -> storage_proxy
reference. Is it really needed?
The tracker::impl::reclaim is already in reclaim-locked
section, no need for yet another nested lock.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When calling alloc_small the migrator is passed just to get the
object descriptor, but during compaction the descriptor is already
at hands, so no need to re-get it again.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When a replacing node is in early boot up and is not in HIBERNATE sate
yet, if the node is killed by a user, the node will wrongly send a
shutdown message to other nodes. This is because UNKNOWN is not in
SILENT_SHUTDOWN_STATES, so in gossiper::do_stop_gossiping, the node will
send shutdown message. Other nodes in the cluster will call
storage_service::handle_state_normal for this node, since NORMAL and
SHUTDOWN status share the same status handler. As a result, other nodes
will incorrectly think the node is part of the cluster and the replace
operation is finished.
Such problem was seen in replace_node_no_hibernate_state_test dtest:
n1, n2 are in the cluster
n2 is dead
n3 is started to replace n2, but n3 is killed in the middle
n3 announces SHUTDOWN status wrongly
n1 runs storage_service::handle_state_normal for n3
n1 get tokens for n3 which is empty, because n3 hasn't gossip tokens yet
n1 skips update normal tokens for n3, but think n3 has replaced n2
n4 starts to replace n2
n4 checks the tokens for n2 in storage_service::join_token_ring (Cannot
replace token {} which does not exist!) or
storage_service::prepare_replacement_info (Cannot replace_address {}
because it doesn't exist in gossip)
To fix, we add UNKNOWN into SILENT_SHUTDOWN_STATES and avoid sending
shutdown message.
Tests: replace_address_test.py:TestReplaceAddress.replace_node_no_hibernate_state_test
Fixes: #6436
It seems that the following functions are never used, delete them:
* `function::has_reference_to`
* `functions::get_overload_count`
* `to_identifiers` in column_identifier.hh
* `single_column_relation::get_map_key`
Tests: unit(dev, debug)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200606115149.1770453-1-pa.solodovnikov@scylladb.com>
Merged pull request https://github.com/scylladb/scylla/pull/6516 from
Piotr Sarna:
This series adds error injection points to materialized view paths:
view update generation from staging sstables;
view building;
generating view updates from user writes.
This series comes with a corresponding dtest pull request which adds some
test cases based on error injection.
Fixes#6488
"
Now we generate dist/changelog on relocatable package generation time,
we cannot run '.rc' fixup on .deb package building time, need to do it
in debian_files_gen.py.
Also, we uses '_' in version number for some test version packages,
which does not supported in .deb packaging system, need to replaced
with '-'.
"
* syuu1228-debian_version_number_fix:
dist/debian: support version number containing '_'
dist/debian: move version number fixup to debian_files_gen.py
In current mutate_MV() code it's possible for a local endpoint
to become a target for a network operation. That's the source
of occasional `broken promise` benign error messages appearing,
since the mutation is actually applied locally, so there's no point
in creating a write response handler - the node will not send a response
to itself via network.
While at it, the code is deduplicated a little bit - with the paths
simplified, it's easier to ensure that a local endpoint is never
listed as a target for remote network operations.
Fixes#5459
Tests: unit(dev),
dtest(materialized_views_test.TestMaterializedViews.add_dc_during_mv_insert_test)
Overwriting a collection cell using timestamp T is a process with
following steps:
1. inserting a row marker (if applicable) with timestamp T;
2. writing a collection tombstone with timestamp T-1;
3. writing the new collection value with timestamp T.
Since CDC does clustering of the operations by timestamp, this
would result in 3 separate calls to `transform` (in case of
INSERT, or 2 - in the case of UPDATE), which seems excessive,
especially when pre-/postimage is enabled. This patch makes
collection tombstones being treated as if they had the same TS as
the base write and thus they are processed in one call to `transform`
(as long as TTLs are not used).
Also, `cdc_test` had to be updated in places that relied on former
splitting strategy.
Fixes#6084
For tombstone expiration to proceed correctly without the risk of resurrecting
data, the sstable set must be present.
Regular compaction and derivatives provide the sstable set, so they're able
to expire tombstones with no resurrection risk.
Resharding, on the other hand, can run on any shard, not necessarily on the
same shard that one of the input sstables belongs to, so it currently cannot
provide a sstable set for tombstone expiration to proceed safely.
That being said, let's only do expiration based on the presence of the set.
This makes room for the sstable set to be feeded to compaction via descriptor,
allowing even resharding to do expiration. Currently, compaction thinks that
sstable set can only come from the table, and that also needs to be changed
for further flexibility.
It's theoretically possible that a given resharding job will resurrect data if
a fully expired SSTable is resharded at a shard which it doesn't belong to.
Resharding will have no way to tell that expiring all that data will lead to
resurrection because the relevant SSTables are at different shards.
This is fixed by checking for fully expired sstables only on presence of
the sstable set.
Fixes#6600.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200605200954.24696-1-raphaelsc@scylladb.com>
The command regenerates streams when:
- generations corresponding to a gossiped timestamp cannot be
fetched from `system_distributed` table,
- or when generation token ranges do not align with token metadata.
In such case the streams are regenerated and new timestamp is
gossiped around. The returned JSON is always empty, regardless of
whether streams needed regeneration or not.
Now we generate dist/changelog on relocatable package generation time,
we cannot run '.rc' fixup on .deb package building time, need to do it
in debian_files_gen.py.
Commit 968177da04 has changed the schema
of cdc_topology_description and cdc_description tables in the
system_distributed keyspace.
Unfortunately this was a backwards-incompatible change: these tables
would always be created, irrespective of whether or not "experimental"
was enabled. They just wouldn't be populated with experimental=off.
If the user now tries to upgrade Scylla from a version before this change
to a version after this change, it will work as long as CDC is protected
b the experimental flag and the flag is off.
However, if we drop the flag, or if the user turns experimental on,
weird things will happen, such as nodes refusing to start because they
try to populate cdc_topology_description while assuming a different schema
for this table.
The simplest fix for this problem is to rename the tables. This fix must
get merged in before CDC goes out of experimental.
If the user upgrades his cluster from a pre-rename version, he will simply
have two garbage tables that he is free to delete after upgrading.
sstables and digests need to be regenerated for schema_digest_test since
this commit effectively adds new tables to the system_distributed keyspace.
This doesn't result in schema disagreement because the table is
announced to all nodes through the migration manager.
from Juliusz.
CDC for counters is unimplemented as of now,
therefore any attempt to enable CDC log on counter
table needs to be clearly disallowed. This patch does
exactly this.
The check whether schema has counter columns
is performed in `cdc_service::impl` in:
- `on_before_create_column_family`,
- `on_before_update_column_family`
and, if so, results in `invalid_request_exception` thrown.
Fixes#6553
* jul-stas-6553-disallow-cdc-for-counters:
test/cql: Check that CDC for counters is disallowed
CDC: Disallowed CDC for tables with counter column(s)
This patch adds a test reproducing issue #6572, where the perfectly
good condition expression:
#name1 = :val1 OR #name2 = :val2
Gets refused because of the following combination in our implementation:
1. Short-circuit evaluation, i.e., after we discover #name1 = :val1
we don't evaluate the second half of the expression.
2. The list of "used" references is collected at evaluation time,
instead of at parsing time. Because evaluation never reaches
#name2 (or :val2) our implementation complains that they are not
used, and refuses the request - which should have been allowed.
This test xfails on Alternator. It passes on DynamoDB.
Refs #6572
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200604171954.444291-1-nyh@scylladb.com>
While not very interesting by itself, the test case shows
that in case of TagResource and UntagResource it's actually correct
to return empty HTTP body instead of an empty JSON object,
which was the case for PutItem.
Message-Id: <6331963179c5174a695f0e9eeed17de6c9f9a3be.1591269516.git.sarna@scylladb.com>
The DynamoDB GetItem request returns the requested item in a specific way,
wrapped in a map with a "Item" member. For historic reasons, we used the
same function that returns this (describe_item()) also in other code which
reads items - e.g. for checking conditional operations. The result is
wasteful - after adding this "Item" member we had other code to extract it,
all for no good reason. It is also ugly and confusing.
Importantly, this situation also makes it harder for me to add support for
FilterExpression. The issue is that the expression evaluator got the item
with the wrapper (from the existing ConditionExpression code) but the
filtering code had it without this wrapper, as it didn't use describe_item().
So this patch uses describe_single_item(), which doesn't add the wrapper
map, instead of describe_item(). The latter function is used just once -
to implement GetItem. The unnecessary code to unwrap the item in multiple
places was then dropped.
All the tests still pass. I also tested test_expected.py in unsafe_rmw write
isolation mode, because code only for this mode had to be modified as well.
Refs #5038.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200604092050.422092-1-nyh@scylladb.com>
Correct the compatibility section in docs/alternator/alternator.md:
Filtering of Scan/Query results using the older syntax (ScanFilter,
QueryFilter) is, after commit bea9629031,
now fully supported. The newer syntax (FilterExpression) is not yet.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200604073207.416860-1-nyh@scylladb.com>
Recently ./reloc/build_deb.sh started failing with
dpkg-source: info: using source format '1.0'
dpkg-source: info: building scylla-python3 using existing scylla-python3_3.8.3-0.20200604.77dfa4f15.orig.tar.gz
dpkg-source: info: building scylla-python3 in scylla-python3_3.8.3-0.20200604.77dfa4f15-1.diff.gz
dpkg-source: error: cannot represent change to scylla-python3/lib64/python3.8/site-packages/urllib3/packages/backports/__pycache__/__init__.cpython-38.pyc:
dpkg-source: error: new version is plain file
dpkg-source: error: old version is symlink to /usr/lib/python3.8/site-packages/__pycache__/six.cpython-38.pyc
dpkg-source: error: unrepresentable changes to source
dpkg-buildpackage: error: dpkg-source -b . subprocess returned exit status 1
debuild: fatal error at line 1182:
Those files are not in fact symlinks, so it's clear that dpkg is confused
about something. Rather than debug dpkg, however, it's easier to just
drop __pycache__ directories. These hold the result of bytecode
compilation and are therefore optional, as Python will compile the sources
if the cache is not populated.
Fixes#6584.
In 28c3d4 `out()` was used without `shell=True` and was the spliting of arguments
failed cause of the complex commands in the cmd (pipe and such)
Fixes#6159
"
The new seastar api changes make_file_output_stream and
make_file_data_sink to return futures. This series includes a few
refactoring patches and the actual transition.
"
* 'espindola/api-v3-v3' of https://github.com/espindola/scylla:
table: Fix indentation
everywhere: Move to seastar api level 3
sstables: Pass an output_stream to make_compressed_file_.*_format_output_stream
sstables: Pass a data_sink to checksummed_file_writer's constructor
sstables: Convert a file_writer constructor to a static make
sstables: Move file_writer constructor out of line
This is a bit simpler as we don't have to pass in the options and
moves the calls to make_file_output_stream to places where we can
handle futures.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
checksummed_file_writer cannot be moved, so we can't have a
checksummed_file_writer::make that returns a future. So instead we
pass in a data_sink and let the callers call make_file_data_sink.
This is in preparation for make_file_data_sink returning a future in
the seastar api v3.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
For now it always returns a ready future. This is in preparation for
using seastar v3 api where make_file_output_stream returns a future.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Until we get implementation of CDC for counters, we explicitly
disallow it. The check is performed in `cdc_service::impl` in:
- `on_before_create_column_family`,
- `on_before_update_column_family`
and results in `invalid_request_exception` thrown.
* seastar 9066edd512...42e770508c (15):
> Revert "sharded: constrain sharded::map_reduce0"
> tls: Fix race/unhandled case in reloadable_certificates
> fair_queue: rename operator< to strictly_less
> future: Add a current_exception_future_marker
> Merge "Avoid passing non nothrow move constructible lambdas to future::then" from Rafael
> tls_echo_server_demo: main: capture server post stop()
> tests: fstream: remove obsolete comments about running in background
> everywhere: Reopen inline namespaces as inline
> Merge "Merge the two do_with implementations" from Rafael
> sharded: constrain sharded::map_reduce0
> Merge "Backtracing across tasks" from Tomasz
> posix-stack: fix strict aliasing violations on CMSG_DATA(cmsghdr)
> sharded: unify invoke_on_*() variants
> sharded_parameter_demo: Delete unused member variable
> futures_test: Fix delete of copy constructor
The querier cache expects all querier objects it stores to have certain
methods. To avoid accessing these via `std::visit()` (the querier object
is stored in an `std::variant`), we move all the stuff that is common to
all querier types into a base class. The querier cache now accesses the
members via a reference to this common base. Additionally the variant is
eliminated completely and the cache entry stores an
`std::unique_ptr<querier_base>` instead.
Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200603152544.83704-1-bdenes@scylladb.com>
After 7f1a215, a sstable is only added to backlog tracker if
sstable::shared() returns true.
sstable::shared() can return true for a sstable that is actually owned
by more than one shard, but it can also incorrectly return true for
a sstable which wasn't made explicitly unshared through set_unshared().
A recent work of mine is getting rid of set_unshared() because a
sstable has the knowledge to determine whether or not it's shared.
The problem starts with streaming sstable which hasn't set_unshared()
called for it, so it won't be added to backlog tracker, but it can
be eventually removed from the tracker when that sstable is compacted.
Also, it could happen that a shared sstable, which was resharded, will
be removed from the tracker even though it wasn't previously added.
When those problems happen, backlog tracker will have an incorrect
account of total bytes, which leads it to producing incorrect
backlogs that can potentially go negative.
These problems are fixed by making every add / removal go through
functions which take into account sstable::shared().
Fixes#6227.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200512220226.134481-2-raphaelsc@scylladb.com>
New SStables are only added to backlog tracker if set_unshared() was
called on their behalf. SStables created for streaming are not being
added to the tracker because make_streaming_sstable_for_write()
doesn't call set_unshared() nor does it caller. Which results in backlog
not accounting for their existence, which means backlog will be much
lower than expected.
This problem could be fixed by adding a set_unshared() call but it
turns out we don't even need set_unshared() anymore. It was introduced
when Scylla metadata didn't exist, now a SSTable has built-in knowledge
of whether or not it's shared. Relying on every SSTable creator calling
set_unshared() is bug prone. Let's get rid of it and let the SStable
itself say whether or not it's shared. If an imported SSTable has not
Scylla metadata, Scylla will still be able to compute shards using
token range metadata.
Refs #6021.
Refs #6227.
Fixes#6441.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200512220226.134481-1-raphaelsc@scylladb.com>
Add Paxos error injections before/after save promise, proposal, decision,
paxos_response_handler, delete decision.
Adds a method to inject an error providing a lambda while avoiding to add
a continuation when the error injection is disabled.
For this provide error exception and enter() to allow flow control (i.e. return)
on simple error injections without lambdas.
Also includes Pavel's patch for CQL API for error injections, updated to
current error injection API and added one_shot support. Also added some
basic CQL API boost tests.
For CQL API there's a limitation of the current grammar not supporting
f(<terminal>) so values have to be inserted in a table until this is
resolved. See #5411
* https://github.com/alecco/scylla/tree/error_injection_v11:
paxos: fix indentation
paxos: add error injections
utils: add timeout error injection with lambda
utils: error injection add enter() for control flow
utils: error injections provide error exceptions
failure_injector: implement CQL API for failure injector class
lwt: fix disabled error injection templates
Even if there are no attributes to return from PutItem requests,
we should return a valid JSON object, not an empty string.
Fixes#6568
Tests: unit(dev)
Client libraries (e.g. PynamoDB) expect the UnprocessedKeys
and UnprocessedItems attributes to appear in the response
unconditionally - it's hereby added, along with a simple test case.
Fixes#6569
Tests: unit(dev)
Even though calling then() on a ready future does not allocate a
continuation, calling then on the result of it will allocate.
This error injection only adds a continuation in the dependency
chain if error injections are enabled at compile timeand this particular
error injection is enabled.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
For control flow (i.e. return) and simplicity add enter() method.
For disabled injections, this method is const returning false,
therefore it has no overhead.
Add boost test.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
This patch implements the missing QueryFilter (and ScanFilter)
functionality:`
1. All operators. Previously, only the "EQ" operator was implemented.
2. Either "OR" or "AND" of conditions (previously only "AND").
3. Correctly returning Count and ScannedCount for post-filter and
pre-filter item counts, respectively.
All of the previously-xfailing tests in test_query_filter.py are now
passing.
The implementation in this patch abandons our previous attempts to
translate the DynamoDB API filters into Scylla's CQL filters.
Doing this correctly for all operators would have been exceedingly
difficult (for reasons explained in #5028), and simply not worth the
effort: CQL's filters receive a page of results and then filter them,
and we can do exactly the same without CQL's filters:
The new code just retrieves an unfiltered page of items, and then for
each of these items checks whether it passes the filters. The great thing
is that we already had code for this checking - the QueryFilter syntax is
identical to the "Expected" syntax (for conditional operations) that
we already supported, so we already had code for checking these conditions,
including all the different operators.
This patch prepares for the future need to support also the newer
FilterExpression syntax (see issue #5038), and the "filter" class
supports either type of filter - the implementation for the second
syntax is just missing and can be added (fairly easily) later.
Fixes#5028.
Refs #5038.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200603110118.399325-1-nyh@scylladb.com>
This fixes a bug in CDC mutation augmentation logic. A lambda that is
called for each partition key in a batch captures a trace state pointer,
but moves it out after being called for the first time. This caused CDC
tracing information to be included only for one of the partition keys
of the batch.
Fixes#6575
We implemented the order operators (LT, GT, LE, GE, BETWEEN) incorrectly
for binary attributes: DynamoDB requires that the bytes be treated as
unsigned for the purpose of order (so byte 128 is higher than 127), but
our implementation uses Scylla's "bytes" type which has signed bytes.
The solution is simple - we can continue to use the "bytes" type, but
we need to use its compare_unsigned() function, not its "<" operator.
This bug affected conditional operations ("Expected" and
"ConditionExpression") and also filters ("QueryFilter", "ScanFilter",
"FilterExpression"). The bug did *not* affect Query's key conditions
("KeyConditions", "KeyConditionExpression") because those already
used Scylla's key comparison functions - which correctly compare binary
blobs as unsigned bytes (in fact, this is why we have the
compare_unsigned() function).
The patch also adds tests that reproduce the bugs in conditional
operations, and show that the bug did not exist in key conditions.
Fixes#6573
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200603084257.394136-1-nyh@scylladb.com>
To make unified relocatable package easily, we may want to merge tarballs to single tarball like this:
zcat *.tar.gz | gzip -c > scylla-unified.tar.xz
But it's not possible with current relocatable package format, since there are multiple files conflicts, install.sh, SCYLLA-*-FILE, dist/, README.md, etc..
To support this, we need to archive everything in the directory when building relocatable package.
This is modifying relocatable package format, we need to provide a way to
detect the format version.
To do this, we added a new file ".relocatable_package_version" on the top of the
archive, and set version number "2" to the file.
Fixes#6315
We generate a coredump as part of "scylla_coredump_setup" to verify that
coredumps are working. However, we need to *remove* that test coredump
to avoid people and test infrastructure reporting those coredumps.
Fixes#6159
This test (which passes successfully on both Alternator and DynamoDB)
was written to confirm our understanding of how the *paging* feature
works.
Our understanding, based on DynamoDB documentation, has been that the
"Limit" parameter determines the number of pre-filtering items, *not*
the actual number of items returned after having passed the filter.
So the number of items actually returned may be lower than Limit - in
some cases even zero.
This test tries an extreme case: We scan a collection of 20 items with
a filter matching only 10 (or so) of them, with Limit=1, and count
the number of pages that we needed to request until collecting all these
10 (or so) matches. We note that the result is 21 - i.e., DynamoDB and
Alternator really went through the 20 pre-filtering items one by one,
and for the items which didn't match the filter returned an empty page.
The last page (the 21st) is always empty: DynamoDB or Alternator doesn't
know whether or not there is a 21st item, and it takes a 21st request
to discover there isn't.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200602145015.361694-1-nyh@scylladb.com>
This test reproduces a bug in the current implementation of
QueryFilter, which returns for ScannedCount the count of
post-filter items, whereas it should return the pre-filter
count.
The test tests both ScannedCount and Count, when QueryFilter
is used and when it isn't used.
The test currently xfails on Alternator, passes on DynamoDB.
Refs #5028
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200602125924.358636-1-nyh@scylladb.com>
"Currently in coredump setup, we enabled a systemd mount to mount default
coredump directory to /var/lib/scylla/coredump, but we didn't start it.
So the coredump will still be saved to default coredump directory
before a system reboot, it might touch enospc problem.
One patch started the systemd mount during coredump setup, and make the
mount effect. Another patch improved the error message of systemd
unit, it's confused when the unit config is invalid."
Fixes#6566
* 'coredump_conf' of git://github.com/amoskong/scylla:
scylla_util/systemd_unit: improve the error message
active the coredump directory mount during coredump setup
we always raise exception 'Unit xxx not found' when exception is raised in
executing 'systemctl cat xxx'. Sometimes the error is confused.
On OEL7, the 'systemctl cat var-lib-systemd-coredump.mount' will also verify
the config content, scylla_coredump_setup failed for that the config file
is invalid, but the error is 'unit var-lib-systemd-coredump.mount not found'.
This patch improved the error message.
Related issue: https://github.com/scylladb/scylla/issues/6432
Currently we use a systemd mount (var-lib-systemd-coredump.mount) to mount
default coredump directory (/var/lib/systemd/coredump) to
(/var/lib/scylla/coredump). The /var/lib/scylla had been mounted to a big
storage, so we will have enough space for coredump after the mount.
Currently in coredump_setup, we only enabled var-lib-systemd-coredump.mount,
but not start it. The directory won't be mounted after coredump_setup, so the
coredump will still be saved to default coredump directory.
The mount will only effect after reboot.
Fixes#6566
This reverts commit e77dad3adf because its
incorrect.
Amos explains:
"Quote from https://www.freedesktop.org/software/systemd/man/systemd.mount.html
What=
Takes an absolute path of a device node, file or other resource to
mount. See mount(8) for details. If this refers to a device node, a
dependency on the respective device unit is automatically created.
Where=
Takes an absolute path of a file or directory for the mount point; in
particular, the destination cannot be a symbolic link. If the mount
point does not exist at the time of mounting, it is created as
directory.
So the mount point is '/var/lib/systemd/coredump' and
'/var/lib/scylla/coredump' is the file to mount, because /var/lib/scylla
had mounted a second big storage, which has enough space for Huge
coredumps.
Bentsi or other touched problem with old scylla-master AMI, a coredump
occurred but not successfully saved to disk for enospc. The directory
/var/lib/systemd/coredump wasn't mounted to /var/lib/scylla/coredump.
They WRONGLY thought the wrong mount was caused by the config problem,
so he posted a fix.
Actually scylla-ami-setup / coredump wasn't executed on that AMI, err:
unit scylla-ami-setup.service not found Because
'scylla-ami-setup.service' config file doesn't exist or is invalid.
Details of my testing: https://github.com/scylladb/scylla/issues/6300#issuecomment-637324507
So we need to revert Bentsi's patch, it changed the right config to wrong."
The comparison operator (<=>) default implementation happens to exactly
match tombstone::compare(), so use the compiler-generated defaults. Also
default operator== and operator!= (these are not brought in by operator<=>).
These become slightly faster as they perform just an equality comparison,
not three-way compare.
shadowable_tombstone and row_tombstone depend on tombstone::compare(),
so convert them too in a similar way.
with_relational_operations.hh becomes unused, so delete it.
Tests: unit (dev)
Message-Id: <20200602055626.2874801-1-avi@scylladb.com>
Seastar recently lost support for the experimental Concepts Technical
Specification (TS) and gained support for C++20 concepts. Re-enable
concepts in Scylla by updating our use of concepts to the C++20
standard.
This change:
- peels off uses of the GCC6_CONCEPT macro
- removes inclusions of <seastar/gcc6-concepts.hh>
- replaces function-style concepts (no longer supported) with
equation-style concepts
- semicolons added and removed as needed
- deprecated std::is_pod replaced by recommended replacement
- updates return type constraints to use concepts instead of
type names (either std::same_as or std::convertible_to, with
std::same_as chosen when possible)
No attempt is made to improve the concepts; this is a specification
update only.
Message-Id: <20200531110254.2555854-1-avi@scylladb.com>
Merged patch series by Piotr Sarna:
This series migrates the regex-based implementation of big decimal
parsing to a more efficient one, based on string views.
The series originated as a single patch, but was later
extended by more tests and a microbenchmark.
Perf results, comparing the old implementation, the new one,
and the experimental one from v2 of this series are here:
test iterations median mad min max
Regex: 88895 11.228us 25.891ns 11.202us 11.510us
String view: 232334 4.303us 21.660ns 4.282us 4.736us
State machine (experimental, ditched):
148318 6.723us 51.896ns 6.672us 6.877us
Tests: unit(dev)
Piotr Sarna (4):
big_decimal: migrate to string views
test: add test cases to big_decimal_test
test/lib: add generating random numeric string
test: add big_decimal perf test
configure.py | 1 +
test/boost/big_decimal_test.cc | 29 +++++++++++++++++++
test/lib/make_random_string.hh | 11 +++++++
test/perf/perf_big_decimal.cc | 52 ++++++++++++++++++++++++++++++++++
utils/big_decimal.cc | 51 ++++++++++++++++++++++-----------
5 files changed, 127 insertions(+), 17 deletions(-)
Test cases for big decimals were quite complete, but since the
implementation was recently changed, some corner cases are added:
- incorrect strings
- numbers not fitting into uint64_t
- numbers less than uint64_t::max themselves, but with the unscaled
value exceeding the maximum
Big decimals are, among other use cases, used as a main number
type for alternator, and as such can appear on the fast path.
Parsing big decimals was performed via std::regex, which is not
precisely famous for its speeds, and also enforces unnecessary
string copying. Therefore, the implementation is replaced
with an open-coded version based on string_views.
One previous iteration of this series also included
a hand-coded state machine implementation, but it proved
to be slower than the slightly naive string_view one.
Overall, execution time is reduced by 61.6% according to
microbenchmarks, which sounds like a promising improvement.
Perf results:
test iterations median mad min max
Regex (original):
big_decimal_test.from_string 88895 11.228us 25.891ns 11.202us 11.510us
String view (new):
big_decimal_test.from_string 232334 4.303us 21.660ns 4.282us 4.736us
State machine (experimental, ditched):
big_decimal_test.from_string 148318 6.723us 51.896ns 6.672us 6.877us
Tests: unit(dev + release(big_decimal_test))
Get the table names from the table ids instead which prevents the user
of repair_info class provides inconsistent table names and table ids.
Refs: #5942
Now that repair_info has tables id for the tables we want to repair. Use
table_id instead of table_name in row level repair to find a table. It
guarantees we repair the same table even if a table is dropped and a new
table is created with the same name.
Refs: #5942
A helper get_table_ids is added to convert the table names to table ids.
We convert it once and use the same table ids for the whole repair
operations. This guarantees we repair the same table during the same
repair request.
Refs: #5942
"
This is a combined set of tiny cleanups that has been
collected for the past few monthes. Mostly about removing
storage_service.hh inclusions here and there.
tests: unit(dev), headers compilation
"
* 'br-storage-service-cleanups-a' of https://github.com/xemul/scylla:
storage_service: Remove some inclusions of its header
storage_service: Move get_generation_number to util/
streaming: Get local db with own helper
streaming: Fix indentation after previous patch
streaming: Do not explicitly switch sched group
This is purely utility helper routine. As a nice side effect the
inclusion of storage_service.hh is removed from several unrelated
places.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a static global instance of needed services and helpers
for it in streaming code. This is not great to use them, but at
least this change unifies different pieces of streaming code and
removes the storage_service.hh from streaming_session.cc (the
streaming_sessio.hh doesn't include it either).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is continuation of ac998e95 -- the sched group is
switched by messaging service for a verb, no need to do
it by hands.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Boost test macros are not thread safe, using them from multiple threads
results in garbled XML test report output.
3f1823a4f0 replaced most of the
thread-unsafe boost test macros in multishard_mutation_query_test, but
one still managed to slip through the cracks. This patch removes that as
well.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200529130706.149603-3-bdenes@scylladb.com>
since dbuild was updated to fedora-32, hence to python3.8
`platform.dist()` is deprecated, and need to be replaced
Fixes: #6501
[avi: folded patch with install-dependencies.sh change]
[avi: regenerated toolchain]
Sadly, std::ranges is missing an equivalent of boost::copy_range(), so
we introduce a replacement: ranges::to(). There is an existing proposal
to introduce something similar to the standard library:
std::ranges::to() (https://github.com/cplusplus/papers/issues/145). We
name our own version similarly, so if said proposal makes it in we can
just prepend std:: and be good.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200529141407.158960-2-bdenes@scylladb.com>
This patch changes the signatures of `test_assignment` and
`test_all` functions to accept `cql3::column_specification` by
const reference instead of shared pointer.
Mostly a cosmetic change reducing overall shared_ptr bloat in
cql3 code.
Tests: unit(dev, debug)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200529195249.767346-1-pa.solodovnikov@scylladb.com>
Compaction is checking for abortion whenever it's consuming a new partition.
The problem with this approach is that the abortion can take too long if
compaction is working with really large partitions. If the current partition
takes minutes to be compacted, it means that abortion may be delayed by
a factor of minutes as well.
Truncate, for example, relies on this abortion mechanism, so it could happen
that the operation would take much longer than expected due to this
ineffiency, probably result in timeouts in the user side.
To fix this, it's clear that we need to increase the frequency at which
we check for abortion requests. More precisely, we need to do it not only
on partition granularity, but also on row granularity.
Fixes#6309.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200529172847.44444-1-raphaelsc@scylladb.com>
The timer.stop() call, that reports not only the time-taken, but also
the reclaimation rate, was unintentionally dropped while expanding its
scope (c70ebc7c).
Take it back (and mark the compact_and_evict_locked as private while
at it).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200528185331.10537-1-xemul@scylladb.com>
Currently test.py has three different places it checks whether stdout is
a tty. This patch centralizes these into a single global variable. This
ensures consistency and makes it easier to override it later with a
command-line switch, should we want to.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200529101124.123925-1-bdenes@scylladb.com>
Instead of doing 3 smp::invoke_on_all-s and duplicating
tracker::impl API for the tracker itself, introduce the
tracker::configure, simplify the tracker configuration
and narrow down the public tracker API.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200528185442.10682-1-xemul@scylladb.com>
"
Currently we classify queries as "system" or "user" based on the table
they target. The class of a query determines how the query is treated,
currently: timeout, limits for reverse queries and the concurrency
semaphore. The catch is that users are also allowed to query system
tables and when doing so they will bypass the limits intended for user
queries. This has caused performance problems in the past, yet the
reason we decided to finally address this is that we want to introduce a
memory limit for unpaged queries. Internal (system) queries are all
unpaged and we don't want to impose the same limit on them.
This series uses scheduling groups to distinguish user and system
workloads, based on the assumption that user workloads will run in the
statement scheduling group, while system workloads will run in the main
(or default) scheduling group, or perhaps something else, but in any
case not in the statement one. Currently the scheduling group of reads
and writes is lost when going through the messaging service, so to be
able to use scheduling groups to distinguish user and system reads this
series refactors the messaging service to retain this distinction across
verb calls. Furthermore, we execute some system reads/writes as part of
user reads/writes, such as auth and schema sync. These processes are
tagged to run in the main group.
This series also centralises query classification on the replica and
moves it to a higher level. More specifically, queries are now
classified -- the scheduling group they run in is translated to the
appropriate query class specific configuration -- on the database level
and the configuration is propagated down to the lower layers.
Currently this query class specific configuration consists of the reader
concurrency semaphore and the max memory limit for otherwise unlimited
queries. A corollary of the semaphore begin selected on the database
level is that the read permit is now created before the read starts. A
valid permit is now available during all stages of the read, enabling
tracking the memory consumption of e.g. the memtable and cache readers.
This change aligns nicely with the needs of more accurate reader memory
tracking, which also wants a valid permit that is available in every layer.
The series can be divided roughly into the following distinct patch
groups:
* 01-02: Give system read concurrency a boost during startup.
* 03-06: Introduce user/system statement isolation to messaging service.
* 07-13: Various infrastructure changes to prepare for using read
permits in all stages of reads.
* 14-19: Propagate the semaphore and the permit from database to the
various table methods that currently create the permit.
* 20-23: Migrate away from using the reader concurrency semaphore for
waiting for admission, use the permit instead.
* 24: Introduce `database::make_query_config()` and switch the database
methods needing such a config to use it.
* 25-31: Get rid of all uses of `no_reader_permit()`.
* 32-33: Ban empty permits for good.
* 34: querier_cache: use the queriers' permits to obtain the semaphore.
Fixes: #5919
Tests: unit(dev, release, debug),
dtest(bootstrap_test.py:TestBootstrap.start_stop_test_node), manual
testing with a 2 node mixed cluster with extra logging.
"
* 'query-class/v6' of https://github.com/denesb/scylla: (34 commits)
querier_cache: get semaphore from querier
reader_permit: forbid empty permits
reader_permit: fix reader_resources::operator bool
treewide: remove all uses of no_reader_permit()
database: make_multishard_streaming_reader: pass valid permit to multi range reader
sstables: pass valid permits to all internal reads
compaction: pass a valid permit to sstable reads
database: add compaction read concurrency semaphore
view: use valid permits for reads from the base table
database: use valid permit for counter read-before-write
database: introduce make_query_class_config()
reader_concurrency_semaphore: remove wait_admission and consume_resources()
test: move away from reader_concurrency_semaphore::wait_admission()
reader_permit: resource_units: introduce add()
mutation_reader: restricted_reader: work in terms of reader_permit
row_cache: pass a valid permit to underlying read
memtable: pass a valid permit to the delegate reader
table: require a valid permit to be passed to most read methods
multishard_mutation_query: pass a valid permit to shard mutation sources
querier: add reader_permit parameter and forward it to the mutation_source
...
GC writer, used for incremental compaction, cannot be currently used if interposer
consumer is used. That's because compaction assumes that GC writer will be operated
only by a single compaction writer at a given point in time.
With interposer consumer, multiple writers will concurrently operate on the same
GC writer, leading to race condition which potentially result in use-after-free.
Let's disable GC writer if interposer consumer is enabled. We're not losing anything
because GC writer is currently only needed on strategies which don't implement an
interposer consumer. Resharding will always disable GC writer, which is the expected
behavior because it doesn't support incremental compaction yet.
The proper fix, which allows GC writer and interposer consumer to work together,
will require more time to implement and test, and for that reason, I am postponing
it as #6472 is a showstopper for the current release.
Fixes#6472.
tests: mode(dev).
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Reviewed-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20200526195428.230472-1-raphaelsc@scylladb.com>
The ScanFilter and QueryFilter features are only partially implemented.
Most of their unimplemented features cause clear errors telling the user
of the unimplemented feature, but one exception is the ConditionalOperator
parameter, which can be used to "OR", instead of the default "AND", of
several conditions. Before this patch, we simply ignored this parameter -
causing wrong results to be returned instead of an error.
In this patch, ScanFilter and QueryFilter parse, instead of ignoring, the
ConditionalOperator. The common implementation, get_filtering_restrictions(),
still does not implement the OR case, but returns an error if we reach
this case instead of just ignoring it.
There is no new test. The existing test_query_filter.py::test_query_filter_or
xfailed before this patch, and continues to xfail after it, but the failure
is different (you can see it by running the test with "--runxfail"):
Before this patch, the failure was because of different results. After this
patch, the failure is because of an "unimplemented" error message.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200528214721.230587-2-nyh@scylladb.com>
The code for parsing the ConditionalOperator attribute was used once in
for the "Expected" case, but we will also need it for the "QueryFilter" and
"ScanFilter" cases, so let's extract it into a function,
get_conditional_operator().
While doing this extraction, I also noticed a bug: when Expected is missing,
ConditionalOperator should not be allowed. We correctly checked the case
of an empty Expected, but forgot to also check the case of a missing
Expected. So the new code also fixes this corner case, and we include
a new test case for it (which passes on DynamoDB and used to fail in
Alternator but passes after this patch).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200528214721.230587-1-nyh@scylladb.com>
I just hit a circularity in header inclusion that I traced back to the
fact that schema.hh includes compaction_strategy.hh. schema.hh is in
turn included in lots of places, so a circularity is not hard to come
by.
The schema header really only needs to know about the compaction_type,
so it can inform schema users about it. Following the trend in header
clenups, I am moving that to a separate header which will both break
the circularity and make sure we are included less stuff that is not
needed.
With this change, Scylla fails to compile due to a new missing forward
declaration at index/secondary_index_manager.hh, so this is fixed.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20200527172203.915936-1-glauber@scylladb.com>
Until now, view updates were generated with a bunch of random
time points, because the interface was not adjusted for passing
a single time point. The time points were used to determine
whether cells were alive (e.g. because of TTL), so it's better
to unify the process:
1. when generating view updates from user writes, a single time point
is used for the whole operation
2. when generating view updates via the view building process,
a single time point is used for each build step
NOTE: I don't see any reliable and deterministic way of writing
test scenarios which trigger problems with the old code.
After #6488 is resolved and error injection is integrated
into view.cc, tests can be added.
Fixes#6429
Tests: unit(dev)
Message-Id: <f864e965eb2e27ffc13d50359ad1e228894f7121.1590070130.git.sarna@scylladb.com>
The following UDFs are defined to control failure injector API usage:
* enable_injection(name, args)
* disable_injection(name)
All arguments have string type.
As currently function(terminal) is not supported by the parser,
the arguments must come from selected rows.
Added boost test for CQL API.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Fix disabled injection templates to match enabled ones.
Fix corresponding test to not be a continuation.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Currently the `querier_cache` is passed a semaphore during its
construction and it uses this semaphore to do all the inactive reader
registering/unregistering. This is inaccurate as in theory cached reads
could belong to different semaphores (although currently this is not yet
the case). As all queriers store a valid permit now, use this
permit to obtain the semaphore the querier is associated with, and
register the inactive read with this semaphore.
Remove `no_reader_permit()` and all ways to create empty (invalid)
permits. All permits are guaranteed to be valid now and are only
obtainable from a semaphore.
`reader_permit::semaphore()` now returns a reference, as it is
guaranteed to always have a valid semaphore reference.
We will soon require a valid permit for all reads, including low level
index reads. The sstable layer has several internal reads which can not
be associated with either the user or the system read semaphores or it
would be very hard to obtain the correct semaphore, for limited/no gain.
To be able to pass a valid permit still, we either expose a permit
parameter so upper layers can pass down one, or create a local semaphore
for these reads and use that to obtain a permit.
The following methods now require a permit to be passed to them:
* `sstables::sstabe::read_data()`: only used in tests.
The following methods use internal semaphores:
* `sstables::sstable::generate_summary()` used when loading an sstable.
* `sstables::sstable::has_partition_key()`: used by a REST API method.
All reads will soon require a valid permit, including those done during
compaction. To allow creating valid permits for these reads create a
compaction specific semaphore. This semaphore is unlimited as compaction
concurrency is managed by higher level layer, we use just for resource
usage accounting.
View update generation involves reading existing values from the base
table, which will soon require a valid permit to be passed to it, so
make sure we create and pass a valid permit to these reads.
We use `database::make_query_class_config()` to obtain the semaphore for
the read which selects the appropriate user/system semaphore based on
the scheduling group the base table write is running in.
Counter writes involve a read-before-write, which will soon require a
valid permit to be passed to it, so make sure we create and pass a valid
permit to this read. We use `database::make_query_class_config()` to
obtain the semaphore for the read which selects the appropriate
user/system semaphore based on the scheduling group the counter write is
running in.
And use it to obtain any query-class specific configuration that was
obtained from `table::config` before, such as the read concurrency
semaphore and the max memory limit for unlimited queries. As all users
of these items get these from the query class config now, we can remove
them from `table::config`.
Permits are now created with `make_permit()` and code is using the
permit to do all resource consumption tracking and admission waiting, so
we can remove these from the semaphore. This allows us to remove some
now unused code from the permit as well, namely the `base_cost` which
was used to track the resource amount the permit was created with. Now
this amount is also tracked with a `resource_units` RAII object, returned
from `reader_permit::wait_admission()`, so it can be removed. Curiously,
this reduces the reader permit to be glorified semaphore pointer. Still,
the permit abstraction is worth keeping, because it allows us to make
changes to how the resource tracking part of the semaphore works,
without having to change the huge amount of code sites passing around
the permit.
And use the reader_permit for this instead. This refactoring has
revealed a pre-existing bug in the `test_lifecycle_policy`, which is
also addressed in this patch. The bug is that said policy executes
reader destructions in the background, and these are not waited for. For
some reason, the semaphore -> permit transition pushes these races over
the edge and we start seeing some of these destruction fibers still
being unfinished when test scopes are exited, causing all sorts of
trouble. The solution is to introduce a special gate that tests can use
to wait for all background work to finish, before the test scope is
exited.
All reader are soon going to require a valid permit, so make sure we
have a valid permit which we can pass to the underlying reader when
creating it. This means `row_cache::make_reader()` now also requires
a permit to be passed to it.
All reader are soon going to require a valid permit, so make sure we
have a valid permit which we can pass to the delegate reader when
creating it. This means `memtable::make_flat_reader()` now also requires
a permit to be passed to it.
Internally the permit is stored in `scanning_reader`, which is used both
for flushes and normal reads. In the former case a permit is not
required.
Now that the most prevalent users (range scan and single partition
reads) all pass valid permits we require all users to do so and
propagate the permit down towards `make_sstable_reader()`. The plan is
to use this permit for restricting the sstable readers, instead of the
semaphore the table is configured with. The various
`make_streaming_*reader()` overloads keep using the internal semaphores
as but they also create the permit before the read starts and pass it to
`make_sstable_reader()`.
In preparation of a valid permit being required to be passed to all
mutation sources, create a permit before creating the shard readers and
pass it to the mutation source when doing so. The permit is also
persisted in the `shard_mutation_querier` object when saving the reader,
which is another forward looking change, to allow the querier-cache to
use it to obtain the semaphore the read is actually registered with.
In preparation of a valid permit being required to be passed to all
mutation sources, also add a permit to the querier object, which is then
passed to the source when it is used to create a reader.
We want to move away from the current practice of selecting the relevant
read concurrency semaphore inside `table` and instead want to pass it
down from `database` so that we can pass down a semaphore that is
appropriate for the class of the query. Use the recently created
`query_class_config` struct for this. This is added as a parameter to
`data_query`, `mutation_query` and propagated down to the point where we
create the `querier` to execute the read. We are already propagating
down a parameter down the same route -- max_memory_reverse_query --
which also happens to be part of `query_class_config`, so simply replace
this parameter with a `query_class_config` one. As the lower layers are
not prepared for a semaphore passed from above, make sure this semaphore
is the same that is selected inside `table`. After the lower layers are
prepared for a semaphore arriving from above, we will switch it to be
the appropriate one for the class of the query.
This struct will serve as a container of all the query-class
dependent configuration such as the semaphore to be used and the memory
limit for unlimited queries. As there is no good place to put this, we
create a separate header for it.
Mutation sources will soon require a valid permit so make sure we have
one and pass it to the mutation sources when creating the underlying
readers.
For now, pass no_reader_permit() on call sites, deferring the obtaining
of a valid permit to later patches.
This contains a reader concurrency semaphore for the tests, that they
can use to obtain a valid permit for reads. Soon we are going to start
working towards a point where all APIs taking a permit will require a
valid one. Before we start this work we must ensure test code is able to
obtain a valid permit.
We want to make `read_permit` the single interface through which reads
interact with the concurrency limiting mechanism. So far it was only
usable to track memory consumption. Add the missing `wait_admission()`
and `consume_resources()` to the permit API. As opposed to
`reader_concurrency_semaphore::` equivalents which returned a
permit, the `reader_permit::` variants jut return
`reader_permit::resource_units` which is an RAII holder for the acquired
units. This also allows for the permit to be created earlier, before the
reader is admitted, allowing for tracking pre-admission memory usage as
well. In fact this is what we are going to do in the next patches.
This patch also introduces a `broken()` method on the reader concurrency
semaphore which resolves waiters with an exception. This method is also
called internally from the semaphore's destructor. This is needed
because the semaphore can now have external waiters, who has to be
resolved before the semaphore itself is destroyed.
We want to refactor reader_permit::memory_units to work in terms of
reader_resources, as we are planning to use it for guarding count
resources as well. This patch makes the first step: renames it from
memory_units to resources_units. Since this is a very noisy change, we
do it in a separate patch, the semantic change is in the next patch.
Tenants get their own connections for statement verbs and are further
isolated from each other by different scheduling groups. A tenant is
identified by a scheduling group and a name. When selecting the client
index for a statement verb, we look up the tenant whose scheduling group
matches the current one. This scheduling group is persisted across the
RPC call, using the name to identify the tenant on the remote end, where
a reverse lookup (name -> scheduling group) happens.
Instead of a single scheduling group to be used for all statement verbs,
messaging_service::scheduling_config now contains a list of tenants. The
first among these is the default tenant, the one we use when the current
scheduling group doesn't match that of any configured tenant.
To make this mapping easier, we reshuffle the client index assignment,
such that statement and statement-ack verbs have the idx 2 and 3
respectively, instead of 0 and 3.
The tenant configuration is configured at message service construction
time and cannot be changed after. Adding such capability should be easy
but is not needed for query classification, the current user of the
tenant concept.
Currently two tenants are configured: $user (default tenant) and
$system.
Per-user SLA means we have connection classifications determined dynamically,
as SLAs are added or removed. This means the classification information cannot
be static.
Fix by making it a non-static vector (instead of a static array), allowing it
to be extended. The scheduling group member pointer is replaced by a scheduling
group as a member pointer won't work anymore - we won't have a member to refer
to.
On the client side, we supply an isolation cookie based on the connection index
On the server side, we convert an isolation cookie back to a scheduling_group.
This has two advantages:
- rpc processes the entire connection using the scheduling group, so that code
is also isolated and accounted for
- we can later add per-user connections; the previous approach of looking at the
verb to decide the scheduling_group doesn't help because we don't have a set of
verbs per user
With this, the main group sees <0.1% usage under simple read and write loads.
Move it from a function-local static to a class static variable. We will want
to extend it in two ways:
- add more information per connection index (like the rpc isolation cookie)
- support adding more connections for per-user SLA
As a first step, make it an array of structures and make it accessible to all
of messaging_service.
In the next patches we will match reads to the appropriate reader
concurrency semaphore based on the scheduling group they run in. This
will result in a lot of system reads that are executed during startup
and that were up to now (incorrectly) using the user read semaphore to
switch to the system read semaphore. This latter has a much more
constrained concurrency, which was observed to cause system reads to
saturate and block on the semaphore, slowing down startup.
To solve this, boost the concurrency of the system read semaphore during
startup to match that of the user semaphore. This is ok, as during
startup there are no user reads to compete with. After startup, before
we start serving user reads the concurrency is reverted back to the
normal value.
* seastar 37774aa78...c97b05b23 (13):
> test: futures: test async with throw_on_move arg
> Merge 'fstream: close file if construction fails' from Botond
> util: tmp_file: include <seastar/core/thread.hh>
> test: file_io: test_file_stat_method: convert to use tmp_dir
> reactor: don't mlock all memory at once
> future: specify uninitialized_wrapper_base default constructors as noexcept
> test: tls: ignore gate_closed_exception
> rpc: recv_helper: ignore gate_closed_exception when replying to oversized requests
> sharded: support passing arbitrary shard-dependent parameters to service constructors
> Update circleci configuration for C++20
> treewide: deprecate seastar::apply()
> Update README.md about c++ versions
> cmake: Remove Seastar_STD_OPTIONAL_VARIANT_STRINGVIEW
No change right now as that is the current api version on the seastar
we have, but being explicit will let us upgrade seastar and change the
api independently.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200527235211.301654-1-espindola@scylladb.com>
The QueryFilter parameter of Query is only partially implemented (issue
tests for it.
In this patch, we add comprehensive tests for this feature and all its
various operators, types, and corner cases. The tests cover both the
parts we already implemented, and the parts we did not yet.
As usual, all tests succeed on DynamoDB, but many still xfail on Alternator
pending the complete implementation.
Refs #5028.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200525141242.133710-1-nyh@scylladb.com>
test_compaction_with_multiple_regions() has two calls to std::shuffle(),
one using std::default_random_engine() has the PRNG, but the other, later
on, using the std::random_device directly. This can cause failures due to
entropy pool exhaustion.
Fix by making the `random` variable refer to the PRNG, not the random_device,
and adjust the first std::shuffle() call. This hides the random_device so
it can't be used more than once.
Message-Id: <20200527124247.2187364-1-avi@scylladb.com>
Boost test macros are not safe to use in multiple shards (threads).
Doing so will result in their output being interwoven, making it
unreadable and generating invalid XML test reports. There was a lot of
back-and-forth on how to solve this, including introducing thread-safe
wrappers of the boost test macros, that use locks. This patch does
something much simple: it defines a bunch of replacement utility
functions for the used macros. These functions use the thread safe
seastar logger to log messages and throw exceptions when the
test has to be failed, which is pretty much what boost test does too.
With this the previously seen complaint about invalid XML is gone.
Example log messages from the utility functions:
DEBUG 2020-05-27 13:32:54,248 [shard 1] testlog - check_equal(): OK @ validate_result() test/boost/multishard_mutation_query_test.cc:863: ckp{0004fe57c8d2} == ckp{0004fe57c8d2}
DEBUG 2020-05-27 13:32:54,248 [shard 1] testlog - require(): OK @ validate_result() test/boost/multishard_mutation_query_test.cc:855
Fixes: #4774
Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200527104426.176342-1-bdenes@scylladb.com>
Boost test uses colored output by default, even when the output of the
test is redirected to a file. This makes the output quite hard to read
for example in Jenkins. This patch fixes this by disabling the colored
output when stdout is not a tty. This is in line with the colored output
of configure.py itself, which is also enabled only if stdout is a tty.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200526112857.76131-1-bdenes@scylladb.com>
seastar::apply() is deprecated in recent versions of seastar in favor
of std::apply(), so stop including its header. Calls to unqualified
apply(..., std::tuple<>) are resolved to std::apply() by argument
dependent lookup, so no changes to call sites are necessary.
This avoids a huge number of deprecation warnings with latest seastar.
Message-Id: <20200526090552.1969633-1-avi@scylladb.com>
We had to wait many years for it, but finally we have a starts_with()
method in C++20. Let's use it instead of ugly substr()-based code.
This is probably not a performance gain - substr() for a string_view
was already efficient. But it makes the code easier to understand,
and it allows us to rejoice in our decision to switch to C++20.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200526185812.165038-2-nyh@scylladb.com>
In commit cb7d3c6b55 we started to check
if two base64-encoded strings begin with each other without decoding
the strings first.
However, we missed the check_BEGINS_WITH function which does the same
thing. So this patch fixes this function as well.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200526185812.165038-1-nyh@scylladb.com>
"
In several tests we were calling random_device::operator() in a tight
loop. This is a slow operation, and in gcc 10 can fail if called too
frequently due to a bug [1].
Change to use a random_engine instead, seeded once from the
random_device.
Tests: unit (dev)
[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94087
"
* 'entropy' of git://github.com/avikivity/scylla:
tests: lsa_sync_eviction_test: don't exhaust random number entropy
tests: querier_cache_test: don't exhaust random number entropy
tests: loading_cache_test: don't exhaust random number entropy
tests: dynamic_bitset_test: don't exhaust random number entropy
In python, `is` and `is not` checks object identity, not value
equivalence, yet in `idl-compiler.py` it is used to compare strings.
Newer python versions (that shipped in Fedora32) complains about this
misuse, so this patch fixes it.
Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200526091811.50229-1-bdenes@scylladb.com>
This gives us access to std::ranges, the spaceship operator, and more.
Note coroutines are not yet enabled (these require g++ -fcoroutines) as
we are still working our problem with address santizer support.
Tests: unit (dev, debug, release)
Message-Id: <20200521092157.1460983-1-avi@scylladb.com>
Alternator supports four ways in which write operations can use quorum
writes or LWT or both, which we called "write isolation policies".
Until this patch, Alternator defaulted to the most generally safe policy,
"always_use_lwt". This default could have been overriden for each table
separately, but there was no way to change this default for all tables.
This patch adds a "--alternator-write-isolation" configuration option which
allows changing the default.
Moreover, @dorlaor asked that users must *explicitly* choose this default
mode, and not get "always_use_lwt" without noticing. The previous default,
"always_use_lwt" supports any workload correctly but because it uses LWT
for all writes it may be disappointingly slow for users who run write-only
workloads (including most benchmarks) - such users might find the slow
writes so disappointing that they will drop Scylla. Conversely, a default
of "forbid_rmw" will be faster and still correct, but will fail on workloads
which need read-modify-write operations - and suprise users that need these
operations. So Dor asked that that *none* of the write modes be made the
default, and users must make an informed choice between the different write
modes, rather than being disappointed by a default choice they weren't
aware of.
So after this patch, Scylla refuses to boot if Alternator is enabled but
a "--alternator-write-isolation" option is missing.
The patch also modifies the relevant documentation, adds the same option to
our docker image, and the modifies the test-running script
test/alternator/run to run Scylla with the old default mode (always_use_lwt),
which we need because we want to test RMW operations as well.
Fixes#6452
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200524160338.108417-1-nyh@scylladb.com>
The format is currently sitting in storage_service, but the
previous set patched all the users not to call it, instead
they use sstables_manager to get the highest supported format.
So this set finalizes this effort and places the format on
sstables_manager(s).
The set introduces the db::sstables_format_selector, that
- starts with the lowest format (ka)
- reads one on start from system tables
- subscribes on sstables-related features and bumps
up the selection if the respective feature is enabled
During its lifetime the selector holds a reference to the
sharded<database> and updates the format on it, the database,
in turn, propagates it further to sstables_managers. The
managers start with the highest known format (mc) which is
done for tests.
* https://github.com/xemul/scylla br-move-sstables-format-4:
storage_service: Get rid of one-line helpers
system_keyspace: Cleanup setup() from storage_service
format_selector: Log which format is being selected
sstables_manager: Keep format on
format_selector: Make it standalone
format_selector: Move the code into db/
format_selector: Select format locally
storage_service: Introduce format_selector
storage_service: Split feature_enabled_listener::on_enabled
storage_service: Tossing bits around
features: Introduce and use masked features
features: Get rid of per-features booleans
Instead of waiting for all replicas to reply execute prune after quorum
of replicas. This will keep system.paxos smaller in the case where one
node is down.
Fixes#6330
Message-Id: <20200525110822.GC233208@scylladb.com>
* seastar ee516b1c...37774aa7 (12):
> task: specify the default constructor as noexcept
> scheduling: scheduling_group: specify explicit constructor as noexcept
> net: tcp: use var after std::move()ed
> future: implement make_exception_future_with_backtrace
> future: Add noexcept to a few functions
> scheduling: Add noexcept to a couple of functions
> future: Move current_exception_as_future out of internal
> future: Avoid a call to std::current_exception
> seastar.hh: fix typo in doxygen main page text
> future: Replace a call to futurize_apply with futurize_invoke
> rpc: document how isolation work
> future: Optimize any::move_it
rt is moved before rt.tomb.timestamp is retrieved, so there's a
something that looks like use-after-move here (but really isn't).
found it while auditting the code.
[avi: adjusted changelog to note that it's not really a use-after-move]
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200525141047.168968-1-raphaelsc@scylladb.com>
Merge pull request https://github.com/scylladb/scylla/pull/6484 by
Kamil Braun:
Allow a node to join without bootstrapping, even if it couldn't contact
other nodes.
Print a BIG WARNING saying that you should never join nodes without
bootstrapping (by marking it as a seed or using auto_bootstrap=off).
Only the very first node should (must) be joined as a seed.
If you want to have more seeds, first join them using the only supported
way (i.e. bootstrap them), and only AFTER they have bootstrapped, change
their configuration to include them in the seed list.
Does not fix, but closes#6005. Read the discussion: it's enlightening.
See scylladb/scylla-docs#2647 for the correct procedure of joining a node.
Reverts 7cb6ac3.
The tests for the contains() operator of FilterExpression were based on
an incorrect understanding of what this operator does. Because the tests
were (as usual) run against DynamoDB and passed, there was nothing wrong
in the test per se - but it contains comments based on the wrong
understanding, and also various corner cases which aren't as interesting
as I thought (and vice versa - missed interesting corner cases).
All these tests continue to pass on DynamoDB, and xfail on Alternator
(because we didn't implement FilterExpression yet).
Refs #5038.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200525123812.131209-1-nyh@scylladb.com>
Same as 9d91ac345a, drop dependency on pystache
since it nolonger present in Fedora 32.
To implement it, simplified debian package build process.
It will be generate debian/ directory when building relocatable package,
we just need to run debuild using the package.
To generate debian/ directory this commit added debian_files_gen.py,
it construct whole directory including control and changelog files
from template files.
Since we need to stop pystache, these template files swiched to
string.Template class which is included python3 standard library.
see: https://github.com/scylladb/scylla/pull/6313
This patch cleans the estimated histogram implementation.
It removes the FIXME that were left in the code from the migration time
and the if0 commented out code.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
We call shuffle() with a random_device, extracting a true random
number in each of the many calls shuffle() will invoke.
Change it to use a random_engine seeded by a random_device.
This avoids exhausting entropy, see [1] for details.
[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94087
rand_int() re-creates a random device each time it is called.
Change it to use a static random_device, and get random numbers
from a random_engine instead of from the device directly.
This avoids exhausting entropy, see [1] for details.
[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94087
rand_int() re-creates a random device each time it is called.
Change it to use a static random_device, and get random numbers
from a random_engine instead of from the device directly.
This avoids exhausting entropy, see [1] for details.
[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94087
tests_random_ops() extracts a real random number from a random_device.
Change it to use a random number engine.
This avoids exhausting entropy, see [1] for details.
[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94087
Make the database be the format_selector target, so
when the format is selected its set on database which
in turn just forwards the selection into sstables
managers. All users of the format are already patched
to read it from those managers.
The initial value for the format is the highest, which
is needed by tests. When scylla starts the format is
updated by format_selector, first after reading from
system tables, then by selectiing it from features.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Remove the selector from storage_service and introduce
an instance in main.cc that starts soon after the gossiper
and feature_service, starts listening for features and
sets the selected format on storage_service.
This change includes
- Removal of for_testing bit from format_selector constructor,
now tests just do not use it
- Adding a gate to selection routine to make sure on exit all
the selection stuff is done. Although before the cluster join
the selector waits for the feature listeners to finish (the
.sync() method) this gate is still required to handle aborted
start cases and wait for gossiper announcement from selector
to complete.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now format_selector uses storage_service as a place to
keep the selected format. Change this by keeping the
selected format on selector itself and after selection
update one on the target.
The selector starts with the lowest format to maybe bumps
it up later.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The final goal is to have a entity that will
- read the saved sstables format (if any)
- listen for sstables format related features enabling
- select the top-most format
- put the selected format onto a "target"
- spread the world about it (via gossiper)
The target is the service from which the selected format is
read (so the selector can be removed once features agreement
is reached). Today it's the storage_service, but at the end
of this series it will be sstables_manager.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The split is into two parts, the goal is to move the 2nd one (the
selection logic itself) into another class.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The goal is to have main.cc add code between prepare_to_join
and join_token_ring. As a side effect this drives us closer
to proper split of storage service into sharded service itslef
vs start/boot/join code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Nowadays the knowledge about known/supported features is
scattered between frature_service and storage_service. The
latter uses knowledge about the selected _sstables_format
to alter the "supported" set.
Encapsulate this knowledge inside the feature_service with
the help of "masked_features" -- those, that shouldn't be
advertized to other nodes. When only maskable feature for
today is the UNBOUNDED_RANGE_TOMBSTONES one. Nowadays it's
reported as supported only if the sstables format is MC.
With this patch it starts as masked and gets unmasked when
the sstables format is selected to be MC, so the change is
correct.
This will make it possible to move sstables_format from
storage service to anywhere else.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The set of bool enable_something-s on feature_fonfig duplicates
the disabled_features set on it, so remove the former and make
full use of the latter.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
We had a very limited set of tests for the KeyConditions feature of
Query, which some error cases as well as important use cases (such as
bytes keys), leading to bugs #6490 and #6495 remaining undiscovered.
This patch adds a comprehensive test for the KeyConditions and (hopefully)
all its different combinations of operators, types, and many cases of errors.
We already had a comprehensive test suite for the newer
KeyConditionsExpression syntax, and this patch brings a similar level of
coverage for the older KeyConditions syntax.
Refs #6490
Refs #6495
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200524141800.104950-3-nyh@scylladb.com>
Improve error messages coming from Query's KeyCondition parameter when
wrong ComparisonOperators were used (issue discovered by @Orenef11).
At one point the error message was missing a parameter so resulted in an
internal error, while in another place the message mentioned an unuseful
number (enum) for the operator instead of its name. This patch fixes these
error messages.
Fixes#6490
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200524141800.104950-2-nyh@scylladb.com>
Our parsing of values in a KeyConditions paramter of Query was done naively.
As a result, we got bizarre error messages "condition not met: false" when
these values had incorrect type (this is issue #6490). Worse - the naive
conversion did not decode base64-encoded bytes value as needed, so
KeyConditions on bytes-typed keys did not work at all.
This patch fixes these bugs by using our existing utility function
get_key_from_typed_value(), which takes care of throwing sensible errors
when types don't match, and decoding base64 as needed.
Unfortunately, we didn't have test coverage for many of the KeyConditions
features including bytes keys, which is why this issue escaped detection.
A patch will follow with much more comprehensive tests for KeyConditions,
which also reproduce this issue and verify that it is fixed.
Refs #6490Fixes#6495
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200524141800.104950-1-nyh@scylladb.com>
When the 'forbid_rmw' write isolation policy is selected, read-modify-write
are intentionally forbidden. The error message in this case used to say:
"Read-modify-write operations not supported"
Which can lead users to believe that this operation isn't supported by this
version of Alternator - instead of realizing that this is in fact a
configurable choice.
So in this patch we just change the error message to say:
"Read-modify-write operations are disabled by 'forbid_rmw' write isolation policy. Refer to https://github.com/scylladb/scylla/blob/master/docs/alternator/alternator.md#write-isolation-policies for more information."
Fixes#6421.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200518125538.8347-1-nyh@scylladb.com>
Commit 75cf255c67 (repair: Ignore keyspace
that is removed in sync_data_using_repair) is not enough to fix the
issue because when the repair master checks if the table is dropped, the
table might not be dropped yet on the repair master.
To fix, the repair master should check if the follower failed the repair
because the table is dropped by checking the error returned from
follower.
With this patch, we would see
WARN 2020-04-14 11:19:00,417 [shard 0] repair - repair id 1 on shard 0
completed successfully, keyspace=ks, ignoring dropped tables={cf}
when the table is dropped during bootstrap.
Tests: update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_add_new_node_while_schema_changes_test
Fixes: #5942
"
This small series instructs seastar-json2code.py to also create a .cc
file. This reduces header bloat and fixes the current stack usage
warning in a dev build.
"
* 'espindola/json2code-cc' of https://github.com/espindola/scylla:
configure.py: Pass --create-cc to seastar-json2code.py
configure.py: Add a Source base class
configure.py: Fix indentation
DynamoDB seems to have started refusing requests unless
they include Content-Type header set to the following value:
application/x-amz-json-1.0
In order to make sure that manual tests work correctly,
let's add this header.
Message-Id: <ae0edafa311bce27b27e9e72aa51bb9717c360f2.1590052823.git.sarna@scylladb.com>
In docs/protocols.md, describing the protocols used by Scylla's (both
inter-node protocols and client-facing protocols), add a paragraph about
the ability to inspect most of these protocols, including Scylla's internal
inter-node protocol, using wireshark. Link to Piotr Sarna's recent blog post
about how to do this.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200524065248.76898-1-nyh@scylladb.com>
Print a BIG WARNING saying that you should never join nodes without
bootstrapping (by marking it as a seed or using auto_bootstrap=off).
Only the very first node should (must) be joined as a seed.
If you want to have more seeds, first join them using the only supported
way (i.e. bootstrap them), and only AFTER they have bootstrapped, change
their configuration to include them in the seed list.
Tested against performance regression using:
build/release/test/perf/perf_fast_forward --run-test=small-partition-skips -c1
I get similar results before and after the patch.
Message-Id: <20200521213032.15286-1-tgrabiec@scylladb.com>
Introduce ~/.config/scylladb/dbuild configuration file, and
SCYLLADB_DBUILD environment variables, that inject options into
the docker run command. This allows adding bind mounts for ccache
and distcc directories, as well as any local scripts and PATH
or other environment configuration to suit the user's needs.
Message-Id: <20200521133529.25880-1-avi@scylladb.com>
After "Make replacing node take writes" series, with repair based node
operations disabled, we saw the replace operation fail like:
```
[shard 0] init - Startup failed: std::runtime_error (unable to find
sufficient sources for streaming range (9203926935651910749, +inf) in
keyspace system_auth)
```
The reason is the system_auth keyspace has default RF of 1. It is
impossible to find a source node to stream from for the ranges owned by
the replaced node.
In the past, the replace operation with keyspace of RF 1 passes, because
the replacing node calls token_metadata.update_normal_tokens(tokens,
ip_of_replacing_node) before streaming. We saw:
```
[shard 0] range_streamer - Bootstrap : keyspace system_auth range
(-9021954492552185543, -9016289150131785593] exists on {127.0.0.6}
```
Node 127.0.0.6 is the replacing node 127.0.0.5. The source node check in
range_streamer::get_range_fetch_map will pass if the source is the node
itself. However, it will not stream from the node itself. As a result,
the system_auth keyspace will not get any data.
After the "Make replacing node take writes" series, the replacing node
calls token_metadata.update_normal_tokens(tokens, ip_of_replacing_node)
after the streaming finishes. We saw:
```
[shard 0] range_streamer - Bootstrap : keyspace system_auth range
(-9049647518073030406, -9048297455405660225] exists on {127.0.0.5}
```
Since 127.0.0.5 was dead, the source node check failed, so the bootstrap
operation.
Ta fix, we ignore the keyspace of RF 1 when it is unable to find a source
node to stream.
Fixes#6351
Currently, replace and bootstrap share the same streaming reason,
stream_reason::bootstrap, because they share most of the code
in boot_strapper.
In order to distinguish the two, we need to introduce a new stream
reason, stream_reason::replace. It is safe to do so in a mixed cluster
because current code only check if the stream_reason is
stream_reason::repair.
Refs: #6351
When index file is larger than 4GB, offset calculation will overflow
uint32_t and _promoted_index_end will be too small.
As a result, promoted_index_size calculation will underflow and the
rest of the page will be interpretd as a promoted index.
The partitions which are in the remainder of the index page will not
be found by single-partition queries.
Data is not lost.
Introduced in 6c5f8e0eda.
Fixes#6040
Message-Id: <20200521174822.8350-1-tgrabiec@scylladb.com>
Replace it with std::tuple, introduce range_populating_reader::read_result
type alias for less keystrokes.
This makes row_cache.o compilation warn-less.
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200518160511.26984-1-xemul@scylladb.com>
- base image changed from Fedora 31 to Fedora 32
- disambiguate base image to use docker.io registry
- pystache and python-casasndra-driver are no longer availble,
so use pip3 to install them. Add pip3 to packages.
- since pip3 installs commands to /usr/local/bin, update checks
in build_deb to check for those too
Fedora 32 packages gcc 10, which has support for coroutines.
Message-Id: <20200521063138.1426400-1-avi@scylladb.com>
In order to remove a FIXME, code which checks a BEGINS_WITH
relation between base64-encoded strings is computed in a way
which does not involve decoding the whole string.
In case of padding, the remainders are still decoded, but their
size is bounded by 3, which means they will be eligible for the
small string optimization.
In order to get rid of a FIXME, the code which computes the size
of decoded base64 string based only on encoded size + padding is added.
The result is an O(1) function with just a couple of ops
(15 when checking with godbolt and gcc9), so it's a general improvement
over having to allocate a string and get its size.
Currently, push() attaches a continuation to the _not_full future, if
push() is called when the buffer is already full. This is not needed as
we can safely push the fragment even if the buffer is already full.
Furthermore we can eliminate the possibility of push() being called when
the buffer is full, by checking whether it is full *after* pushing the
fragment, not before.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200521055840.376019-1-bdenes@scylladb.com>
This reverts commit 43b488a7bc. The commit
was originally reverted because a dtest was sensitive to the value. The
dtest is fixed now, so let's revert the revert as requested by Glauber.
Fixes#6459
When moving or removing endpoints, we should ensure
that the set of available racks reflect the nodes
known, i.e. match what would be the result of a
reboot + create sets initially.
Message-Id: <20200519153300.15391-1-calle@scylladb.com>
Although Python 2 is deprecated, some systems today still have "python"
and "pytest" pointing to Python 2, so it would be convenient for the
Alternator tests to work on both Python 2 and 3 if it's not too much
of an effort.
And it really isn't too much of an effort - they all work on both versions
except for one problem introduced in the previous test patch: The syntax b''
for an empty byte array works correctly on Python 3 but incorrectly on
Python 2: In Python 2, b'' is just a normal empty string, not byte array,
which confuses Boto3 which refuses to accept a string as a value for a
byte-array key.
The trivial fix is to replace b'' by bytearray('', 'utf-8').
Uglier, but works as expected on both Python 2 and 3.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200519214321.25152-1-nyh@scylladb.com>
* seastar 92365e7b8...ee516b1cc (17):
> build: use -fcommon compiler flag for dpdk
> coroutines: reduce template bloat
> thread: make async noexcept
> file: specify methods noexcept
> doc: drop grace period for old C++ standard revisions
> semaphore: specify consume_units as noexcept
> doc/tutorial.md: add short intro to seastar::sharded<>
> future: Move promise_base move constructor out of line
> coroutines: enable for C++20
> tutorial: adjust evaluation order warning to note it is C++14-only
> rpc_test: Fix test_stream_connection_error with valgrind
> file: Remove unused lambda capture
> install-dependencies: add valgrind to arch
> coroutines_test: Don't access a destroyed lambda
> tutorial: warn about evaluation order pitfall
> merge: apps: improvements in httpd and seawreck
> file: Move functions out of line
This adds a Json2Code class now that both a .cc and a .hh are
produced.
Creating a .cc file reduces header bloat and fixes the current stack
too large warning in a dev build.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The Alternator test (test/alternator/run) runs the real Scylla executable
to test it. Users sometimes want to run Scylla manually in parallel (on
different IP addresses, of course) and sometimes use commands like
"killall scylla" to stop it, may be surprised that this command will also
unintentionally kill a running test.
So what this patch does is to name the Scylla process used for the test
with the name "test_scylla". It will be visible as "test_scylla" in top,
and a "killall scylla" will not touch it. You can, of course, kill it with
a "killall test_scylla" if you wish.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200519071604.19161-1-nyh@scylladb.com>
According to DynamoDB, string/binary blob keys cannot be empty
and this definition affects secondary indexes as well.
As a result, only nonempty strings/binary blobs are accepted
as values for columns which form a GSI or LSI key.
Clarify in README.md that the instructions there will build a Docker image
containing a Scylla executable downloaded from downloads.scylla.com - NOT
the one you built yourself. The image is also CentOS based - not Fedora-based
as claimed.
In addition, a new dist/docker/redhat/README.md explains the somewhat
steps needed to actually build a Docker image with the Scylla executable
that you built. In the future, these steps should be automated (e.g.,
"ninja docker") but until then, let's at least document the process.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200518151123.11313-1-nyh@scylladb.com>
C++20 deprecates capturing this in default-copy lambdas ([=]), with
good reason. Move to explicit captures to avoid any ambiguity and
reduce warning spew.
Message-Id: <20200517150834.753463-1-avi@scylladb.com>
In order to add tracing to places where it can be useful,
e.g. materialized view updates and hinted handoff, tracing state
is propagated to all applicable call sites.
CDC Log is a time series which uses time window compaction with some
time window. Data is TTLed with the same value. This means that sstable
won't become fully expired more often than once per time window
duration.
This patch sets expired_sstable_check_frequency_seconds compaction
strategy parameter to half of the time window. Default value of this
parameter is 10 minutes which in most cases won't be a good fit.
By default, we set TTL to 24h and time window to 1h. This means that
with a default value of the parameter we would be checking every 10
minutes but new expired sstable would appear only every 60 minutes.
The parameter is set to half of the time window duration because it's
the expected time we have to wait for sstable to become fully expired.
Half of the time we will wait longer and half of the time we will wait
shorter.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
LWT batches conditions can't span multiple tables.
This was detected in batch_statement::validate() called in ::prepare().
But ::cas_result_set_metadata() was built in the constructor,
causing a bitset assert/crash in a reported scenario.
This patch moves validate() to the constructor before building metadata.
Closes#6332
Tested with https://github.com/scylladb/scylla-dtest/pull/1465
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
C++20 deprecates capturing this in default-copy lambdas ([=]), with
good reason. Move to explicit captures to avoid any ambiguity and
reduce warning spew.
Message-Id: <20200517150921.754073-1-avi@scylladb.com>
C++20 deprecates capturing this in default-copy lambdas ([=]), with
good reason. Move to explicit captures to avoid any ambiguity and
reduce warning spew.
Message-Id: <20200517151023.754906-1-avi@scylladb.com>
In a recent next failure I got the following backtrace
#3 0x00007efd71251a66 in __GI___assert_fail (assertion=assertion@entry=0x2d0c00 "this->_con->get()->sink_closed()", file=file@entry=0x32c9d0 "./seastar/include/seastar/rpc/rpc_impl.hh", line=line@entry=795,
function=function@entry=0x270360 "seastar::rpc::sink_impl<Serializer, Out>::~sink_impl() [with Serializer = netw::serializer; Out = {repair_row_on_wire_with_cmd}]") at assert.c:101
#4 0x0000000001f5d2c3 in seastar::rpc::sink_impl<netw::serializer, repair_row_on_wire_with_cmd>::~sink_impl (this=<optimized out>, __in_chrg=<optimized out>) at ./seastar/include/seastar/core/future.hh:312
#5 0x0000000001f5d2f4 in seastar::shared_ptr_count_for<seastar::rpc::sink_impl<netw::serializer, repair_row_on_wire_with_cmd> >::~shared_ptr_count_for (this=0x60100075b680, __in_chrg=<optimized out>)
at ./seastar/include/seastar/core/shared_ptr.hh:463
#6 seastar::shared_ptr_count_for<seastar::rpc::sink_impl<netw::serializer, repair_row_on_wire_with_cmd> >::~shared_ptr_count_for (this=0x60100075b680, __in_chrg=<optimized out>) at ./seastar/include/seastar/core/shared_ptr.hh:463
#7 0x000000000240f2e6 in seastar::shared_ptr<seastar::rpc::sink<repair_row_on_wire_with_cmd>::impl>::~shared_ptr (this=0x601003118590, __in_chrg=<optimized out>) at ./seastar/include/seastar/core/future.hh:427
#8 seastar::rpc::sink<repair_row_on_wire_with_cmd>::~sink (this=0x601003118590, __in_chrg=<optimized out>) at ./seastar/include/seastar/rpc/rpc_types.hh:270
#9 <lambda(auto:134&)>::<lambda(const seastar::rpc::client_info&, uint64_t, seastar::rpc::source<repair_hash_with_cmd>)>::<lambda(std::__exception_ptr::exception_ptr)>::~<lambda> (this=0x601003118570, __in_chrg=<optimized out>)
at repair/row_level.cc:2059
This patch changes a few functions to use finally to make sure the sink
is always closed.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200515202803.60020-1-espindola@scylladb.com>
Some statements made in docs/alternator/alternator.md on having a single
keyspace, or recommending a DNS setup, are not up-to-date. So fix them.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200517132444.9422-1-nyh@scylladb.com>
The test/alternator/run script starts Scylla to be tested. It waits until
CQL is responsive and if Scylla dies earlier, recognizes the failure
immediately. This is useful so we see boot errors immediately instead of
waiting for the first test to timeout and fail.
However, Scylla starts the Alternator service after CQL. So it is possible
that after the "run" script found CQL to be up, Alternator couldn't start
(e.g., bad configuration parameters) and Scylla is shut down, and instead
of recognizing this situation, we start the actual test.
The fix is simple: don't start the tests until verifying that Alternator
is up. We verify this using the trivial healthcheck request (which is
nothing more than an HTTP GET request).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200517125851.8484-1-nyh@scylladb.com>
The instructions in README.md about building a docker image start with
"cd dist/docker", but it actually needs to be "cd dist/docker/redhat".
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200517152815.15346-1-nyh@scylladb.com>
"
This series changes the describe_ring API to use HTTP stream instead of serializing the results and send it as a single buffer.
While testing the change I hit a 4-year-old issue inside service/storage_proxy.cc that causes a use after free, so I fixed it along the way.
Fixes#6297
"
* amnonh-stream_describe_ring:
api/storage_service.cc: stream result of token_range
storage_service: get_range_to_address_map prevent use after free
The get token range API can become big which can cause large allocation
and stalls.
This patch replace the implementation so it would stream the results
using the http stream capabilities instead of serialization and sending
one big buffer.
Fixes#6297
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
The implementation of get_range_to_address_map has a default behaviour,
when getting an empty keypsace, it uses the first non-system keyspace
(first here is basically, just a keyspace).
The current implementation has two issues, first, it uses a reference to
a string that is held on a stack of another function. In other word,
there's a use after free that is not clear why we never hit.
The second, it calls get_non_system_keyspaces twice. Though this is not
a bug, it's redundant (get_non_system_keyspaces uses a loop, so calling
that function does have a cost).
This patch solves both issues, by chaning the implementation to hold a
string instead of a reference to a string.
Second, it stores the results from get_non_system_keyspaces and reuse
them it's more efficient and holds the returned values on the local
stack.
Fixes#6465
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
In add40d4e59, we relaxed the prohibition of unbounded DELETE and
stopped testing the failure message. But there are still scenarios
when unbounded DELETE is prohibited, so add a test to ensure we
continue to catch it where appropriate.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
"
The shutdown process of compaction manager starts with an explicit call
from the database object. However that can only happen everything is
already initialized. This works well today, but I am soon to change
the resharding process to operate before the node is fully ready.
One can still stop the database in this case, but reshardings will
have to finish before the abort signal is processed.
This patch passes the existing abort source to the construction of the
compaction_manager and subscribes to it. If the abort source is
triggered, the compaction manager will react to it firing and all
compactions it manages will be stopped.
We still want the database object to be able to wait for the compaction
manager, since the database is the object that owns the lifetime of
the compaction manager. To make that possible we'll use a future
that is return from stop(): no matter what triggered the abort, either
an early abort during initial resharding or a database-level event like
drain, everything will shut down in the right order.
The abort source is passed to the database, who is responsible from
constructing the compaction manager
Tests: unit (debug), manual start+stop, manual drain + stop, previously
failing dtests.
"
Fixed-size integer types are legal varints - both are serialized as
two's complement in network byte order. So there's tinyint, shortint,
int, and bigint can be interpreted as varints.
Change is_compatible_with() to reflect that.
Message-Id: <20200516115143.28690-2-avi@scylladb.com>
The short and byte types are two's complement network byte order,
just like varint (except fixed size) and so varint can read them
just fine.
Mark them as value compatible like int32_type and long_type.
A unit test is added.
Message-Id: <20200516115143.28690-1-avi@scylladb.com>
This avoids potential use-after-move, since undefined c++ sequencing order
may std::move(f) in the lambda capture before evaluating f.stat().
Also, this makes use of a more generic library function that doesn't
require to open and hold on to the file in the application.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200514152054.162168-1-bhalevy@scylladb.com>
Consider: n1, n2, n1 is the repair master, n2 is the repair follower.
=== Case 1 ===
1) n1 sends missing rows {r1, r2} to n2
2) n2 runs apply_rows_on_follower to apply rows, e.g., {r1, r2}, r1
is written to sstable, r2 is not written yet, r1 belongs to
partition 1, r2 belongs to partition 2. It yields after row r1 is
written.
data: partition_start, r1
3) n1 sends repair_row_level_stop to n2 because error has happened on n1
4) n2 calls wait_for_writer_done() which in turn calls write_end_of_stream()
data: partition_start, r1, partition_end
5) Step 2 resumes to apply the rows.
data: partition_start, r1, partition_end, partition_end, partition_start, r2
=== Case 2 ===
1) n1 sends missing rows {r1, r2} to n2
2) n2 runs apply_rows_on_follower to apply rows, e.g., {r1, r2}, r1
is written to sstable, r2 is not written yet, r1 belongs to partition
1, r2 belongs to partition 2. It yields after partition_start for r2
is written but before _partition_opened is set to true.
data: partition_start, r1, partition_end, partition_start
3) n1 sends repair_row_level_stop to n2 because error has happened on n1
4) n2 calls wait_for_writer_done() which in turn calls write_end_of_stream().
Since _partition_opened[node_idx] is false, partition_end is skipped,
end_of_stream is written.
data: partition_start, r1, partition_end, partition_start, end_of_stream
This causes unbalanced partition_start and partition_end in the stream
written to sstables.
To fix, serialize the write_end_of_stream and apply_rows with a semaphore.
Fixes: #6394Fixes: #6296Fixes: #6414
The Redis API in Scylla only supports a small subset of the Redis
commands. Let's document what we support so people have the right
expectations when they try it out.
Avoid `f(s).then([s = std::move(s)] {})` patterns,
where the move into the lambda capture may potentially be
sequenced by the compiler before passing `s` to function `f`.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200514131701.140046-1-bhalevy@scylladb.com>
The existing text did not explain what happens if additional DCs are added
to the cluster, so this patch improves the explanation of the status of
our support for global tables, including that issue.
Fixes#6353
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200513175908.21642-1-nyh@scylladb.com>
The shutdown process of compaction manager starts with an explicit call
from the database object. However that can only happen everything is
already initialized. This works well today, but I am soon to change
the resharding process to operate before the node is fully ready.
One can still stop the database in this case, but reshardings will
have to finish before the abort signal is processed.
This patch passes the existing abort source to the construction of the
compaction_manager and subscribes to it. If the abort source is
triggered, the compaction manager will react to it firing and all
compactions it manages will be stopped.
We still want the database object to be able to wait for the compaction
manager, since the database is the object that owns the lifetime of
the compaction manager. To make that possible we'll use a future
that is return from stop(): no matter what triggered the abort, either
an early abort during initial resharding or a database-level event like
drain, everything will shut down in the right order.
The abort source is passed to the database, who is responsible from
constructing the compaction manager.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We want stop() to be callable just once. Having the compaction manager
stopped twice is a potential indication that something is wrong.
Still there are places where we want to stop all ongoing compactions
and prevent new from running - like the drain operation. Today the
only operation that allows for cancellation of all existing compations
is stop(). To unweave this, we will split those two things.
A drain operation is carved out, and it should be safe to be called many
times. The compaction manager is usable after this, and new compactions
can even be sent if it happen to be enabled again (we currently don't)
A stop operation, which includes a drain, will only be allowed once. After
a stop() the compaction_manager object is no longer usable.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We are having many issues with the stop code in the compaction_manager.
Part of the reason is that the "stopped" state has its meaning overloaded
to indicate both "compaction manager is not accepting compactions" and
"compaction manager is not ready or destructed".
In a later step we could default to enabled-at-start, but right now we
maintain current behavior to minimize noise.
It is only possible to stop the compaction manager once.
It is possible to enable / disable the compaction manager many times.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Merged pull request https://github.com/scylladb/scylla/pull/6427
by Piotr Jastrzębski:
CDC Log is a time series so it makes sense to use time window compaction
strategy for it.
Our support for time series is limited so we make sure that we don't create
more than 24 sstables.
If TTL is configured to 0, meaning data does not expire, we don't use time
window compaction strategy.
This PR also sets gc_grace_seconds to 0 when TTL is not set to 0.
Print the test command line and the UBSAN and ASAN env settings to the log
so the run can be easily reproduced (optionally with providing --random-seed=XXX
that is printed by scylla unit tests when they start).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200513110959.32015-1-bhalevy@scylladb.com>
After commit 88d2486fca, removal of shared SSTables is not atomic anymore.
They can be first removed from the list of shared SSTables and only later be
removed from the SSTable set. That list is used to filter out shared SSTables
from regular compaction candidates.
So it can happen that regular compaction pick up a shared SSTable as candidate
after it was removed from that list but before it was removed from the set.
To fix this, let's only remove a shared SSTable from that aforementioned list
after it was successfully removed from the SSTable set, so that a shared
SSTable cannot be selected for regular compaction anymore.
Fixes#6439.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200512175224.114487-1-raphaelsc@scylladb.com>
C++20 makes string literals defined with u8"my string" as using
a new type char8_t. This is sensible, as plain char might not
have 8 bits, but conflicts with our bytes type.
Adjust by having overloads that cast back to char*. This limits
us to environments where char is 8 bits, but this is already a
restriction we have.
Reviewed-by: Dejan Mircevski <dejan@scylladb.com>
Message-Id: <20200512101646.127688-1-avi@scylladb.com>
C++20 deprecates std::is_pod<> in favor of the easier-to-type
std::is_starndard_layout<> && std::is_trivial<>. Change to the
recommendation in order to avoid a flood of warnings.
Reviewed-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200512092200.115351-1-avi@scylladb.com>
std::memory_order is an unscoped enum, and so does not need its
members to be prefixed with std::memory_order::, just std::.
This used to work, but in C++20 it no longer does. Use the
standard way to name these constants, which works in both C++17
and C++20.
Reviewed-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200512092408.115649-1-avi@scylladb.com>
C++20 changed the parameter to the binary operation function in std::accumulate()
to be passed by value (quite sensibly). Adjust the code to be compatible by
using a #if. This will be removed once we switch over to C++20.
Message-Id: <20200512105427.142423-1-avi@scylladb.com>
C++20 makes string literals defined with u8"foo" return a new char8_t.
This is sensible but is noisy for us. Cast them to plain const char.
Message-Id: <20200512104751.137816-1-avi@scylladb.com>
C++20 makes string literals defined with u8"blah" return a new
char8_t type, which is sensible but noisy here.
Adjust for it by dropping an unneeded u8 in one place, and adding a
cast in another.
Message-Id: <20200512104515.137459-1-avi@scylladb.com>
C++20 passes the input to the binary operation by value (which is
sensible), but is not compatible with C++17. Add some #if logic
to support both methods. We can remove the logic when we fully
transition to C++20.
Message-Id: <20200512101355.127333-1-avi@scylladb.com>
In theory we shouldn't have empty keys in the database, as we validate
all keys that enter the database via CQL with
`validation::validate_cql_keys()`, which will reject empty keys. In this
context, empty means a single-component key, with its only component
being empty.
Yet recently we've seen empty keys appear in a cluster and wreak havoc
on it, as they will cause the memtable flush to fail due to the sstable
summary rejecting the empty key. This will cause an infinite loop, where
Scylla keeps retrying to flush the memtable and failing. The intermediate
consequence of this is that the node cannot be shut down gracefully. The
indirect consequence is possible data loss, as commitlog files cannot be
replayed as they just re-insert the empty key into the memtable and the
infinite flush retry circle starts all over again. A workaround is to
move problematic commitlog files away, allowing the node to start up.
This can however lead to data loss, if multiple replicas had to move
away commitlogs that contain the same data.
To prevent the node getting into an unusable state and subsequent data
loss, extend the existing defenses against invalid (empty) keys to the
commitlog replay, which will now ignore them during replay.
Fixes: #6106
* denesb/empty-keys/v5:
commitlog_replayer: ignore entries with invalid keys
test: lib/sstable_utils: add make_keys_for_shard
validation: add is_cql_key_invalid()
validation: validate_cql_key(): make key parameter a `partition_key_view`
partition_key_view: add validate method
We use boost::bimap for bi-directional conversion from protocol type
encodings to type objects.
Unfortunately, boost::bimap isn't C++20-ready.
Fortunately, we only used one direction of the bimap.
Replace with plain old std::unordered_map<>.
Message-Id: <20200512103726.134124-1-avi@scylladb.com>
Related commit: 85d5c3d
When attempting to send a hint, an exception might occur that results in
that hint being discarded (e.g. keyspace or table of the hint was
removed).
When such an exception is thrown, position of the hint will already be
stored in rps_set. We are only allowed to retain positions of hints that
failed to be sent and needed to be retried later. Dropping a hint is not
an error, therefore its position should be removed from rps_set - but
current logic does not do that.
Because of that bug, hint files with many discardable hints might cause
rps_set to grow large when the file is replayed. Furthermore, leaving
positions of such hints in rps_set might cause more hints than necessary
to be re-sent if some non-discarded hints fail to be sent.
This commit fixes the problem by removing positions of discarded hints
from rps_set.
Fixes#6433
* seastar e708d1df3a...92365e7b87 (11):
> tests: distributed_test: convert to SEASTAR_TEST_CASE
> Merge "Avoid undefined behavior on future self move assignments" from Rafael
> Merge "C++20 support" from Avi
> optimized_optional: don't use experimental C++ features
> tests: scheduling_group_test: verify that later() doesn't modify the current group
> tests: demos: coroutine_demo: add missing include for open_file_dma()
> rpc: minor documentation improvements
> rpc: Assert that sinks are closed
> Merge "Fix most tests under valgrind" from Rafael
> distributed_test: Fix it on slow machines
> rpc_test: Make sure we always flush and close the sink
loading_shard_values.hh: added missing include for gcc6-concepts.hh,
exposed by the submodule update.
Frozen toolchain updated for the new valgrind dependency.
When replaying the commitlog, pass keys to
`validation::validate_cql_key()`. Discard entries which fail validation
and warn about it in the logs.
This prevents invalid keys from getting into the system, possibly
failing the commitlog replay and the successful boot of the node,
preventing the node from recovering data.
A variant of make_keys() which creates keys for the requested shard. As
this version is more generic than the existing local_shards_only
variant, the former is reimplemented on top of the latter.
This is more general than the previous `const partition_key&` and allows
for passing keys obtained from the likes of `frozen_mutation` that only
have a view of the key.
While at it also change the schema parameter from schema_ptr to const
schema&. No need to pass a shared pointer.
We want to be able to pass `partition_key_view` to
`validation::validate_cql_key()`. As the latter wants to call
`validate()` on the key, replicate `partition_key::validate()` in
`partition_key_view`.
In write_end_of_stream, it does:
1) Write write_partition_end
2) Write empty mutation_fragment_opt
If 1) fails, 2) will be skipped, the consumer of the queue will wait for
the empty mutation_fragment_opt forever.
Found this issue when injecting random exceptions between 1) and 2).
Refs #6272
Refs #6248
This series adds support for taking a snapshot of multiple tables.
Fixes#6333
* amnonh-snapshot_keyspace_table:
api/storage_service.cc: Snapshot, support multiple tables
service/storage_service: Take snapshot of multiple tables
CDC Log is a time series with data TTLed by default to 24 hours so
it makes sense to use for it a time window compaction.
A window size is adjusted to the TTL configured for CDC Log so that
no more than 24 sstables will be created.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
We shouldn't assume the I/O priority class for compactions. For
instance, if we are dealing with offstrategy compactions we may want to
use the maintenance group priority for them.
For now, all compactions are put in the compaction class. rewrite
compactions (scrub, cleanup) could be maintenance, but we don't have
clear access to the database object at this time to derive the
equivalent CPU priority. This is planned to be changed in the future,
and when we do change it, we'll adjust.
Same goes for resharding: while we could at this point change it we'd
risking memory pressure since resharding is run online and sstables are
shared until resharding is done. When we move it to offline execution
we'll do it with maintenance priority.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20200512002233.306538-3-glauber@scylladb.com>
To do that - and still avoid a copy - we need to add some fields
to the compaction object that are exclusive to regular_compaction.
Still, not only this simplifies the code, resharding and regular
compaction look more and more alike.
This is done now in preparation for another patch that will add
more information to the descriptor.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20200512002233.306538-2-glauber@scylladb.com>
In order to be sure that all nodes acknowledged that a table was
created, the CreateTable request will now only return after
seeing that schema agreement was reached.
Rationale: alternator users check if the table was created by issuing
a DescribeTable request, and assume that the table was correctly
created if it returns nonempty results. However, our current
implementation of DescribeTable returns local results, which is
not enough to judge if all the other nodes acknowledge the new table.
CQL drivers are reported to always wait for schema agreement after
issuing DDL-changing requests, so there should be no harm in waiting
a little longer for alternator's CreateTable as well.
Fixes#6361
Tests: alternator(local)
Since alternator is based on Scylla, two "already exists" error types
can appear when trying to create a table - that a table itself exists,
or that its keyspace does. That's however an implementation detail,
since alternator does not have a notion of keyspaces at all.
This patch unifies the error message to simply mention that a table
already exists, and comes with a more robust test case.
If the keyspace already exists, table creation will still be attempted.
Fixes#6340
Tests: alternator(local, remote)
Paxos may leave an operation in a background after returning result to a
caller. Lest add a counter for background/foreground paxos handlers so
that it will be easier to detect memory related issues.
Message-Id: <20200510092942.GA24506@scylladb.com>
"
A good portion of the values that one would want to be examine with
scylla-tools will be partition or clustering keys. While examining them
was possible before too, especially for single component keys, it
required manually extracting the components from it, so they can be
individually examined.
This series adds support for working with keys directly, by adding
prefixable and full compound type support.
When passing --prefix-compound or --full-compound, multiple types can be
passed, which will form the compound type.
Example:
$ scylla_types --print --prefix-compound -t TimeUUIDType -t Int32Type 0010d00819896f6b11ea00000000001c571b000400000010
(d0081989-6f6b-11ea-0000-0000001c571b, 16)
Another feature added in this series is validation. For this,
`compound_type::validate()` had to be implemented first. We already use
this in our code, but currently has a no-op body.
Example:
$ scylla-types --validate --full-compound -t TimeUUIDType -t Int32Type 0010d00819896f6b11ea00000000001c571b0004000000
0010d00819896f6b11ea00000000001c571b0004000000: INVALID - seastar::internal::backtraced<marshal_exception> (marshaling error: compound_type iterator - not enough bytes, expected 4, got 3 Backtrace: 0x1b2e30f
0x85c9d5
0x85cb07
0x85cc7b
0x85cd7c
0x85d2d7
0x844e03
0x84241b
0x84490b
0x844ae5
0x19c0362
0x19c0741
0x19c13d1
0x19c4b44
0x8aeb7a
0x8aeca7
0x19ebc90
0x19fb8d5
0x1a12b49
0x19c4376
0x19c47a6
0x19c4900
0x843373
/lib64/libc.so.6+0x271a2
0x84202d
)
Tests: unit(dev)
"
* 'tools-scylla-types-compound-support/v1' of https://github.com/denesb/scylla:
tools/scylla_types: add validation action
tools/scylla_types: add compound_type support
tools/scylla_types: single source of truth for actions
compound_type: implement validate()
compound_type: fix const correctness
tools: mv scylla_types scylla-types
When sending hints from one file, rps_set field in send_one_file_ctx
keeps track of commitlog positions of hints that are being currently
sent, or have failed to be sent. At the end of the operation, if sending
of some hints failed, we will choose position of the earliest hint that
failed to be sent, and will retry sending that file later, starting from
that position. This position is stored in _last_not_complete_rp.
Usually, this set has a bounded size, because we impose a limit of at
most 128 hints being sent concurrently. Because we do not attempt to
send any more hints after a failure is detected, rps_set should not have
more than 128 elements at a time.
Due to a bug, commitlog positions of old hints (older than
gc_grace_seconds of the destination table) were inserted into rps_set
but not removed after checking their age. This could cause rps_set to
grow very large when replaying a file with old hints.
Moreover, if the file mixed expired and non-expired hints (which could
happen if it had hints to two tables with different gc_grace_seconds),
and sending of some non-expired hints failed, then positions of expired
hints could influence calculation _last_not_complete_rp, and more hints
than necessary would be resent on the next retry.
This simple patch removes commitlog position of a hint from rps_set when
it is detected to be too old.
Fixes#6422
"
We inherited from Origin a `caching` table parameter. It's a map of named caching parameters. Before this PR two caching parameters were expected: `keys` and `rows_per_partition`. So far we have been ignoring them. This PR adds a new caching parameter called `enabled` which can be set to `true` or `false` and controls the usage of the cache for the table. By default, it's set to `true` which reflects Scylla behavior before this PR.
This new capability is used to disable caching for CDC Log table. It is desirable because CDC Log entries are not expected to be read often. They also put much more pressure on memory than entries in Base Table. This is caused by the fact that some writes to Base Table can override previous writes. Every write to CDC Log is unique and does not invalidate any previous entry.
Fixes#6098Fixes#6146
Tests: unit(dev, release), manual
"
* haaawk-dont_cache_cdc:
cdc: Don't cache CDC Log table
table: invalidate disabled cache on memtable flush
table: Add cache_enabled member function
cf_prop_defs: persist caching_options in schema
property_definitions: add get that returns variant
feature: add PER_TABLE_CACHING feature
caching_options: add enabled parameter
We use pystache to parametrize our scylla.spec, but pystache is not
present in Fedora 32. Fortunately rpm provides its own template mechanism,
and this patch switches to using it:
- no longer install pystache
- pass parameters via rpm "-D" options
- use 0/1 for conditionals instead of true/false as per rpm conventions
- sanitize the "product" variable to not contain dashes
- change the .spec file to use rpm templating: %{...} and %if ... %endif
instead of mustache templating
Input SSTables of resharding is deleted at the coordinator shard, not at the
shards they belong to.
We're not acquiring deletion semaphore before removing those input SSTables
from the SSTable set, so it could happen that resharding deletes those
SSTables while another operation like snapshot, which acquires the semaphore,
find them deleted.
Let's acquire the deletion semaphore so that the input SSTables will only
be removed from the set, when we're certain that nobody is relying on their
existence anymore.
Now resharding will only delete input SStables after they're safely removed
from the SSTable set of all shards they belong to.
unit: test(dev).
Fixes#6328.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200507233636.92104-1-raphaelsc@scylladb.com>
The "current compatibility with DynamoDB" section in alternator.md is where
we should list very briefly our state of compatibility - it's not the right
place to explain implementation details or track obscure bugs. I've
significantly shortened the "Tags" section because, in brief, we do
fully support tags and should say that we do.
I moved the two bugs mentioned in the text into the bug tracker:
Refs #6389
Refs #6391
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200507125022.22608-1-nyh@scylladb.com>
Allow examining partition and clustering keys, by adding support for
full and prefix compound types. The members of the compound type are
specified by passing several types with --type on the command line.
The patch implements:
- /storage_service/auto_compaction API endpoint
- /column_family/autocompaction/{name} API endpoint
Those APIs allow to control and request the status of background
compaction jobs for the existing tables.
The implementation introduces the table::_compaction_disabled_by_user.
Then the CompactionManager checks if it can push the background
compaction job for the corresponding table.
New members
===
table::enable_auto_compaction();
table::disable_auto_compaction();
bool table::is_auto_compaction_disabled_by_user() const
Test
===
Tests: unit(sstable_datafile_test autocompaction_control_test), manual
$ ninja build/dev/test/boost/sstable_datafile_test
$ ./build/dev/test/boost/sstable_datafile_test --run_test=autocompaction_control_test -- -c1 -m2G --overprovisioned --unsafe-bypass-fsync 1 --blocked-reactor-notify-ms 2000000
The test tries to submit a compaction job after playing
with autocompaction control table switch. However, there is
no reliable way to hook pending compaction task. The code
assumed that with_scheduling_group() closure will never
preempt execution of the stats check.
Revert
===
Reverts commit c8247ac. In previous version the execution
sometimes resulted into the following error:
test/boost/sstable_datafile_test.cc(1076): fatal error: in "autocompaction_control_test":
critical check cm->get_stats().pending_tasks == 1 || cm->get_stats().active_tasks == 1 has failed
This version adds a few sstables to the cf, starts
the compaction and awaits until it is finished.
API change
===
- `/column_family/autocompaction/` always returned `true` while answering to the question: if the autocompaction disabled (see https://github.com/scylladb/scylla-jmx/blob/master/src/main/java/org/apache/cassandra/db/ColumnFamilyStore.java#L321). now it answers to the question: if the autocompaction for specific table is enabled. The question logic is inverted. The patch to the JMX is required. However, the change is decent because all old values were invalid (it always reported all compactions are disabled).
- `/column_family/autocompaction/` got support for POST/DELETE per table
Fixes
===
Fixes#1488Fixes#1808Fixes#440
Signed-off-by: Ivan Prisyazhnyy <ivan@scylladb.com>
Reviewed-by: Glauber Costa <glauber@scylladb.com>
Currently the available actions are documented in several different
places:
* code implementing them
* description
* documentation for --action
* error message that validates value for --action
This is guaranteed to result in incorrect, possibly self-contradicting
documentation. Resolve by generating all documentation from the handler
registry, which now also contains the description of the action.
Also have a separate flag for each action, instead of --action=$ACTION.
Alternator supports four different write isolation policies, the default
being to do all the writes with LWT, but these policies were only briefly
explained in alternator.md.
This patch significantly expands on this explanation, better explaining
the tradeoffs involved in these four options, and when each might make
sense (if at all).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200506235152.18190-1-nyh@scylladb.com>
The single-key sstable reader uses the clustering ranges from the slice
to determine the upper bound of the disk read-range using the index.
For this is simply uses the end bound of the last clustering ranges. For
reverse reads however the clustering ranges in the slice are in reverse
order, so this will in fact be the upper bound of the smallest range.
Depending on whether the distance between the clustering range is big
enough for the sstable reader to use the index to skip between them,
this will lead to either reading too little data or an assert failure.
This patch fixes the problematic function `get_slice_upper_bound()` to
consider reverse reads as well.
Initially I thought there will be more mishandling of reverse slices,
but actually `mutation_fragment_filter`, the component doing the actual
slicing of rows, is already reverse-slice aware.
A unit test which reproduces the assert failure is also added.
Fixes: #6171
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200507114956.271799-1-bdenes@scylladb.com>
"
Garbage collected SSTables, created by incremental compaction process,
are being added to the SSTable set using a function that invalidates
row cache using the range of the SSTable itself. That's incorrect
because data in GC SSTables come from preexisting SSTables in set,
meaning the state of data isn't changed and so no need for
invalidation at all. Incorrect invalidation like this is a source of
read performance issues. This problem is fixed by including GC
SSTables to the descriptor which is used to specify changes to the
SSTable set, which is the correct thing to do given that a midway
failure could leave the set in an incorrect state.
Fixes#5956.
Fixes#6275.
tests: unit(dev)
"
* 'fix_issue_5956_v4' of github.com:raphaelsc/scylla:
sstables/compaction: Don't invalidate row cache when adding GC SSTable to SSTable set
sstables/compaction: Change meaning of compaction_completion_desc input and output fields
sstables/compaction: Clean up code around garbage_collected_sstable_writer
The shutdown process of compaction manager starts with an explicit call
from the database object. However that can only happen everything is
already initialized. This works well today, but I am soon to change
the resharding process to operate before the node is fully ready.
One can still stop the database in this case, but reshardings will
have to finish before the abort signal is processed.
This patch passes the existing abort source to the construction of the
compaction_manager and subscribes to it. If the abort source is
triggered, the compaction manager will react to it firing and all
compactions it manages will be stopped.
We still want the database object to be able to wait for the compaction
manager, since the database is the object that owns the lifetime of
the compaction manager. To make that possible we'll use a future
that is return from stop(): no matter what triggered the abort, either
an early abort during initial resharding or a database-level event like
drain, everything will shut down in the right order.
The abort source is passed to the database, who is responsible from
constructing the compaction manager.
Tests: unit (dev), manual start+stop, manual drain + stop
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20200506184749.98288-1-glauber@scylladb.com>
This is unrelated to counters, but happens to fix#4209
`tuple::delayed_value::contains_bind_marker` used to check that
ALL terms are bound (not that ANY of them is bound). As a result,
scylla would crash in prepare codepath for collections of tuples.
After this fix `invalid_request_exception` is thrown instead.
* jul-stas-4209-crash-on-counter-shards-set:
boost/tests: test for bound variable in a list of tuple literals
cql3: fix detection of bound variables in tuples
So that nested exceptions are not lost. Also, marshal exceptions, the
ones we have in these places, already have a backtrace, so might as well
use that, instead of creating a new one, loosing unwound frames.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200507091405.244544-1-bdenes@scylladb.com>
Add a couple of cql tests regarding conditional batches:
1. Verify that "delete" takes priority over "insert"
when applied to the same row within the same batch.
2. Test that a workaround for the issue works as expected (i.e.
delete only individual cells instead of the full record).
Tests: unit(dev)
Fixes: #6273
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200506201200.176590-1-pa.solodovnikov@scylladb.com>
`tuple::delayed_value::contains_bind_marker` used to check that
ALL terms are bound (not that ANY of them is bound). As a result,
scylla would crash in prepare codepath for collections. After this
fix `invalid_request_exception` is thrown instead.
Fixes#4209
We must unregister the monitor upon destruction to prevent use-after-free
from `compaction_backlog_tracker::backlog` path.
This is similar to ~compaction_read_monitor as implemented
in commit ca284174d0Fixes#6385
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200506214419.569655-1-bhalevy@scylladb.com>
In commit da3bf20e71 we supposedly enabled
support for Cassandra's "start_native_transport" option which can be set to
0 to run Scylla without listening on the CQL port. This can be useful, for
example, if a user only want the DynamoDB or Redis APIs but not CQL.
Unfortunately, the option was still marked "Unused", so it wasn't really
enabled as a valid command line option. This patch fixes that, and
documents the start_native_transport option in docs/protocols.md, where
we document the different protocols, ports, and options to configure them.
Fixes#6387.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200506174850.13616-1-nyh@scylladb.com>
fmt, the formatting library we use, detects types with conversion
to std::string_view (and formats them as strings) and types that
support operator<<(std::ostream, const T&) (and performs custom
formatting on them). However, if <fmt/ostream.h>, the latter is
not done.
The problem happens with seastar::sstring, which implements both,
and debug mode, which disables inlining. Some translation units
do include <fmt/ostream.h>, and so generate code to do custom
formatting. exception_utils.cc doesn't, and so generates code
to format via string_view conversion. At link time, the
compiler picks one of the generated functions and includes it
in the final binary; it happened to pick one generated outside
exception_utils.cc, using custom formatting.
However, there is also code in fmt to encode which path fmt
chose - string_view or custom. This code is constexpr and so
is evaluated in exception_utils.cc. The result is that the
function to perform formatting of seastar::sstring uses custom
formatting, while the descriptor containing the method used
says it is formatting via string_view. This is enough to cause
a crash.
The problem is limited to debug mode, since in other modes
all this code is inlined, and so is consistent within the
translation unit.
We need a more general fix (hopefully in fmt), but for now a
simple fix is to add the missing include.
Ref https://github.com/fmtlib/fmt/issues/1662
"
CDC has to create CDC streams that are co-located with corresponding BaseTable data. This is not always easy. Especially for small vnodes. This PR introduces new partitioner which allows us to easily find such stream ids that the stream belongs to a given vnode and shard.
The idea is that a partitioner accepts only keys that are a blob composed of two int64 numbers. The first number is the token of the key.
Tests: unit(dev), dtests(CDC)
"
* haaawk-cdc_partitioner:
cdc:use CDCPartitioner for CDC Log
dht: Add find_first_token_for_shard
dht: use long_token in token::to_int64
cdc: add CDCPartitioner
stream_id: add token_from_bytes static function
i_partitioner: Stop distinguishing whether keys order is preserved
CDC writes are not expected to be read multiple times so it makes little sense
to cache them. Moreover, CDC Log puts much bigger pressure on memory usage than
Base Table because some updates to the Base Table override existing data while
related CDC Log updates are always a new entry in a memtable.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
table::update_cache has two branches of its logic.
One when caching is enabled and the other when it's
disabled. This patch adds unconditional cache invalidation
to the second (disabled caching) branch.
This is done for two purposes. First and foremost, it gives
the guarantee that when we enable the cache later it will be in
the right state and will be ready for usage. This is because
any memtable flush that would logically invalidate the cache,
actually physically does that too now. An additional benefit of this
change is that disabled cache will be cleared during the next
memtable flush that will happen after turning the switch off.
Previously, the cache would also be emptied but it would take
more time before all its elements are removed by eviction.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Previously 'WITH CACHING =' was ignored both in
CREATE TABLE and in ALTER TABLE statements.
Now it will be persisted in schema so that
it can be used later to control caching per table.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Using shared_ptr's in `unrecognized_entity_exception` can lead
to cross-cpu deletion of a pointer which will trigger an assert
`_cpu == std::this_thread::get_id()' when shared_ptr is disposed.
Copy `column_identifier` to the exception object and avoid using
an instance of `cql3::relation`: just get a string representation
from it since nothing more is used in associated exception
handling code.
Fixes: #6287
Tests: unit(dev, debug), dtest(lwt_destructive_ddl_test.py:LwtDestructiveDDLTest.test_rename_column)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200506155714.150497-1-pa.solodovnikov@scylladb.com>
When generating view updates, an endpoint can appear both
as a primary paired endpoint for the view update, and as a pending
endpoint (due to range movements). In order not to generate
the same update twice for the same endpoint, the paired endpoint
is removed from the list of pending endpoints if present.
Fixes#5459
Tests: unit(dev),
dtest(TestMaterializedViews.add_dc_during_mv_insert_test)
Following up on 91b71a0b1a
We also need to serialize storage_service::true_snapshots_size
with snapshot-modifying operations.
It seems like it was assumed that get_snapshot_details
is done under run_snapshot_list_operation, but the one called
here is the table method, not the api::storage_service::get_snapshot_details.
Fixes#5603
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200506115732.483966-1-bhalevy@scylladb.com>
Both `cql3::column_condition` and `cql3::column_condition::raw`
classes are marked as `final`: it's safe to use lw_shared_ptr
instead of generic `seastar::shared_ptr`.
Tests: unit(dev, debug)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200428202249.82785-1-pa.solodovnikov@scylladb.com>
This patch adds a comprehensive, hopefully complete, test for the
yet-unimplemented FilterExpression feature. FilterExpression is the
modern syntax which allows filtering the results of Query and Scan requests.
The patch includes 50 tests spanning more than 700 lines of code,
testing (hopefully) all the various FilterExpression features,
sub-cases, syntax peculiarities, and so on.
As usual, all included tests pass when run against DynamoDB
("pytest --aws") and xfail when run against Scylla.
This test should be helpful to understand how to implement
FilterExpression correctly, as well as test the future implementation.
Refs #5038.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200503165639.15320-1-nyh@scylladb.com>
We often have to examine raw values, obtained from various sources, like
sstables, logs and coredumps. For some types it is quite simple to
convert raw hex values to human readable ones manually (integers), for
others it is very hard or simply not practical. This command-line tool
aims to ease working with raw values, by providing facilities to print
them in human readable form and compare them. We can extend it with more
functions as needed.
Examples:
$ scylla_types -a print -t Int32Type b34b62d4
-1286905132
$ scylla_types -a compare -t 'ReversedType(TimeUUIDType)' b34b62d46a8d11ea0000005000237906 d00819896f6b11ea00000000001c571b
b34b62d4-6a8d-11ea-0000-005000237906 > d0081989-6f6b-11ea-0000-0000001c571b
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200505124914.104827-1-bdenes@scylladb.com>
* seastar 3c2e27811...e708d1df3 (10):
> Merge "Fix a few issues found by clang's asan" from Rafael
> seastar: app_template: allow a description to be provided for the app
> membarrier: fix madvise(MADV_DONTNEED) failure and crash with --lock-memory
Fixes#6346
> rpc::compressor: Fix static init fiasco with names
> fair_queue: express all internal fair_queue quantities as fair_queue_tickets
> net: remove API v1 compatibility layer (variadic future in networking)
> testing: Move parts of the exchanger out of line
> on_internal_error: add overload taking an std::exception_ptr
> tuple_utils: Add a missing include
> Merge "Fix use of uninitialized found by valgrind" from Rafael
Garbage collected SSTable is incorrectly added to SSTable set with a function
that invalidates row cache. This problem is fixed by adding GC SStable
to set using mechanism which replaces old sstables with new sstables.
Also, adding GC SSTable to set in a separate call is not correct.
We should make sure that GC SSTable reaches the SSTable set at the same time
its respective old (input) SSTable is removed from the set, and that's done
using a single request call to table.
Fixes#5956.
Fixes#6275.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
input_sstables is renamed to old_sstables and is about old SSTables that should be
deleted and removed from the SSTable set.
output_sstables is renamed to new_sstables and is about new SSTable that should be
added to the SSTable set, replacing the old ones.
This will allow us, for example, to add auxiliary SSTables to SSTable set using
the same call which replaces output SSTables by input SSTables in compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This cleanup allows us to get rid of the ugly compaction::create_new_sstable(),
and reduce complexity by getting rid of observable.
garbage_collected_sstable_writer::data is introduced to allow compaction to
directly communicate with the GC writer, which is stored in mutation_compaction,
making it unreachable after the compaction has started. By making compaction
store GC writer's data and using that same data to create g__c__s__w,
compaction is able to communicate with GC writer without the complexity of
observable utility. This move is important for the subsequent work which
will fix a couple of issues regarding management of GC SSTables.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Alternator server used to print a startup log line for each shard,
which is redundant and creates churn for nodes with many cores.
Instead of all that, a single line is now printed once alternator
server properly boots.
Fixes#6347
Tests: manual(boot), unit(dev)
Currently the following scenario may happen:
Consider 3 nodes A, B and C and a LWT failed write operation that
managed to get V accepted on A. The value is read twice. First read
access B and C and returns nothing. Next one access A and B, notices
failed round and completes it. Returns value V. Since two consequent
reads without any writes in the middle return different value this
breaks linearisability.
This happens because read does not do full paxos round. The patch
makes read code to reuse the same logic as write by writing a dummy
value which ensures that complete paxos round is used.
This patch change the table snapshot implementation to support multiple
tables.
The method for taking a snapshot using a single table was modified to
use the new implementation.
To support multiple tables, the method now takes a vector of tables and
it loops over it.
Relates to #6333
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Currently the following scenario may happen:
Consider 3 nodes A, B and C and a LWT failed write operation that
managed to get V accepted on A. Next operation may be conditioned on a
value been V, but it may access nodes B and C first and fail. Retrying
the same operation without any writes in the middle may now access A
and B and succeed since it will notice V and will complete previous
transaction. Having to different outcome for the same operation without
any writes in the middle breaks linearisability.
This happens because when condition is unmet we abandon the paxos round,
so this patch makes us complete it with empty value. Now if first
conditional write after failure access B and C it will write accepted
ballot there with the value greater than one of V and V will no longer be
replayed ever.
Change the way query result is passed from getting a reference to a
result to getting a foreign_ptr<lw_shared_ptr<query::result>>. This will
allow cas_request to keep it without copying.
Currently, the test seems to use the tmpdir class in a wrong way,
just to get a path to a temporary directory.
It should keep the tmpdir object around for the duration of the test
so the temporary directory will be automatically removed when the test
completes.
Refs #6344
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200504153810.202218-1-bhalevy@scylladb.com>
This feature will ensure that caching can be switched
off per table only after the whole cluster supports it.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Scylla inherits from Origin two caching parameters
(keys and rows_per_partition) that are ignored.
This patch adds a new parameter called "enabled"
which is true by default and controls whether cache
is used for a selected table or not.
If the parameter is missing in the map then it
has the default value of true. To minimize the impact
of this change, enabled == true is represented as an
absence of this parameter.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Maximum item depth accepted by DynamoDB is 32, and alternator
chose 39 as its arbitrary value in order to provide 7 shining
new levels absolutely free of charge. Unfortunately, our code
which checks the nesting level in rapidjson parsing bumps
the counter by 2 for every object, which is due to rapidjson's
internal implementation. In order to actually support
at least 32 levels, the threshold is simply doubled.
This commit comes with a test case which ensures that
32-nested items are accepted both by alternator and DynamoDB.
The test case failed for alternator before the fix.
Fixes#6366
Tests: unit(dev), alternator(local, remote)
Right now the compaction_manager needs to be started explicitly.
We may change it in the future, but right now that's how it is.
Everything works now even without it, because compaction_manager::stop
happens to work even if it was not started. But it is technically
illegal.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20200504143048.17201-1-glauber@scylladb.com>
The issue is that the mount is /var/lib/scylla/coredump ->
/var/lib/systemd/coredump. But we need to do the opposite in order to
save the coredump on the partition that Scylla is using:
/var/lib/systemd/coredump-> /var/lib/scylla/coredump
Fixes#6301
"
Fixes#6067
Makes the scylla endpoint initializations that support TLS use reloadable certificate stores, watching used cert + key files for changes, and reload iff modified.
Tests in separate dtest set.
"
* elcallio-calle/reloadable-tls:
transport: Use reloadable tls certificates
redis: Use reloadable tls certificates
alternator: Use reloadable tls certificates
messaging_service: Use reloadable TLS certificates
Changes messaging service rpc to use reloadable tls
certificates iff tls is enabled-
Note that this means that the service cannot start
listening at construction time if TLS is active,
and user need to call start_listen_ex to initialize
and actually start the service.
Since "normal" messaging service is actually started
from gms, this route too is made a continuation.
std::gmtime() has a sad property of using a global static buffer
for returning its value. This is not thread-safe, so its usage
is replaced with gmtime_r, which can accept a local buffer.
While no regressions where observed in this particular area of code,
a similar bug caused failures in alternator, so it's better to simply
replace all std::gmtime calls with their thread-safe counterpart.
Message-Id: <39e91c74de95f8313e6bb0b12114bf12c0e79519.1588589151.git.sarna@scylladb.com>
Add EX option for SET command, to set TTL for the key.
A behavior of SET EX is same as SETEX command, it just different syntax.
see: https://redis.io/commands/set
Scylla returns the wrong error code (0000 - server internal error)
in response to trying to do authentication/authorization operations
that involves a non-existing role.
This commit changes those cases to return error code 2200 (invalid
query) which is the correct one and also the one that Cassandra
returns.
Tests:
Unit tests (Dev)
All auth and auth_role dtests
The compaction_manager test lives inside a thread and it is not taking
advantage of it, with continuations all over.
One of the side effects of it is that the test is calling stop() twice
on the compaction_manager. While this works today, it is not good
practice. A change I am making is just about to break it.
This patch converts the test to fully use .get() instead of chained
continuations and in doing so also guarantees that the compaction
manager will be RAII-stopped just one, from a defer object.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20200503161420.8346-2-glauber@scylladb.com>
Local system tables from `system` namespace use LocalStrategy
replication, so they do not need to be concerned about gc grace
period. Some system tables already set gc grace period to 0,
but other ones, including system.large_partitions, did not.
That may result in millions of tombstones being needlessly
kept for these tables, which can cause read timeouts.
Fixes#6325
Tests: unit(dev), local(running cqlsh and playing with system tables)
"
Intrusive containers often have references between containers elements
that point to some non-first word of the element. This references
currently fly below the radar of `scylla find` and `scylla
generate-object-graph`, as they are looking to references to only the
first word of the objects. So objects that are members of an intrusive
container often appear to have no inbound references at all.
This patch-set improves support for finding such references by looking
for references to non-first words of objects.
It also includes some generic, minor improvements to scylla
generate_object_graph.
"
* 'scylla-gdb.py-scylla-generate-object-graph-linked-lists/v1' of https://github.com/denesb/scylla:
scylla-gdb.py: scylla generate_object_graph: make label of initial vertice bold
scylla-gdb.py: scylla generate_object_graph: remove redundant lookup
scylla-gdb.py: scylla generate_object_graph: print "to" offsets
scylla-gdb.py: scylla generate-object-graph: use value-range to find references
scylla-gdb.py: scylla find: allow finding ranges of values
scylla-gdb.py: find_in_live(): return pointer_metadata instances
Loading SSTables from the main directory is possible, to be compatible with
Cassandra, but extremely dangerous and not recommended.
From the beginning, we recommend using an separate, upload/ directory.
In all this time, perhaps due to how the feature's usefulness is reduced
in Cassandra due to the possible races, I have never seen anyone coming
from Cassandra doing procedures involving refresh at all.
Loading SSTables from the main directory forces us to disable writes to
the table temporarily until the SSTables are sorted out. If we get rid of
this, we can get rid of the disabling of the writes as well.
We can't do it now because if we want to be nice to the odd user that may
be using refresh through the main directory without our knowledge we should
at least error out.
This patch, then, does that: it errors out if SSTables are found in the main
directory. It will not proceed with the refresh, and direct the user to the
upload directory.
The main loop in reshuffle_sstables is left in place structurally for now, but
most of it is gone. The test for is is deleted.
After a period of deprecation we can start ignoring these SSTables and get rid
of the lock.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20200429144511.13681-1-glauber@scylladb.com>
When using interposers, cancelling compactions can leave futures
that are not waited for (resharding, twcs)
The reason is when consume_end_of_stream gets called, it tries to
push end_of_stream into the queue_reader_handle. Because cancelling
a compaction is done through an exception, the queue_reader_handle
is terminated already at this time. Trying to push to it generates
another exception and prevents us from returning the future right
below it.
This patch adds a new method is_terminated() and if we detect
that the queue_reader_handle is already terminated by this point,
we don't try to push. We call it is_terminated() because the check
is to see if the queue_reader_handle has a _reader. The reader is
also set to null on successful destruction.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Reviewed-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200430175839.8292-1-glauber@scylladb.com>
Background:
Replace operation is used to replace a dead node in the cluster.
Currently during replace operation, the replacing node does not take any
writes. As a result, new writes to a range after the sync for that range
is done, e.g., after streaming for that range is finished, will not be
synced to the replacing node. Hinted hand off or repair after the
replacing operation will help. But it is better if we can make the
writes to the replacing node to avoid any post replacing operation
actions.
After this series and repair based node operation series, the replace
operation will guarantee the replacing node has all the latest copy of
data including the new writes during the replace operation. In short, no
more repairs before or after the replacing operation. Just replacing the
node is enough.
Implementation:
Filter the node being replaced out of the natural endpoints in
storage_proxy, so that:
The node being replaced will not be selected as the target for
normal write or normal read.
Do not depend on the gossip liveness to avoid selecting replacing node
for normal write or normal read when the replacing node has the same
ip address as the node being replaced. No more special handling for
hibernate state in gossip which makes it is simpler and more robust.
Replacing node will be marked as UP.
Put the replacing node in the pending list, so that:
Replacing node will take writes but write to replacing will not be
counted as CL.
Replacing node will not take normal read.
Example:
For example, with RF = 3, n1, n2, n3 in the cluster, n3 is dead and
being replaced by node n4. When n4 starts:
writes to nodes {n1, n2, n3} are changed to
normal_replica_writes = {n1, n2} and pending_replica_writes= {n4}.
reads to nodes {n1, n2, n3} are changed to
normal_replica_reads = {n1, n2} only.
This way, the replacing node n4 now takes writes but does not take reads.
Tests:
Measure the number of writes during pending period that is the
replacing starts and finishes the replace operation.
Start 5 nodes, n1 to n5.
Stop n5
Start write in the background
Start n6 to replace n5
Get scylla_database_total_writes metrics when the replacing node announces HIBERNATE (replacing) and NORMAL status.
Before:
2020-02-06 08:35:35.921837 Get metrics when other knows replacing node = HIBERNATE
2020-02-06 08:35:35.939493 scylla_database_total_writes: node1={'scylla_database_total_writes': 15483}
2020-02-06 08:35:35.950614 scylla_database_total_writes: node2={'scylla_database_total_writes': 15857}
2020-02-06 08:35:35.961820 scylla_database_total_writes: node3={'scylla_database_total_writes': 16195}
2020-02-06 08:35:35.978427 scylla_database_total_writes: node4={'scylla_database_total_writes': 15764}
2020-02-06 08:35:35.992580 scylla_database_total_writes: node6={'scylla_database_total_writes': 331}
2020-02-06 08:36:49.794790 Get metrics when other knows replacing node = NORMAL
2020-02-06 08:36:49.809189 scylla_database_total_writes: node1={'scylla_database_total_writes': 267088}
2020-02-06 08:36:49.823302 scylla_database_total_writes: node2={'scylla_database_total_writes': 272352}
2020-02-06 08:36:49.837228 scylla_database_total_writes: node3={'scylla_database_total_writes': 274004}
2020-02-06 08:36:49.851104 scylla_database_total_writes: node4={'scylla_database_total_writes': 262972}
2020-02-06 08:36:49.862504 scylla_database_total_writes: node6={'scylla_database_total_writes': 513}
Writes = 513 - 331
After:
2020-02-06 08:28:56.548047 Get metrics when other knows replacing node = HIBERNATE
2020-02-06 08:28:56.560813 scylla_database_total_writes: node1={'scylla_database_total_writes': 290886}
2020-02-06 08:28:56.573925 scylla_database_total_writes: node2={'scylla_database_total_writes': 310304}
2020-02-06 08:28:56.586305 scylla_database_total_writes: node3={'scylla_database_total_writes': 304049}
2020-02-06 08:28:56.601464 scylla_database_total_writes: node4={'scylla_database_total_writes': 303770}
2020-02-06 08:28:56.615066 scylla_database_total_writes: node6={'scylla_database_total_writes': 604}
2020-02-06 08:29:10.537016 Get metrics when other knows replacing node = NORMAL
2020-02-06 08:29:10.553257 scylla_database_total_writes: node1={'scylla_database_total_writes': 336126}
2020-02-06 08:29:10.567181 scylla_database_total_writes: node2={'scylla_database_total_writes': 358549}
2020-02-06 08:29:10.581939 scylla_database_total_writes: node3={'scylla_database_total_writes': 351416}
2020-02-06 08:29:10.595567 scylla_database_total_writes: node4={'scylla_database_total_writes': 350580}
2020-02-06 08:29:10.610548 scylla_database_total_writes: node6={'scylla_database_total_writes': 45460}
Writes = 45460 - 604
As we can see the replacing node did not take write before and take write after the patch.
Check log of writer handler in storage_proxy
storage_proxy - creating write handler for token: -2642068240672386521,
keyspace_name=ks, original_natrual={127.0.0.1, 127.0.0.5, 127.0.0.2},
natural={127.0.0.1, 127.0.0.2}, pending={127.0.0.6}
The node being replaced, n5=127.0.0.5, is filtered out and the replacing
node, n6=127.0.0.6 is in the pending list.
* asias/replace_take_writes:
storage_service: Make replacing node take writes
repair: Use token_metadata with the replacing node in do_rebuild_replace_with_repair
abstract_replication_strategy: Add get_ranges which takes token_metadata
abstract_replication_strategy: Add get_natural_endpoints_without_node_being_replaced
abstract_replication_strategy: Add allow_remove_node_being_replaced_from_natural_endpoints
token_metadata: Calculate pending ranges for replacing node
storage_service: Unify handling of replaced node removal from gossip
storage_service: Update tokens and replace address for replace operation
The .get_ep_stat(ep) call can throw when registering metrics (we have
issue for it, #5697). This is not expected by it callers, in particular
abstract_write_response_handler::timeout_cb breaks in the middle and
doesn't call the on_timeout() and the _proxy->remove_response_handler(),
which results in not removed and not released responce handler. In turn
not released response handler doesn't set the _ready future on which
response_wait() waits -> stuck.
Although the issue with .get_ep_stat() should be fixed, an exception in
it mustn't lead to deadlocks, so the fix is to make the get_ep_stat()
noexcept by catching the exception and returning a dummy stat object
instead to let caller(s) finish.
Fixes#5985
Tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200430163639.5242-1-xemul@scylladb.com>
"
This series fix hang in multishard_writer when error happens. It contains
- multishard_writer: Abort the queue attached to consumers when producer fails
- repair: Fix hang when the writer is dead
Fixes#6241
Refs: #6248
"
* asias-stream_fix_multishard_writer_hang:
repair: Fix hang when the writer is dead
mutation_writer_test: Add test_multishard_writer_producer_aborts
multishard_writer: Abort the queue attached to consumers when producer fails
"
The backlog_controller has a timer that periodically accesses the
sstable writers of ongoing writes.
This patch series makes sure we remove entries from the list of ongoing
writes before the corresponding sstable writer is destroyed.
Fixes#6221.
"
* 'espindola/fix-6221-v5' of https://github.com/espindola/scylla:
sstables: Call revert_charges in compaction_write_monitor::write_failed
sstables: Call monitor->write_failed earlier.
sstables: Add write_failed to the write_monitor interface
"Enabling TTL feature, add setex and ttl commands to use it."
* 'redis_setex_ttl' of git://github.com/syuu1228/scylla:
redis: add test for setex/ttl
redis: add ttl command
redis: add setex command
"Add test for lolwut command, and also fix a bug on lolwut found by the test."
* 'redis_lolwut_test' of git://github.com/syuu1228/scylla:
redis: lolwut parameter fix
redis-test: add lolwut test
Background:
Replace operation is used to replace a dead node in the cluster.
Currently during replace operation, the replacing node does not take any
writes. As a result, new writes to a range after the sync for that range
is done, e.g., after streaming for that range is finished, will not be
synced to the replacing node. Hinted hand off or repair after the
replacing operation will help. But it is better if we can make the
writes to the replacing node to avoid any post replacing operation
actions.
After this series and repair based node operation series, the replace
operation will guarantee the replacing node has all the latest copy of
data including the new writes during the replace operation. In short, no
more repairs before or after the replacing operation. Just replacing the
node is enough.
Implementation:
1) Filter the node being replaced out of the natural endpoints in
storage_proxy, so that:
- The node being replaced will not be selected as the target for
normal write or normal read.
- Do not depend on the gossip liveness to avoid selecting replacing node
for normal write or normal read when the replacing node has the same
ip address as the node being replaced. No more special handling for
hibernate state in gossip which makes it is simpler and more robust.
Replacing node will be marked as UP.
2) Put the replacing node in the pending list, so that:
- Replacing node will take writes but write to replacing will not be
counted as CL.
- Replacing node will not take normal read.
Example:
For example, with RF = 3, n1, n2, n3 in the cluster, n3 is dead and
being replaced by node n4. When n4 starts:
- writes to nodes {n1, n2, n3} are changed to
normal_replica_writes = {n1, n2} and pending_replica_writes= {n4}.
- reads to nodes {n1, n2, n3} are changed to
normal_replica_reads = {n1, n2} only.
This way, the replacing node n4 now takes writes but does not take reads.
Tests:
1) Measure the number of writes during pending period that is the
replacing starts and finishes the replace operation.
- Start 5 nodes, n1 to n5.
- Stop n5
- Start write in the background
- Start n6 to replace n5
- Get scylla_database_total_writes metrics when the replacing node announces HIBERNATE (replacing) and NORMAL status.
Before:
2020-02-06 08:35:35.921837 Get metrics when other knows replacing node = HIBERNATE
2020-02-06 08:35:35.939493 scylla_database_total_writes: node1={'scylla_database_total_writes': 15483}
2020-02-06 08:35:35.950614 scylla_database_total_writes: node2={'scylla_database_total_writes': 15857}
2020-02-06 08:35:35.961820 scylla_database_total_writes: node3={'scylla_database_total_writes': 16195}
2020-02-06 08:35:35.978427 scylla_database_total_writes: node4={'scylla_database_total_writes': 15764}
2020-02-06 08:35:35.992580 scylla_database_total_writes: node6={'scylla_database_total_writes': 331}
2020-02-06 08:36:49.794790 Get metrics when other knows replacing node = NORMAL
2020-02-06 08:36:49.809189 scylla_database_total_writes: node1={'scylla_database_total_writes': 267088}
2020-02-06 08:36:49.823302 scylla_database_total_writes: node2={'scylla_database_total_writes': 272352}
2020-02-06 08:36:49.837228 scylla_database_total_writes: node3={'scylla_database_total_writes': 274004}
2020-02-06 08:36:49.851104 scylla_database_total_writes: node4={'scylla_database_total_writes': 262972}
2020-02-06 08:36:49.862504 scylla_database_total_writes: node6={'scylla_database_total_writes': 513}
Writes = 513 - 331
After:
2020-02-06 08:28:56.548047 Get metrics when other knows replacing node = HIBERNATE
2020-02-06 08:28:56.560813 scylla_database_total_writes: node1={'scylla_database_total_writes': 290886}
2020-02-06 08:28:56.573925 scylla_database_total_writes: node2={'scylla_database_total_writes': 310304}
2020-02-06 08:28:56.586305 scylla_database_total_writes: node3={'scylla_database_total_writes': 304049}
2020-02-06 08:28:56.601464 scylla_database_total_writes: node4={'scylla_database_total_writes': 303770}
2020-02-06 08:28:56.615066 scylla_database_total_writes: node6={'scylla_database_total_writes': 604}
2020-02-06 08:29:10.537016 Get metrics when other knows replacing node = NORMAL
2020-02-06 08:29:10.553257 scylla_database_total_writes: node1={'scylla_database_total_writes': 336126}
2020-02-06 08:29:10.567181 scylla_database_total_writes: node2={'scylla_database_total_writes': 358549}
2020-02-06 08:29:10.581939 scylla_database_total_writes: node3={'scylla_database_total_writes': 351416}
2020-02-06 08:29:10.595567 scylla_database_total_writes: node4={'scylla_database_total_writes': 350580}
2020-02-06 08:29:10.610548 scylla_database_total_writes: node6={'scylla_database_total_writes': 45460}
Writes = 45460 - 604
As we can see the replacing node did not take write before and take write after the patch.
2) Check log of writer handler in storage_proxy
storage_proxy - creating write handler for token: -2642068240672386521,
keyspace_name=ks, original_natrual={127.0.0.1, 127.0.0.5, 127.0.0.2},
natural={127.0.0.1, 127.0.0.2}, pending={127.0.0.6}
The node being replaced, n5=127.0.0.5, is filtered out and the replacing
node, n6=127.0.0.6 is in the pending list.
Fixes: #5482
We will change the update of tokens in token_metadata in the next patch
so that the tokens of the replacing node are updated to token_metadata
only after the replace operation is done. In order to get the correct
ranges for the replacing node in do_rebuild_replace_with_repair, we need
to use a copy of token_metadata contains the tokens of the replacing
node.
Refs: #5482
It is useful when the caller wants to calculate ranges using a
custom token_metadata.
It will be used soon in do_rebuild_replace_with_repair for replace
operation.
Refs: #5482
Decide if the replication strategy allow removing the node being replaced from
the natural endpoints when a node is being replaced in the cluster.
LocalStrategy is the not allowed to do so because it always returns the node
itself as the natural_endpoints and the node will not appear in the
pending_endpoints.
It is needed by the "Make replacing node take writes" work.
Refs: #5482
The make_descriptor() function parses a string representation of sstable
version using a ternary operator. Clean it up by using
sstables::from_string(), which is future-proof when we add support for
later sstable formats.
Message-Id: <20200429082126.15944-1-penberg@scylladb.com>
Currently, after the replacing node finishes the replace operation, it
removes the node being replaced from gossip directly in
storage_service::join_token_ring() with gossiper::replaced_endpoint(),
so the gossip states for the replaced node is gone.
When other nodes knows the replace operation is done, they will call
storage_service::remove_endpoint() and gossiper::remove_endpoint() to
quarantine the node but keep the gossip states. To prevent the
replacing node learns the state of replaced node again from existing
node again, the replacing node uses 2X quarantine time.
This makes the gossip states for the replaced node different on other
nodes and replacing nodes. It makes it is harder to reason about the
gossip states because the discrepancy of the states between nodes.
To fix, we unify the handling of replaced node on both replacing node
and other nodes. On all the nodes, once the replacing node becomes
NORMAL status, we remove the replaced node from token_metadata and
quarantine it but keep the gossip state. Since the replaced node is no
longer a member of the cluster, the fatclient timer will count and
expire and remove the replaced node from gossip.
Refs: #5482
The motivation is to make the replacing node has the same view of the
token ring as the rest of the cluster.
If the replacing node has the same ip of the node being replaced, we
should update the tokens in token_metadata when the replace operation
starts, so that this replacing node and the rest of the cluster see the
same token ring.
If the replacing node has the different ip address of the node being
replaced, we should update the tokens in token_metadata only when
replace operation is done, because the other nodes will update the
replacing node's token in token_metadata when the replace operation is
done.
Refs: #5482
The alternator test, test/alternator/run, runs Scylla and runs the
various tests against it. Before this patch, just booting Scylla took
about 26 seconds (for a dev build, on my laptop). This patch reduces
this delay to less than one second!
It turns out that almost the entire delay was artificial, two periods
of 12 seconds "waiting for the gossip to settle", which are completely
unnecessary in the one-node cluster used in the Alternator test.
So a simple "--skip-wait-for-gossip-to-settle 0" parameter eliminates
these long delays completely.
Amusingly, the Scylla boot is now so fast, that I had to change a "sleep 2"
in the test script to "sleep 1", because 2 seconds is now much more than
it takes to boot Scylla :-)
Fixes#6310.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200428145035.22894-1-nyh@scylladb.com>
Client drivers act differently on errors codes they don't recognize.
Adding new errors codes is considered a protocol extension and
should be negotiated with the client.
This change keeps `overflow_error_exception` internally but uses
the INVALID cql error code to return the error message back to the client
similar to keyspace_not_defined_exception.
We (and cassandra) already use `invalid_request_exception` extensively
to return various errors related to invalid values or types in the query.
Fixes#6264
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Reviewed-by: Gleb Natapov <gleb@scylladb.com>
Message-Id: <20200422130011.108003-1-bhalevy@scylladb.com>
The ScanIndexForward parameter is now fully implemented
and can accept ScanIndexForward=false in order to query
the partitions in reverse clustering order.
Note that reading partition slices in reverse order is less
efficient than forward scans and may put a strain on memory
usage, especially for large partitions, since the whole partition
is currently fetched in order to be reversed.
Fixes#5153
1. Remove the `versioned_value::factory` class, it didn't add any value. It just
forced us to create an object for making `versioned_value`s, for no sensible
reason.
2. Move some `versioned_value` deserialization code (string -> internal data
structures) into the versioned_value module. Previously, it was scattered all
around the place.
3. Make `gossiper::get_seeds` const and return a const reference.
I needed these refactors for a PR I was preparing to fix an issue with CDC. The
attempt of fixing the issue failed (I'm trying something different now), but the
refactors might be useful anyway.
* kbr--vv-refactor:
gossiper: make `get_seeds` method const and return a const ref
versioned_value: remove versioned_value::factory class
gms: move TOKENS string deserialization code into versioned_value
This patch allows users of storage_proxy::cas() to supply nullptr
as `query::read_command` which is supposed to skip the procedure
of reading the existing value.
The feature is used in alternator code for Read-Modify-Write
operations: some of them don't require reading previous item
values before updating.
Move `read_nothing_read_command` from alternator code to
storage_proxy layer and fabricate a new no-op command from it when
storage_proxy::cas() is used with nullptr read_command.
This allows to avoid sprinkling if-else branches all over the code
in order to check for null-equality of `cmd`.
We return from storage_proxy::query() very early with an empty
result in case we're given an empty partition_slice (which resides
inside the passed `read_command`) so this approach should be
perfectly fine.
Expand documentation for the `cas()` function to cover new
possible value for `cmd` argument.
Fixes: #6238
Tests: unit(dev, debug)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200428065235.5714-1-pa.solodovnikov@scylladb.com>
Consdier:
When repair master gets data from repair follower:
1) apply_rows_on_master_in_thread is called
2) a repair writer is created with _repair_writer.create_writer
3) the repair writer fails
4) data is written to the queue _mq[node_idx]->push_eventually attached
with the writer
Since the writer is dead. No one is going to fetch data from the _mq
queue. The apply_rows_on_master_in_thread will block forever.
To fix, when the writer is failed, we should abort the _mq queue.
Refs: #6248
If no keyspace is specified when taking snapshot, there will be a segfault
because keynames is unconditionally dereferenced. Let's return an error
because a keyspace must be specified when column families are specified.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200427195634.99940-1-raphaelsc@scylladb.com>
This patch adds a '--with-seastar=<PATH>' option to configure.py, which
allows user to override the default seastar submodule path. This is
useful when building packages from source tarballs, for example.
Message-Id: <20200427165511.6448-1-penberg@scylladb.com>
We still call it in the destructor or to cover the successful case. We
can't do that in on_data_write_completed because it is too early.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
A writer is destroyed just before consume_in_thread returns, since the
adapter takes ownership of it.
The problem is that a monitor can keep a reference to the a
writer_offset_tracker that is owned by that writer.
The monitor is accessed periodically via
backlog_controller::_update_timer. This means we have to deregister
from the list of ongoing writes before the writer is destroyed.
If the write fails, the deregistration happens in write_failed, but it
is currently called after the writer is destroyed.
This patch moves the call to write_failed to the writer destructor as
I could not find a convenient location to put it.
Since the writer is destroyed in consume_in_thread, we could call it
there, but then we also have to update consume.
The is a similar problem with the case where the sstable is written
correctly. That will be fixed in the next patch.
Fixes#6221.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Only database_sstable_write_monitor needs it so far, but the call
needs to be moved earlier, which requires calling it in code paths
that don't know about database_sstable_write_monitor.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
that code doesn't run under a thread, so let's futurize it.
the code worked with single cpu because get() returns right away
due to no deferring point.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200427155303.82763-1-raphaelsc@scylladb.com>
The Alternator test's run script, test/alternator/run, runs Scylla.
By default, it chooses the last built Scylla executable build/*/scylla.
However, test.py has a "mode" option, that should be able to choose which
build mode to run. Before this patch, this mode option wasn't honored by
the Alternator test, so a "test.py alternator/run" would run the same
Scylla binary (the one last built) three times, instead of running each
of the three build modes.
We fix this in this patch: test.py now passes the "SCYLLA" environment
variable to the test/alternator/run script, indicating the location of the
Scylla binary with the appropriate build mode. The script already supported
this environment variable to override its default choice of Scylla binary.
In test.py, we add to the run_test() function an optional "env" parameter
which can be used to pass additional environment variables to the test.
Fixes#6286
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200427131958.28248-1-nyh@scylladb.com>
If you run "dbuild" on a freshly installed machine, the error message is
not the most helpful one. Fix it up.
Before:
$ ./tools/toolchain/dbuild
./tools/toolchain/dbuild: line 113: docker: command not found
./tools/toolchain/dbuild: line 156: docker: command not found
After:
$ ./tools/toolchain/dbuild
dbuild: Please install Docker on this machine to run dbuild.
Run `./tools/toolchain/dbuild --help' to print the full help message.
Message-Id: <20200426192746.11034-1-penberg@scylladb.com>
Fixes#6202
Distributed loader sstable opening is gated through the
database::sstable_load_concurrency_sem() semaphore
(at a concurrency of 3).
This is (according to creation comment) to reduce memory footprint
during bootstrap, by partially serializing the actual opening of
existing sstables.
However, in certain versions of the product, there exist circular
dependencies between data in some sstables and the ability to actually
read others. Thus when gated as above, we can end up with the
dependents acquiring the semaphore fully, and once stuck waiting for
population of their dependency effectively blocking this from ever
happening.
Since we probably do not want to remove the concurrency control,
and increasing it would only push the problem further away,
we solve the issue by adding the ability to mark certain keyspaces
as "prioritized" (pre-bootstrap), and allow them to populate outside
the normal concurrency control semaphore. Concurrency increase is
however limited to one extra sstable per shard and prio keyspace.
Message-Id: <20200415102431.20816-1-calle@scylladb.com>
from Juliusz.
CL of LOCAL_QUORUM used to be hardcoded into CDC preimage query
and led to an error every time the number of replicas was lower than CL
could require. The solution here is to link the CLs of writes
to base table with the CLs of CDC reads, so the client will get
the (limited) control over the consistency of preimage SELECTs
(instead of constant misleading errors).
The algorithm is as follows:
1. If write that caused CDC activity was done with CL = ANY,
then do preimage read with CL = ONE.
2. If write that caused CDC activity was done with CL = ALL,
then do preimage read with CL = QUORUM.
3. SERIAL and LOCAL_SERIAL writes cause preimage read with QUORUM
and LOCAL_QUORUM, respectively.
4. In other cases do preimage read with the same CL as base write.
To further mitigate the incomprehensible error being sent to client,
I wrapped the preimage's SELECT query in try-catch and
intercept the `unavailable_exception`, which was manifesting as
`NoHostAvailable` in Python and Java drivers. Now client gets a new
error code and a message specific to the issue of CL not being met
by the preimage query.
Fixes#5746
* jul-stas-5746-cdc-replication-factor:
cdc: fix the "NoHostAvailable" client error when CL is not met
cdc: CL for preimage select is calculated from base write CL
This commit resolves the client-observable effect of CDC read
consistencies. I wrapped the preimage's SELECT query in try-catch to
intercept the `unavailable_exception`, which led to misleading
`NoHostAvailable` in Python and Java drivers. Now client gets a new
error code and a message specific to the issue of CL not being met
by the preimage query.
Fixes#5746
Queries with `ALLOW FILTERING` and constraints on counter
values used to be rejected as "unimplemented". The reason
was a missing tri-comparator, which is added in this patch.
Fixes#5635
* jul-stas-5635-filtering-on-counters:
cql/tests: Added test for filtering on counter columns
counters: add comparator and remove `unimplemented` from restrictions
The xxhash library has been packaged by Fedora, so we can use it
instead of carrying the submodule. This reduces allows us to
receive updates as the OS packages are updated. Build time will
not be reduced since it is a header-only library.
xxhash preserves the hash results across versions so rolling
upgrades will still work.
The frozen toolchain is updated with the new package.
Tests: unit (dev)
On Centos 7 machine:
fstrim.timer not enabled, only unmasked due scylla_fstrim_setup on installation
When trying run scylla-fstrim service manually you get error:
Traceback (most recent call last):
File "/opt/scylladb/scripts/libexec/scylla_fstrim", line 60, in <module>
main()
File "/opt/scylladb/scripts/libexec/scylla_fstrim", line 44, in main
cfg = parse_scylla_dirs_with_default(conf=args.config)
File "/opt/scylladb/scripts/scylla_util.py", line 484, in parse_scylla_dirs_with_default
if key not in y or not y[k]:
NameError: name 'k' is not defined
It caused by error in scylla_util.py
Fixes#6294.
The "jobs" script is used to determine the amount of compilation
parallelism on a machine. It attempts to ensure each GCC process has at
least 4 GB of memory per core. However, in the worst case scenario, we
could end up having the GCC processes take up all the system memory,
forcin swapping or OOM killer to kick in. For example, on a 4 core
machine with 16 GB of memory, this worst case scenario seems easy to
trigger in practice.
Fix up the problem by keeping a 1 GB of memory reserve for other
processes and calculating parallelism based on that.
Message-Id: <20200423082753.31162-1-penberg@scylladb.com>
When generating tokens for parallel scan, debug mode undefined behavior
sanitizer complained that integer overflow sometimes happens when
multiplying two big values - delta and segment number.
In order to mitigate this warning, the multiplication is now split
into two smaller ones, and the generated machine code remains
identical (verified on gcc and clang via compiler explorer).
Fixes#6280
Tests: unit(dev)
There's no indication that data needed for generating view updates
from staging sstables is going to be immediately useful for the
user, and a large amount of it can push hot rows out of the cache,
thus deteriorating performance.
Fixes#6233
Tests: unit(dev)
We want to test that a std::bad_alloc is thrown, but GCC 10 has a new
optimization (-fallocation-dce) that removes dead allocations.
This patch assigns the value returned by new to a global so that GCC
cannot delete it.
With this all tests in a dev build pass with GCC 10.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200424201531.225807-1-espindola@scylladb.com>
The configure option is --static-stdc++, to is surprising that it also
enables -static-libgcc.
Also, -static-libgcc doesn't seem to work with debug builds.
This patch removes -static-libgcc which fixes debug builds with
--static-stdc++. Such builds are convenient for testing new versions
of gcc.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200424214117.257195-1-espindola@scylladb.com>
This patch has two goals -- speed up the total partitions
calculations (walking databases is faster than walking tables),
and get rid og row_cache._partitions.size() call, which will
not be available on new _partitions collection implementation.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200423133900.27818-1-xemul@scylladb.com>
Current code gets table->row_cache->cache_tracker->region and sums
up the region's used space for all tables found.
The problem is that all row_cache-s share the same cache_tracker
object from the database, thus the resulting number is not correct.
Fix this by walking cache_tracker-s from databases instead.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200423133755.27187-1-xemul@scylladb.com>
These callbacks can block a seastar thread and the underlying vector
can be reallocated concurrently.
This is no different than if it was a plain std::vector and the
solution is similar: use values instead of references.
Fixes#6230
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200422182304.120906-1-espindola@scylladb.com>
This produces more compact code and avoids the anti-pattern of
building a map with statically known values. If the values are given
to GCC via a switch statement it can do a much better job at compile
time than libstdc++ can at runtime.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200422224905.198794-1-espindola@scylladb.com>
Alternator is supposed to use RF=3 for new tables. Only when the cluster is
smaller than 3 nodes do we use RF=1 (and warn about it) - this is useful for
testing.
However, our implementation incorrectly tested the number of *live* nodes in
the cluster instead of the total number of nodes. As a result, if a 3-node
cluster had one node down, and a new table was created, it was created with
RF=1, and immediately could not be written because when RF=1, any node down
means part of the data is unavailable.
This patch fixes this: The total number of nodes in the cluster - not the
number of live nodes - is consulted. The three-node-cluster-with-a-dead-node
setup above creates the table with RF=3, and it can be written because two
living nodes out of three are enough when RF=3 and we do quorum writes and
reads.
We have a dtest to reproduce this bug (and its fix), and it's also easy to
reproduce manually by starting a 3-node cluster, killing one of the nodes,
and then running "pytests". Before this patch, the tests can create tables
but then fail to write to them. After this patch, the test succeed on the
same cluster with the dead node.
Fixes#6267
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200422182035.15106-2-nyh@scylladb.com>
The gossiper has a convenience functions get_up_endpoint_count() and
get_down_endpoint_count(), but strangely no function to get the total
number. Even though it's easy to calculate the total by summing up their
result it is inefficient and also incovenient because of of these
functions returns a future.
So let's add another function, get_all_endpoint_count(), to get the
total number of nodes. We will use this function in the next patch.
Signed-off-by: Nadav Har'El <n...@scylladb.com>
Message-Id: <20200422182035.15106-1-nyh@scylladb.com>
After fixing issue #6260, the "parallel scan" feature in Alternator is
supported, so drop the sentence in alternator.md saying that it isn't.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200422090738.21648-1-nyh@scylladb.com>
The Alternator test boots Scylla to test against it. We set an arbitrary
timeout for this boot to succeed: 100 seconds. This 100 seconds is
significantly more than 25 seconds it takes on my laptop, and I though
we'll never reach it. But it turns out that in some setups - running the
very slow debug build on slow and overcommitted nodes - 100 seconds is
not enough.
So this patch doubles the timeout to 200 seconds.
Note that this "200 seconds" is just a timeout, and doesn't affect normal
runs: Both a successful boot and a failed boot are recognized as soon as
they happen, and we never unnecessarily wait the entire 200 seconds.
Fixes#6271.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200422193920.17079-1-nyh@scylladb.com>
This will allow deterministic stream_id generation
and would remove the risk of not being able to generate
a stream id for some vnode.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
* seastar-dev.git gleb/lwt-shared-proposal:
lwt: pass paxos::proposal as a shared pointer everywhere
lwt: do not copy proposal in paxos_state::accept
lwt: make load_paxos_state to take partition_key_view instead of a
deference
Some caller have partition_key_view, but not partition_key, so thy need
to create a temporary and copy just to pass a reference. Change it by
accepting a view.
paxos::proposal reference is passed into a lot of functions and sometimes
it has to be copied to prolong its lifetime. Create it as a shared
pointer and pass it everywhere to avoid those copies.
Fixes#6265
Return type for read_log_file was previously changed from
subscription to future<>, returning the previously returned
subscriptions result of done(). But it did not preserve the
subscription itself, which in turn will cause us to (in
work::stream), call back into a deleted object.
Message-Id: <20200422090856.5218-1-calle@scylladb.com>
Parallel scans can be performed by providing Segment and TotalSegments
attributes to Scan request, which can be used to split the work among
many workers.
This test makes the parallel scan test succeed, so the xfail is removed.
Fixes#5059
The constructor of `read_command` is used both by IDL and clients in the
code. However, this constructor has a parameter that is not used by IDL:
`read_timestamp`. This requires that this parameter is the very last in
the list and that new parameters that are used by IDL are added before
it. One such new parameter was `bool is_first_page`. Adding this
parameter right before the read timestamp one created a situation where
the last parameter (read_timestamp) implicitly converts to the one
before it (is_first_page). This means that some call sites passing
`read_timestamp` were now silently converting this to `is_first_page`,
effectively dropping the timestamp.
This patch aims to rectify this, while also avoiding similar accidents
in the future, by making `is_first_page` a `bool_class` which doesn't
have any implicit convertions defined. This change does not break the
ABI as `bool_class` is also sent as a `bool` on the wire.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Tests: unit(dev)
Message-Id: <20200422073657.87241-1-bdenes@scylladb.com>
Casts only depend on their operands, so a plain function pointer is
sufficient. This allows replacing all the make_castas_* functions that
return a lambda with plain castas_* functions that do the casting.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200413162014.23884-2-espindola@scylladb.com>
It is currently not possible to wrap the view_updating_consumer in an
std::optional. I intend to do it to allow for compactions to optionally
generate view updates.
The reason for that is that view_updating_consumer has a reference as a
member, which makes the move assignment constructor not be implicitly
generated.
This patch fixes it by keeping a pointer instead of a reference.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20200421123648.8328-1-glauber@scylladb.com>
This is a special partitioner that will be used by
CDC Log. It works only with partition key that is blob
composed of two ints. The first int is a token this
partitioner will map the key to. The second int is there
to make it possible to create multiple keys that are different
from each other but map to the same token.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Scylla inherited a concept of partitioners that preserve order of keys from
the origin but it is not used for anything. Moreover, none of the existing
partitioners preserves keys order. The only partitioner that did this in the
past was ByteOrderedPartitioner and Scylla does not support it any more.
For a partitioner to preserve an order of the keys means that if there are two
keys A and B such that A < B then token(A) < token(B) where token(X) isa token
the partitioner assignes to key X.
This patch removes dht::i_partitioner::preserves_order with all its overrides.
The only place that was using this member function was a check in thrift server
and it is safe to remove the check because the check was only done
to differentiate the error message for partitioners that do and do not preserve
the order of the keys.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
CL of LOCAL_QUORUM used to be hardcoded into CDC preimage query
and led to an error when number of replicas was lower than CL
would require. The solution here is to link the CLs of writes
to base table with the CLs of CDC reads, so the client will get
the (limited) control over the consistency of preimage SELECTs
(instead of getting error every time).
The algorithm is as follows:
1. If write that caused CDC activity was done with CL = ANY,
then do preimage read with CL = ONE.
2. If write that caused CDC activity was done with CL = ALL,
then do preimage read with CL = QUORUM.
3. SERIAL and LOCAL_SERIAL writes cause preimage read with QUORUM
and LOCAL_QUORUM, respectively.
4. In other cases do preimage read with the same CL as base write.
Currently, the alternator tests configure scylla to use all the
logical cores in the host system, but only 1GB of RAM. This can lead
to a small amount of memory per core.
It also uses the default disk configuration, which is safe, but can be
very slow on mechanical or non-enterprise disks.
Change to use a fixed --smp 2 configuration, and add --overprovisioned
for maximum flexibility (no spinning). Use --unsafe-bypass-fsync
for faster performance on non-enterprise or mechanical disks, assuming
that the test data is not important.
Fixes#6251.
Message-Id: <20200420154112.123386-1-avi@scylladb.com>
Merged patch series from Piotr Sarna:
This series allows reading rows from Scylla's system tables
via alternator by using a virtual interface.
If a Query or Scan request intercepts a table name with the following
pattern: .scylla.alternator.KEYSPACE_NAME.TABLE_NAME, it will read
the data from Scylla's KEYSPACE_NAME.TABLE_NAME table.
The interface is expected to only return data for Scylla system tables
and trying to access regular tables via this interface is expected
to return an error.
This series comes with tests (alternator-test, scylla_only).
Fixes#6122
Tests: alternator-test(local,remote (to verify that scylla_only works)
Piotr Sarna (5):
alternator: add fallback serialization for all types
alternator: add fetching static columns if they exist
alternator: add a way of accessing system tables from alternator
alternator-test: add scylla-only test for querying system tables
docs: add an entry about accessing Scylla system tables
alternator-test/test_system_tables.py | 61 +++++++++++++++++++++++++++
alternator/executor.cc | 38 ++++++++++++++++-
alternator/executor.hh | 1 +
alternator/serialization.cc | 11 +++--
docs/alternator/alternator.md | 15 +++++++
5 files changed, 122 insertions(+), 4 deletions(-)
create mode 100644 alternator-test/test_system_tables.py
And do the same with CDC_STREAMS_TIMESTAMP.
The code that took a list of tokens represented as a string inside
versioned_value (for gossiping) and deserialized it into
an `unordered_set<dht::token>` lived in the storage_service module,
while the code that did the serializing (set -> string) lived in
versioned_value. There was a similar situation with the CDC generation
timestamp.
To increase maintanability and reusability, the deserialization code is
now placed next to the serialization code in versioned_value.
Furthermore, the `make_full_token_string`, `make_token_string`, and
`make_cdc_streams_timestamp_string` (serialization functions) are moved
out of versioned_value::factory and made static methods of
versioned_value instead.
We have this in multishard_writer:
future<uint64_t> multishard_writer::operator()() {
return distribute_mutation_fragments().finally([this] {
return wait_pending_consumers();
}).then([this] {
return _consumed_partitions;
});
}
The wait_pending_consumers which waits for the consumers to finish is
called even when distribute_mutation_fragments fails.
When distribute_mutation_fragments fails and the failure is due to the
producer fails, consumers can wait for data which will never come because
the producer has failed already. This can cause a deadlock.
To fix, when distribute_mutation_fragments fails, we should abort the
queues that are attached to the readers used by the consumers.
Fixes#6241
-2^63 is a value reserved for min/max token boundaries and shouldn't be used for
regular tokens. This patch fixes get_random_token to never create token with
value -2^63. On the way dht::get_random_number template method is removed
because it was exclusively used by get_random_token.
Also use uniform_int_distribution with int64_t instead of uint64_t by using
correct constructor parameter that guarantees values between -2^63+1 and 2^63-1
inclusively.
Tests: unit(dev)
Fixes#6237.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <0a1a939355f5005039d5c2c7c513bad94cf60be2.1587302093.git.piotr@scylladb.com>
It turned out we cannot drop the information about most recent commit
entirely since it is used to cut off already outdate accepted values.
Otherwise the following scenario can happen:
1. cas1 prepares on A, B, C, gets one accept from A
2. cas2 prepares on B, C, gets 2 accepts on B and C, learns on B, C
3. cas3 initiates a prepare on A, learns about cas1's accept,
4. cas2 learns on A, prunes on A, B, C
Now cas3 will reply cas1's value because it does not know that it is
less than already committed on (removed during step 4).
The patch drops only committed value and keep the information about
latest committed ballot.
Fixed#6154
PAXOS node is allowed to accept a proposal without promising it
first as long as its ballot is greater than already promised one. Treat
such accepted ballot as promised since 'learn' stage removes accepted
ballot, but we still want to remember it as the latest promised one.
The goal is to be closer to formal PAXOS specification.
Max and min windows are microsecond timestamps, which should be divided
by window size in microseconds to properly estimate window count
based on provided mutation_source_metadata.
Found this problem after properly setting mutation_source_metadata with
min and max metadata on behalf of regular compaction.
Fixes#6214.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200409194235.6004-2-raphaelsc@scylladb.com>
"
Includes:
- code cleanups
- support for measuring data stores with more than one partition
- measure sstable footprint for all supported formats
- less verbose mode by default
"
* tag 'memory-footprint-test-improvement-v2' of github.com:tgrabiec/scylla:
test: memory_footprint: Silence logging by default
test: memory_footprint: Introduce --partition-count option
test: memory_footprint: Run under a cql_test_env
test: memory_footprint: Calculate sstable size for each format version
sstables: Move all_sstable_versions to version.hh
In order to avoid confusion with regard to whose responsibility
it is to sort the key columns (see #5856), the interface which allows
adding columns to the builder with explicit column id
is moved to a private function. An internal with_column_ordered()
overload is maintained to be used for internal operations,
but it's encouraged to use simpler with_column() in new code.
Fixes#6235
Tests: unit(dev)
When multiple key columns (clustering or partition) are passed to
the schema constructor, all having the same column id, the expectation
is that these columns will retain the order in which they were passed to
`schema_builder::with_column()`. Currently however this is not
guaranteed as the schema constructor sort key columns by column id with
`std::sort()`, which doesn't guarantee that equally comparing elements
retain their order. This can be an issue for indexes, the schemas of
which are built independently on each node. If there is any room for
variance between for the key column order, this can result in different
nodes having incompatible schemas for the same index.
The fix is to use `std::stable_sort()` which guarantees that the order
of equally comparing elements won't change.
This is a suspected cause of #5856, although we don't have hard proof.
Fixes: #5856
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
[avi: upgraded "Refs" to "Fixes", since we saw that std::sort() becomes
unstable at 17 elements, and the failing schema had a
clustering key with 23 elements]
Message-Id: <20200417121848.1456817-1-bdenes@scylladb.com>
The io_priority parameter used when generating view updates from
streaming is used by the sstable reader, so it should use the I/O priority
for streaming *read* operations, not streaming *write* operations.
Fixes#6231
Tests: unit(dev)
Commitlog replay is given a filename prefix to filter files against, but it
ignores it. As a result we will replay anything in that directory, including
recycled segments, which is wasteful.
Fix by adding a check for the prefix.
Tests: unit (dev), manual test that regular commitlog files are not
filtered.
Message-Id: <20200416174542.133230-1-avi@scylladb.com>
Some legacy `mc` SSTables (created in Scylla 3.0) may contain incorrect
serialization headers, which don't wrap frozen UDTs nested inside collections
with the FrozenType<...> tag. When reading such SSTable,
Scylla would detect a mismatch between the schema saved in schema
tables (which correctly wraps UDTs in the FrozenType<...> tag) and the schema
from the serialization header (which doesn't have these tags).
SSTables created in Scylla versions 3.1 and above, in particular in
Scylla versions that contain this commit, create correct serialization
headers (which wrap UDTs in the FrozenType<...> tag).
This commit does two things:
1. for all SSTables created after this commit, include a new feature
flag, CorrectUDTsInCollections, presence of which implies that frozen
UDTs inside collections have the FrozenType<...> tag.
2. when reading a Scylla SSTable without the feature flag, we assume that UDTs
nested inside collections are always frozen, even if they don't have
the tag. This assumption is safe to be made, because at the time of
this commit, Scylla does not allow non-frozen (multi-cell) types inside
collections or UDTs, and because of point 1 above.
There is one edge case not covered: if we don't know whether the SSTable
comes from Scylla or from C*. In that case we won't make the assumption
described in 2. Therefore, if we get a mismatch between schema and
serialization headers of a table which we couldn't confirm to come from
Scylla, we will still reject the table. If any user encounters such an
issue (unlikely), we will have to use another solution, e.g. using a
separate tool to rewrite the SSTable.
Fixes#6130.
Clean up the alternator.md document, by:
* Updating out-of-date information that outstayed its welcome.
* When Scylla does have a feature but it's just not supported via the
DynamoDB API (e.g., CDC and on-demand backups) mention that.
* Remove mention of Alternator being experimental and users should not
store important data on it :-)
* Miscellaneous cleanups.
Fixes#6179.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200412094641.27186-1-nyh@scylladb.com>
There is no reason to read a single SSTable at a time from the staging
directory. Moving SSTables from staging directory essentially involves
scanning input SSTables and creating new SSTables (albeit in a different
directory).
We have a mechanism that does that: compactions. In a follow up patch, I
will introduce a new specialization of compaction that moves SSTables
from staging (potentially compacting them if there are plenty).
In preparation for that, some signatures have to be changed and the
view_updating_consumer has to be more compaction friendly. Meaning:
- Operating with an sstable vector
- taking a table reference, not a database
Because this code is a bit fragile and the reviewer set is fundamentally
different from anything compaction related, I am sending this separately
* glommer-view_build:
staging: potentially read many SSTables at the same time
view_build_test: make sure it works with smp > 1
There is no reason to read a single SSTable at a time from the staging
directory. Moving SSTables from staging directory essentially involves
scanning input SSTables and creating new SSTables (albeit in a different
directory).
We have a mechanism that does that: compactions. In a follow up patch, I
will introduce a new specialization of compaction that moves SSTables
from staging (potentially compacting them if there are plenty).
In preparation for that, some signatures have to be changed and the
view_updating_consumer has to be more compaction friendly. Meaning:
- Operating with an sstable vector
- taking a table reference, not a database
Because this code is a bit fragile and the reviewer set is fundamentally
different from anything compaction related, I am sending this separately
Signed-off-by: Glauber Costa <glauber@scylladb.com>
This test doesn't work with higher smp counts, because it relies on
dealing with keys named 'a' and 'b' and creates SSTables containing one
of them manually. This throws an exception if we happen to execute on
a shard that don't own the tokens corresponding to those keys.
This patch avoids that problem by pre-selecting keys that we know to
belong to the current shard in which the test is executed.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Rename inherited metrics cas_propose and cas_commit
to cas_accept and cas_learn respectively.
A while ago we made a decision to stick to widely accepted
terms for Paxos rounds: prepare, accept, learn. The rest
of the code is using these terms, so rename the metrics
to avoid confusion/technical debt.
While at it, rename a few internal methods and functions.
Fixes#6169
Message-Id: <20200414213537.129547-1-kostja@scylladb.com>
Initially we were storing CDC options in scylla tables but then we realized
that we can use schema extensions. Extensions are more flexible and cause less
problems with schema digest.
The transition was done in 4.0 and with that we stopped reading 'cdc' column
in scylla tables. Commit 861c7b5626 removed
the code that used to read 'cdc' column.
Since no Scylla node should be reading 'cdc' column, we can always keep
it empty now. This will allow removal of schema::cdc_options in the future.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
When I/O error (e.g. EMFILE / ENOSPC) happens we hit
an assert in ~append_challenged_posix_file_impl(): Assertion _closing_state == state::closed' failed.
Commit 6160b9017d add close on failure
of the lamda defined in allocate_segment_ex, but it doesn't handle an error
after the file is opened/created while it is wrapped with commitlog_file_extensions.
Refs #5657
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Reviewed-by: Calle Wilund <calle@scylladb.com>
Message-Id: <20200414115231.298632-1-bhalevy@scylladb.com>
Fixes#6195
test_commitlog_delete_when_over_disk_limit reads current segment list
in flush handler, to compare with result after allowing deletetion of
segement. However, it might be called more than once in rare cases,
because timing and us using rather small sizes.
Reading the list the second time however is not a good idea, because
it might just very well be exactly the same as what we read in the
test check code, and we actually overwrite the list we want to
check against. Because callback is on timer. And test is not.
Message-Id: <20200414114322.13268-1-calle@scylladb.com>
"
This is a continuation of recent efforts to cover more and more internal
de-serialization paths with `on_internal_error()`. Errors like this
should always be investigated but this can only be done with a core.
This patch covers the error paths of `composite::iterator` with
`on_internal_error()`. As we need this patch to investigate a 4.0
blocker issue (#6121) it only does the minimal amount of changes needed
to allow generating a core for de-serializiation failures of composites.
There are a few FIXMEs left in the code that I plan to address in a
follow-up.
Ref: #6121
"
* 'compound-on-internal-error/v1' of https://github.com/denesb/scylla:
compound_compat: composite::iterator cover error-paths with on_internal_error()
compound_compat: composite_view: add is_valid()
For a column of type `frozen<list<T>>` in base table, a corresponding
column of type `frozen<map<timeuuid, T>>` is created in cdc log.
Although a similar change of type takes place in case of non-frozen
lists, this is unneeded in case of frozen lists - frozen collections are
atomic, therefore there is no need for complicated type that will be
able to represent a column update that depends on its previous value
(e.g. appending elements to the end of the list).
Moreover, only cdc log table creation logic performs this type change
for frozen lists. The logic of `transformer::transform`, which is
responsible for creation of mutations to cdc log, assumes that atomic
columns will have their types unchanged in cdc log table. It simply
copies new value of the column from original mutation to the cdc log
mutation. A serialized frozen list might be copied to a field that is of
frozen map type, which may cause the field to become impossible to
deserialize.
This patch causes frozen list base table columns to have a corresponding
column in cdc log with the same type.
A test is added which asserts that the type of cdc log columns is not
changed in the case of frozen base columns.
Tests: unit(dev)
Fixes#6172
We have in alternator-test a set of over 340 functional tests for
Alternator. These tests are written in Python using the pytest
framework, expect Scylla to be running and connect to it using the
DynamoDB API with the "boto3" library (the AWS SDK for Python).
We have a script alternator-test/run which does everything needed
to run all these tests: Starts Scylla with the appropriate parameters
in a temporary directory, runs all the tests against it, and makes
sure the temporary directory is removed (regardless of whether the
tests succeeded or failed).
The goal of this small patch series is to integrate these Alternator
tests into test.py in a *simple* way. The idea is that we add *one*
test which just runs the aforementioned "run" script which does its
own business.
The changes we needed to do in this series to achieve this are:
1. Make the alternator-test/run script pick a unique IP address on which
to listen, instead of always using 127.0.0.1. This allows running
this test in parallel with dtest tests, or even parallel to itself.
2. Move the alternator-test directory to test/alternator. This is
the directory where test.py expects all the tests to live in.
It also makes sense - since we already have multiple subdirectories
in test/, to put the Alternator tests there too.
3. Add a new test suite type, "Run". A "Run" suite is simply a directory
with a script called "run", and this script is run to run the entire
suite, and this script does its own business.
4. Tests (such as the new "Run" ones) who can be killed gently and clean
up after themselves, should be killed with SIGTERM instead of
SIGKILL.
After this series, to run the Alternator tests from test.py, do:
./test.py --mode dev alternator
Note that in this version, the "--mode" has no effect - test/alternator/run
always runs the latest compiled Scylla, regardless of the chosen mode.
This can be fixed later.
The Alternator tests can still be run manually and individually against
a running Scylla or DynamoDB as before - just go to the test/alternator
directory and run "pytest" with the desired parameters.
Fixes#6046
* nyh/alternator-test-v3:
alternator-test: make Alternator tests runnable from test.py
test.py: add xunit XML output file for "Run" tests
test.py: add new test type "Run"
test.py: flag for aborting tests with SIGTERM, not SIGKILL
alternator-test: change "run" script to pick random IP address
alternator-test: add "--url" option to choose Alternator's URL
Since 956b092012 (Merge "Repair based node
operation" from Asias), repair is used by other node operations like
bootstrap, decommission and so on.
Send the reason for the repair, so that we can handle the materialized
view update correctly according to the reason of the operation. We want
to trigger the view update only if the repair is used by repair
operation. Otherwise, the view table will be handled twice, 1) when the
view table is synced using repair 2) when the base table is synced using
repair and view table update is triggered.
Fixes#5930Fixes#5998
Currently, lolwut with some parameters output broken square,
such as "lolwut 10 1 1":
127.0.0.1:6379> lolwut 10 1 1
⠀⡤⠤⠤⠤⠤⠤⠤⠤⠤
⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀
⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀
⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀
It because we passes incorrect parameters on draw_schotter().
Fixes#5808
Seems some gcc:s will generate the code as sign extending. Mine does not,
but this should be more correct anyhow.
Added small stringify test to serialization_test for inet_address
"
We currently allow null on the right-hand side of certain relations, while Cassandra prohibits it. Since our handling of these null values is mostly incorrect, it's better to match Cassandra in prohibiting it.
See the discussion (https://github.com/scylladb/scylla/pull/5763#discussion_r405557323.
NB: any reverse mismatch (Scylla prohibiting something that Cassandra allows) is left remaining. For example, we forbid null bounds on clustering columns, which Cassandra allows.
Tests: unit (dev)
"
* dekimir-match-cass-null:
restrictions: Forbid null bound for nonkey columns
restrictions: Forbid null equality
To make the tests in alternator-test runnable by test.py, we need to
move the directory alternator-test/ to test/alternator, because test.py
only looks for tests in subdirectories of test/. Then, we need to create
a test/alternator/suite.yaml saying that this test directory is of type
"Run", i.e., it has a single run script "run" which runs all its tests.
The "run" script had to be slightly modified to be aware of its new
location relative to the source directory.
To run the Alternator tests from test.py, do:
./test.py --mode dev alternator
Note that in this version, the "--mode" has no effect - test/alternator/run
always runs the latest compiled Scylla, regardless of the chosen mode.
The Alternator tests can still be run manually and individually against
a running Scylla or DynamoDB as before - just go to the test/alternator
directory (instead of alternator-test previously) and run "pytest" with
the desired parameters.
Fixes#6046
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Assumes that "Run" tests can take the --junit-xml=<path> option, and
pass it to ask the test to generate an XML summary of the run to a file
like testlog/dev/xml/run.1.xunit.xml.
This option is honored by the Alternator tests.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a new test type, "Run". A test subdirectory of type "Run"
has a script called "run" which is expected to run all the tests in that
directory.
This will be used, in the next patch, by the Alternator functional tests.
These tests indeed have a "run" script, which runs Scylla and then runs
*all* of Alternator's tests, finishing fairly quickly (in less than a
minute). All of that will become one test.py test.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Today, if test.py is interrupted with SIGINT or SIGTERM, the ongoing test
is killed with SIGKILL. Some types of tests - such as Alternator's test -
may depend on being killed politely (e.g., with SIGTERM) to clean up
files.
We cannot yet change the signal to SIGTERM for all tests, because Seastar
tests often don't deal well with signals, but we can at least add a flag
that certain test types - that know they can be killed gently - will use.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Before this patch, the Alternator tests "run" script ran Scylla on a fixed
listening address, 127.0.0.1. There is a problem that there might be other
concurrent runs of Scylla using the same IP address - e.g., CCM (used by
dtest) uses exactly this IP address for its first node.
Luckily, Linux's loopback device actually allows us to pick any of over
a million addresses in 127.0.0.0/8 to listen on - we don't need to use
127.0.0.1 specifically. So the code in this patch picks an address in
127.1.*.*, so it cannot collide with CCM (which uses 127.0.0.* for up to
255 nodes). Moreover, the last two bytes of the listen address are picked
based on the process ID of the run script; This allows multiple copies
of this script to run concurrently - in case anybody wishes to do that.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The "--aws" and "--local" test options chooses between two useful default
URLs - Amazon's, or http://localhost:8000 for a local installation.
However, sometimes one wants to run Scylla on a different IP address or
port, so in this patch we add a "--url" option to choose a specific URL to
connect to. For example, "--url http://127.1.2.3:1234".
We will later use this option in the alternator-test/run script, to pick
a random IP address on which to run Scylla, and then run the test against
this address.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This reverts commit 1c444b7e1e. The test
it adds sometimes fails as follows:
test/boost/sstable_datafile_test.cc(1076): fatal error: in "autocompaction_control_test":
critical check cm->get_stats().pending_tasks == 1 || cm->get_stats().active_tasks == 1 has failed
Ivan is working on a fix, but let's revert this commit to avoid blocking
next promotion failing from time to time.
The first test case checks that system tables are readable via
Scan/Query requests.
The second test case checks that it's not possible to read user tables
by using the virtual interface.
The third test case checks that creating a table which looks like
an internal system table pattern (.scylla.alternator.KS_NAME.TABLE_NAME)
is not possible and returns a validation error.
Scylla's system tables often provide interesting information for
clients. In order to be able to access this information without CQL,
a notion of virtual tables is introduced to alternator.
When a table named .scylla.alternator.KS_NAME.TABLE_NAME is accessed
with read-only operation - Query or Scan, Scylla's internal
KS_NAME.TABLE_NAME table will be queried instead. For instance,
if a user wants to read about system_auth.roles, the Scan request
should target the following table: ".scylla.alternator.system_auth.roles".
Fixes#6122
Until now, the list of static column ids was always empty for alternator
tables anyway, so the list wasn't fetched. However, with the virtual
interface of fetching Scylla internal tables, we need to list the ids
of selected static columns explicitly to avoid segfaults - since we
select the whole row, static columns included.
While most types (e.g. boolean) are not valid key types for alternator users,
system tables derived from Scylla may still use this type for keys,
e.g. system_auth.roles. Note that types which are not directly
supported by alternator (e.g. double) will not be representable
out-of-the-box - instead, they simply fall back to string, which is both
human-readable and supported by alternator.
This patch adds API endpoint /column_family/autocompaction/{name}
that listen to GET and POST requests to pick and control table
background compactions.
To implement that the patch introduces "_compaction_disabled_by_user"
flag that affects if CompactionManager is allowed to push background
compactions jobs into the work.
It introduces
table::enable_auto_compaction();
table::disable_auto_compaction();
bool table::is_auto_compaction_disabled_by_user() const
to control auto compaction state.
Fixes#1488Fixes#1808Fixes#440
Tests: unit(sstable_datafile_test autocompaction_control_test), manual
Casting a type to itself doesn't make sense, but it is harmless so allow
it instead of reporting a confusing error message that makes even less
sense:
InvalidRequest: Error from server: code=2200 [Invalid query]
message="org.apache.cassandra.db.marshal.BooleanType cannot be cast
to org.apache.cassandra.db.marshal.BooleanType"
Note that some types already supported self-casting, this patch just
extends this to all types in a forward compatible way.
Fixes: #5102
Tests: unit(dev), manual test casting boolean to boolean.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200408135041.854981-1-bdenes@scylladb.com>
If a table name is not found, it may still exist as a local index,
but the check tried to fetch a local index name regardless if it was
present in the request, which was a nullptr dereference bug.
Fixes#6161
Tests: alternator-test(local, remote)
Message-Id: <428c21e94f6c9e450b1766943677613bd46cbc68.1586347130.git.sarna@scylladb.com>
We typically use `std::bad_function_call` to throw from
mandatory-to-implement virtual functions, that cannot have a meaningful
implementation in the derived class. The problem with
`std::bad_function_call` is that it carries absolutely no information
w.r.t. where was it thrown from.
I originally wanted to replace `std::bad_function_call` in our codebase
with a custom exception type that would allow passing in the name of the
function it is thrown from to be included in the exception message.
However after I ended up also including a backtrace, Benny Halevy
pointed out that I might as well just throw `std:bad_function_call` with
a backtrace instead. So this is what this patch does.
All users are various unimplemented methods of the
`flat_mutation_reader::impl` interface.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200408075801.701416-1-bdenes@scylladb.com>
* seastar fd9af3a26...cce2ddac1 (6):
> rpc: fix build failures in C++14 mode due to std::string_view
> util/backtrace: introduce make_backtraced_exception_ptr()
> future: make do_for_each noexcept
> fair_queue rename the fair_queue_descriptor and change its default init
> future: do_with: make noexcept
> io_queue: batch communication with the fair_queue for ready requests
Fixes#6143
When doing post-image generation, we also write values for columns not
in delta (actual update), based on data selected in pre-image row.
However, if we are doing initial update/insert with only a subset of
columns, when the pre-image result set is nil, this cannot be done.
Adds check to non-touched column post-image code. Also uses the
pre-image value extractor to handle non-atomic sets properly.
Tests updated.
This miniseries provides workarounds for open-ended range tombstones
reportedly appearing in alternator tables. The issue was that
row tombstones created for tables without clustering keys look
like open-ended range tombstones, which confuses the LA/KA format
writer.
Tests: alternator-test(local)
Fixes#6035
Refs #6157
The following sleep injections are added to paxos_state:
* paxos_state_prepare_timeout (timeouts in paxos_state::prepare)
* paxos_state_accept_timeout (timeouts in paxos_state::accept)
* paxos_state_learn_timeout (timeouts in paxos_state::learn)
Tests: unit ({dev}), unit ({debug})
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Message-Id: <20200403092107.181057-1-alejo.sanchez@scylladb.com>
Creating an index on a table with only the partition key
can lead to open-ended range tombstones appearing,
if the indexed column is also the very same partition key -
which is quite a useless case, but it's allowed both by alternator
and DynamoDB. In order to make the tests pass when KA/LA sstables
are used, this test case is hereby skipped until further notice.
Refs #6157
As @tgrabiec helpfully pointed out, creating a row tombstone
for a table which does not have a clustering key in its schema
creates something that looks like an open-ended range tombstone.
That's problematic for KA/LA sstable formats, which are incapable
of writing such tombstones, so a workaround is provided
in order to allow using KA/LA in alternator.
Fixes#6035
I am about to change resharding to block the start of the node. Being a
somewhat slow operation, the timeout of 900 sec is guaranteed to trigger
in large nodes with lots of data.
This patch effectively disables the start timeout, while keeping the
stop timeout unchanged.
My preference would have been to use a timeout extension mechanism
during resharding. Systemd actually has such mechanism, where we can
send a message through sd_notify asking the timeout to be extended.
However such mechanism is not present in SystemD v219, used by RHEL7.
That means for RHEL7 we need a different way to deal with the timeout
anyway.
The second preference is also obviously to write "infinity" as the
timeout value. But guess what? SystemD v219 also has a bug in which
infinity is interepreted as zero
(https://bugzilla.redhat.com/show_bug.cgi?id=1446015)
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20200407155754.10020-1-glauber@scylladb.com>
But only non-validation error paths. When validating we do expect it to
maybe fail, so we don't want to generate cores for validation.
Validation is in fact a de-serialization pass with some additional
checks. To be able to keep reusing the same code for de-serialization
and validation just with different error handling, introduce a
`strict_mode` flag that can be passed to `composite::iterator`
constructor. When in strict mode (the default) the iterator will convert
any `marshal_exception` thrown during the de-serialization to
`on_internal_error()`.
We don't want anybody to use the iterator in non-strict mode, besides
validation, so the iterator constructors are made private. This is
standard practice for iterators anyway.
Freezing is an expensive operation, that involves serializing the entire
mutation. Having an implicit freezing constructor means this can happen
as part of an implicit type conversion without the programmer even
noticing, even when this is not really necessary.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200407080245.234021-1-bdenes@scylladb.com>
Until now this was open-coded in `sstables::validate_min_max_metadata()`.
We want to cover non-validation compound de-serialization error-paths
with `on_internal_error()` and so we need more control over how
compounds are validated. As a first step we want to centralize
validation in the class itself as in the next patches they will use
private APIs to bypass `on_internal_error()` in the error paths during
validation.
Any alter table statement that doesn't explicitly set the default time
to live will reset it to 0.
That can be very dangerous for time series use cases, which rely on
all data being eventually expired, and a default TTL of 0 means
data never being expired.
Fixes#5048.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200402211653.25603-1-raphaelsc@scylladb.com>
podman is compatible with docker, but by default emits a manifest
format that is not understood by old docker clients, so give it
an extra flag to generate the old format instead.
Message-Id: <20200406134526.21521-1-avi@scylladb.com>
There is no reason why the table code has to be aware of the efforts of
rewriting (cleanup, scrub, upgrade) an SSTable versus compacting it.
Rewrite is special, because we need to do it one SSTable at a time,
without lumping it together. However, the compaction manager is totally
capable of doing that itself. If we do that, the special
"table::rewrite_sstables" can be killed.
This code would maybe be better off as a thread, where we wouldn't need
to keep state. However there are some methods like maybe_stop_on_error()
that expect a future so I am leaving this be for now. This is a cleanup
that can be done later.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20200401162722.28780-2-glauber@scylladb.com>
Merged pull request https://github.com/scylladb/scylla/pull/6136 from
Piotr Sarna:
An empty partition/clustering key pair is a valid state of the
query paging state. Unfortunately, recent attempts at debugging
a flaky test (#5856) resulted in introducing an assertion (7616290)
which breaks when trying to generate a key from such a pair.
In order to keep the assertion (since it still makes sense in its
scope), but at the same time translate empty keys properly,
empty keys are now explicitly processed at the beginning of the
function.
This behaviour was 100% reproducible in a secondary index dtest below.
Fixes#6134
Refs #5856
Tests: unit(dev),
dtest(TestSecondaryIndexes.test_truncate_base)
To use install.sh as Scylla install script w/o using .rpm/.deb package,
we need to provide a way to upgrade Scylla version, not just install.
With --upgrade option, install.sh does not overwrite config files.
It will install <filename>.new file on same directory, when old config file and
new config file does not contain same data.
If old one and new one is exactly same, it will nothing.
To implement this, rewriting api_ui_dir/api_doc_dir path on scylla.yaml
moved from .rpm/.deb scriptlet to install.sh.
Fixes#5874
On some environment systemd-coredump does not work with symlink directory,
we can use bind-mount instead.
Also, it's better to check systemd-coredump is working by generating coredump.
To fix#5916, drop scylla_coredump_setup from .rpm %post scriptlet.
Fixes#5753Fixes#5916
"
This patch makes makes major compaction aware of time buckets
for TWCS. That means that calling a major compaction with TWCS
will not bundle all SSTables together, but rather split them
based on their timestamps.
There are two motivations for this work:
Telling users not to ever major compact is easier said than
done: in practice due to a variety of circumstances it might
end up being done in which case data will have a hard time
expiring later.
We are about to start working with offstrategy compactions,
which are compactions that work in parallel with the main
compactions. In those cases we may be converting SSTables from
one format to another and it might be necessary to split a single
big STCS SSTable into something that TWCS expects
In order to achieve that, we start by changing the way resharding works:
it will now work with a read interposer, similar to the one TWCS uses for
streaming data. Once we do that, a lot of assumptions that exist in the
compaction code can be simplified and supporting TWCS major
compactions become a matter of simply enabling its interposer in the
compaction code as well.
There are many further simplifications that this work exposes:
The compaction method create_new_sstable seems out of place. It is not
used by resharding, and it seems duplicated for normal compactions. We
could clean it up with more refactoring in a later patch.
The whole logic of the feed_writer could be part of the consumer code.
Testing details:
scylla unit tests (dev, release)
sstable_datafile_test (debug)
dtests (resharding_test.py)
manual scylla resharding
Fixes#1431
"
Reviewed-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
* 'twcs-major-v3' of github.com:glommer/scylla:
compaction: make major compaction time-aware with TWCS
compaction: do resharding through an interposer
mutation_writer: introduce shard_based splitting writer
mutation_writer: factor out part of the code for the timestamp splitter
compaction: abort if create_new_sstable is called from resharding
If get_schema_for_read() fails "prune" counter will not be decremented.
The patch fixes it by creating RAI object earlier. Also return releasing
of a mutation in release_mutation() which was dropped by mistake.
Fixes#6124
Message-Id: <20200405080233.GA22509@scylladb.com>
Since commit 9948f548a5, the LWT no longer
requires an "experimental" flag, so Alternator documents and scripts
which referred to the need for enabling experimental LWT, are fixed here
to no longer do that.
Fixes#6118.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200405143237.12693-1-nyh@scylladb.com>
When trying to get rid of a large stack warning for gossip test,
I found out that it actually does not run at all for multiple reasons:
1. It segfaults due to wrong initialization order
2. After fixing that, it segfaults on use-after-free (due to capturing
a shared pointer by reference instead of by copy)
3. After that, cleanups are in order:
* seastar thread does not need to be spawned inside another thread;
* default captures are harmful, so they're made explicit instead;
* db::config is moved to heap, to finally get rid of the warning.
Tests: manual(gossip)
Message-Id: <feaca415d0d29a16c541f9987645365310663630.1585128338.git.sarna@scylladb.com>
In order to check regressions related to #6136 and similar issues,
test cases for handling paging state with empty partition/clustering
key pair are added.
An empty partition/clustering key pair is a valid state of the
query paging state. Unfortunately, recent attempts at debugging
a flaky test resulted in introducing an assertion which breaks
when trying to generate a key from such a pair.
In order to keep the assertion (since it still makes sense in its
scope), but at the same time translate empty keys properly,
empty keys are now explicitly processed at the beginning of the
function.
This behaviour was 100% reproducible in a secondary index dtest below.
Fixes#6134
Refs #5856
Tests: unit(dev),
dtest(TestSecondaryIndexes.test_truncate_base)
Specify --pull in order to refresh the base image (some Fedora release).
Usually this is not important, because we run `dnf update`. But if the
cached image happens to be a pre-release version of Fedora, the image
will have the update-testing repository enabled, and we may get some
unwanted updates.
It's sad that we need two separate flags for correctness (the other
is --no-cache.
Message-Id: <20200405164227.8210-1-avi@scylladb.com>
The default serialization path for items was subtly broken -
instead of parsing JSON string representation of objects,
it tried to parse a regular string implementation - which is often
also a valid JSON, but nothing guarantees that it actually is.
Tests: alternator-test(local)
Message-Id: <e1668bf4e9029f2675a4ac28bb4598714575efeb.1586096732.git.sarna@scylladb.com>
Merge patch series from Piotr Sarna:
This series adds extra precautions against potential races
in view building. In particular, it was based on the following scenario:
1. View builder detects that a view V is no longer here, so it schedules
removing its info from bookkeeping, without any semaphores,
and this continuation gets preempted immediately.
2. A view is deleted and recreated with the same name - V.
3. View V building is finished.
4. The continuation from (1.) is finally executed, and it removes old view V
info from bookkeeping - which is a problem, since view building
bookkeeping is based on *names*, not *uuids* - consequently,
the new view bookkeeping info is erroneously removed.
The issue is solved by putting startup code (which also does cleanup
from point (1.)) under the same semaphore as other bookkeeping
operations. With that, it will be impossible to execute step (2.)
before (1.) ends, which effectively prevents the race.
Refs #6094 (possible fixes it too, but since I could not reproduce
the issue...)
Tests: unit(dev)
Piotr Sarna (4):
db,view: fix waiting for a view building future
db,view: remove unneeded implicit capture-by-reference
db,view: nitpick: change & operator to && for booleans
db,view: guard view builder startup with a semaphore
db/view/view.cc | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)
This removes the need to include reactor.hh, a source of compile
time bloat.
In some places, the call is qualified with seastar:: in order
to resolve ambiguities with a local name.
Includes are adjusted to make everything compile. We end up
having 14 translation units including reactor.hh, primarily for
deprecated things like reactor::at_exit().
Ref #1
This allows us to drop a #include <reactor.hh>, reducing compile time.
Several translation units that lost access to required declarations
are updated with the required includes (this can be an include of
reactor.hh itself, in case the translation unit that lost it got it
indirectly via logalloc.hh)
Ref #1.
The startup routine performs some bookkeeping operations on views,
and so do these events:
- on_create_view;
- on_drop_view;
- on_update_view.
Since the above events are guarded with a semaphore, the startup
routine should also take the same semaphore - in order to ensure
that all bookkeeping operations are serialized.
Refs #6094
The future was marked with a `FIXME: discarded future`, but there's really
no reason not to wait for it, and it was probably meant to be waited for
since its implementation.
When we switched token representation to int64_t
we added some sanity checks that byte representation
is always 8 bytes long.
It turns out that for token_kind::before_all_keys and
token_kind::after_all_keys bytes can sometimes be empty
because for those tokens they are just ignored. The check
introduced with the change is too strict and sometimes
throws the exception for tokens before/after all keys
created with empty bytes.
This patch relaxes the condition of the check and always
uses 0 as value of _data for special before/after all keys
tokens.
Fixes#6131
Tests: unit(dev, sct)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
* seastar 41c83ec...fd9af3a (7):
> stall_detector: Delete unused member variable
> future: Avoid a move in finally_body
> Merge "Followup cleanups for the apply/invoke split" from Rafael
> Merge "make trivial future related functions noexcept" from Benny
> rpc_test: silence depreceted lambda logger warning
> rpc_demo: stop using variadic futures
> future: Move two static_asserts to the top
Currently we call `on_internal_error()` if `tri_compare()` throws
`marshal_exception`. Some compare paths however might go around
`tri_compare()` and call `abstract_type::compare()` directly. Move the
check there to cover these cases too.
Tests: dev
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200403162530.1175801-1-bdenes@scylladb.com>
This patch makes makes major compaction aware of time buckets
for TWCS. That means that calling a major compaction with TWCS
will not bundle all SSTables together, but rather split them
based on their timestamps.
There are two motivations for this work:
1. Telling users not to ever major compact is easier said than
done: in practice due to a variety of circumstances it might
end up being done in which case data will have a hard time
expiring later.
2. We are about to start working with offstrategy compactions,
which are compactions that work in parallel with the main
compactions. In those cases we may be converting SSTables from
one format to another and it might be necessary to split a single
big STCS SSTable into something that TWCS expects
With the motivation out of the way, let's talk about the implementation:
The implementation is quite simple and builds upon the previous patches.
It simply specializes the interposer implementation for regular compaction
with a table-specific interposer.
Fixes#1431
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Our resharding code is complex, since the compaction object has to keep
track of many output SSTables, the current shard being written.
When implementing TWCS streaming writers, we ran away from such
write-side complexity by implementing an interposer: the interposer
consumes the flat_mutation_reader stream, creating many different writer
streams. We can do a similar thing for resharding SSTables and have each
writer be guaranteed to contain keys for only a specific source shard.
As we do that, we can move the SSTable and sstable_writer information
to the compacting_sstable_writer object. The compaction object will no
longer be responsible for it and can be simplified, paving the way for
TWCS-major, which will go through an interposer as well.
Note that the compaction_writer, which now holds both the SSTable
pointer and the sstable_writer still needs to be optional. This is
because LCS (and potentially others) still want to create more than
one SSTable per source stream. That is done to guarantee that each
SSTable complies with the max_sstable_size parameter, which is
information available in the sstable_writer that is not present at
the level of the flat_mutation_reader. We want to keep it in the writer
side.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The storage_proxy instances hold references to token_metadata ones and
leave unwaited futures continuing to its query_partition_key_range_concurrent
method.
The latter is called from do_query so it's not that easy to find
out who is leaking. Keep the tokens not freed for a while.
Fixes: #6093
Test: manual start-stop
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200402183538.9674-1-xemul@scylladb.com>
Speculative reader has more targets that needed for CL. In case there is
a digest mismatch the repair runs between all of them, but that violates
provided CL. The patch makes it so that repair runs only between
replicas that answered (there will be CL of them).
Fixes#6123
Reviewed-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20200402132245.GA21956@scylladb.com>
distcc doesn't like the -x c++ flag, so create an empty.cc file for
this purpose and compile it.
Also drop the "=" from "--include=", which is also disliked by
distcc.
Message-Id: <20200402124312.48963-1-avi@scylladb.com>
This is similar to the timestamp based splitting writer, except
that it splits data based on the shard where the partition key
is supposed to be placed.
It is similar to the multishard_writer, in the sense that it
creates n streams for n shards, but it does not want to process
the streams in the owner shards. We want to use that in processes
like resharding where it is fine for a foreign shard to deal
with a mutation.
One option would be to augment the multishard_writer to optionally
achieve these properties, but having a separate splitter is both
simpler and faster.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
I am about to introduce a new splitter. Therefore, move parts of it
that are common to its own file.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
I am about to get rid of the _shard attribute in the compaction object,
as I will create different streams of writers for different shards.
In preparation for that, remove the arbitrary _shard reference. Raphael
confirms that resharding should never be calling this, as this method is
used exclusively for garbage collection component of run-based
compaction. Therefore we'll just throw in this case and remove the shard
reference.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The shard parameter is ignored for SSTable creation on regular
compaction. It is still good practice and good future proofing
to pass something meaningful here instead of zero. This patch
passes the id of the current shard.
Thanks Botond for pointing that out.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20200402122212.12218-1-glauber@scylladb.com>
"
This patchseries is part of my effort to make resharding less special -
and hopefully less problematic. The next steps are a bit heavy, so I'd
like to, if possible, get this out of the way.
After these two patches, there is no more need to ever call
reshard_sstables: compact_sstables will do, and it will be able to
recognize resharding compactions.
To do that we need to unify the creator function, which is trivially
done by adding a shard parameter to regular compactions as well: they
can just ignore it. I have considered just making the
compaction_descriptor have a virtual create() function and specializing
it, but because we have to store the creator in the compaction object I
decided to keep the virtual function for now.
In a later cleanup step, if we can for instance store the entire
compaction_descriptor object in the compaction object we could do that.
Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Reviewed-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Reviewed-by: Botond Dénes <bdenes@scylladb.com>
Tests: unit tests (dev), dtest (resharding.py)
"
* 'resharding-through-compact-sstables' of github.com:glommer/scylla:
resharding: get rid of special reshard_sstables
compaction: enhance compaction_descriptor with creator and replace function
This reverts commit fdd2d9de3d because it
breaks one heat-weighted load balancing dtest:
FAIL: heat_weighted_load_balancing_cl_QUORUM_test (heat_weighted_load_balancing_test.HeatWeightedLB)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/penberg/src/scylla/scylla-dtest/heat_weighted_load_balancing_test.py", line 182, in heat_weighted_load_balancing_cl_QUORUM_test
self.run_heat_weighted_load_balancing('QUORUM')
File "/home/penberg/src/scylla/scylla-dtest/heat_weighted_load_balancing_test.py", line 165, in run_heat_weighted_load_balancing
self.verify_metrics(metrics, cached=False)
File "/home/penberg/src/scylla/scylla-dtest/heat_weighted_load_balancing_test.py", line 73, in verify_metrics
mean_avg, node_mean_avg, key))
AssertionError: 19.0 not found in range(3, 13) : Cache difference between nodes is less then expected: 6469.6/328.2, metric scylla_storage_proxy_coordinator_reads_local_node
I am reverting because it's a test issue, and we should bring this
commit back once the test is fixed.
Gleb Natapov explains:
"dtest result directly depends on replicas we contact. Glauber's patch
make us contacts less replicas, so numbers differ."
The "run" script for the Alternator tests needs to set a system table for
authentication credentials, so we can test this feature.
So far we did this with cqlsh, but cqlsh isn't always installed on build
machines. But install-dependencies.sh already installs the Cassandra driver
for Python, so it makes more sense to use that, so this patch switches to
use it.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200331131522.28056-1-nyh@scylladb.com>
To run Alternator tests, only two additional dependencies need to be added to
install-dependencies.sh: pytest, and python3-boto3. We also need
python3-cassandra-driver, but this dependency is already listed.
This patch only updates the dependencies for Fedora, which is what we need
for dbuild and our Jenkins setups.
Tested by building a new dbuild docker image and verifying that the
Alternator tests pass.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
[avi: update toolchain image; note this upgrades gcc to 9.3.1]
Message-Id: <20200330181128.18582-1-nyh@scylladb.com>
In order to be extra sure that we always generate proper
base partition/clustering keys from paging info when executing
an indexed query, additional checks are added - if any of them
triggers, an exception will be thrown.
Created in order to help debug an existing issue:
Refs #5856
Tests: unit(dev)
Due to c&p error cas_now_pruning counter is increased instead of
decreased after an operation completes. Fix it.
Fixes#6116
Message-Id: <20200401142859.GA16953@scylladb.com>
Always enable lightweight transactions. Remove the check for the command
line switch from the feature service, assuming LWT is always enabled.
Remove the check for LWT from Alternator.
Note that in order for the cluster to work with LWT, all nodes need
to support it.
Rename LWT to UNUSED in db/config.hh, to keep accepting lwt keyword in
--experimental-features command line option, but do nothing with it.
Changes in v2:
* remove enable_lwt feature flag, it's always there
Closes#6102
test: unit (dev, debug)
Message-Id: <20200401071149.41921-1-kostja@scylladb.com>
During bootstrap and replace operations the node can't take reads and
we'd like to see the process ending ASAP. This is because until the
process ends, we keep having to duplicate writes to an extended set. Not
to mention, in the case of a cluster expansion users want to use the
added capacity sooner rather than later.
Streaming generates a lot of compaction activity, that competes with the
bootstrap itself, slowing it down.
Long term, we are moving to treat those compactions differently and
maybe postpone them altogether. However for now we can reduce the amount
of compactions by increasing the minimum threshold of SSTables that have
to accumulate before they are selected for compactions. The default is
2, meaning we will trigger a compaction every time 2 SSTables of about
the same size are found (for STCS, others follow a similar pattern).
Until we have offstrategy infrastructure we don't want the compactions
to stop happening altogether so the reads, when they start, don't
suffer. This patch sets the minimum threshold to 16 (for the default
max_threshold of 32), meaning we will generate a lot less compaction
activity during streaming. Once streaming is done we revert it to its
original.
Unfortunately there isn't much we can do at the moment about decommission.
During decommission the nodes receiving data are also taking reads and
we don't want SSTables to accumulate.
Fixes#5109
Signed-off-by: Glauber Costa <glauber@scylladb.com>
dc_local_read_repair_chance is a legacy of old times: Cassandra itself
now defaults to zero, and we should look into that too.
Most serious production clusters are either repaired through our
asynchronous repair, or don't need repair at all.
Synchronous read repair can help things converging, but it implies an
impact at query time. For clusters that are on an asynchronous repair
schedule this should not be needed.
Fixes#6109
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20200331183418.21452-1-glauber@scylladb.com>
There is a method, reshard_sstables(), whose sole purpose is to call a
resharding compaction. There is nothing special about this method: all
the information it needs is now present in the compaction_descriptor.
This patch extend the compaction_options class to recognize resharding
compactions as well, and uses that so that make_compaction() can also
create resharding compactions.
To make that happen we have to create a compaction_descriptor object in
the resharding method. Note however that resharding works by passing an
object very close to the compaction_descriptor around. Once this patch
is merged, a logical next step is to reuse it, and avoid creating the
descriptor right before calling compact_sstables().
Signed-off-by: Glauber Costa <glauber@scylladb.com>
There are many differences between resharding and compaction that are
artificial, arising more from the way we ended up implementing it than
necessity. This patch attempts to pass the creator and replacer functions
through the compaction_descriptor.
There is a difference between the creator function for resharding and
regular compaction: resharding has to pass the shard number on behalf
of which the SSTable is created. However regular compactions can just
ignore this. No need to have a special path just for this.
After this is done, the constructor for the compaction object can be
greatly simplified. In further patches I intend to simplify it a bit
further, but some more cleanup has to happen first.
To make that happen we have to construct a compaction_descriptor object
inside the resharding function. This is temporary: resharding currently
works with a descriptor, but at some point that descriptor is lost and
broken into pieces to be passed to this function. The overarching goal
of this work is exactly to be able to keep that descriptor for as long
as possible, which should simplify things a lot.
Callers are patched, but there are plenty for sstable_datafile_test.cc.
For their benefit, a helper function is provided to keep the previous
signature (test only).
Signed-off-by: Glauber Costa <glauber@scylladb.com>
"
Currently, both sharding and partitioning logic is encapsulated into partitioners. This is not desirable because these two concepts are totally independent and shouldn't be coupled together in such a way.
This PR separates sharding and partitioning. Partitioning will still live in i_partitioner class and its subclasses. Sharding is extracted to a new class called sharding_info. Both partitioners and sharding_info are still managed by schema class. Partitioner can be accessed with schema::get_partitioner while sharding_info can be accessed with schema::get_sharding_info.
The transition is done in steps:
1. sharding_info class is defined and all the sharding logic is extracted from partitioner to the new class. Temporarily sharding_info is still embedded into i_partitioner and all sharding related functions in i_partitioner call delegate to the embedded sharding_info object.
2. All calls to i_partitioner functions that are related to sharding are gradually switched to calls to sharding_info equivalents. sharding_info.
3. Once everything uses sharding_info, all sharding logic is dropped from i_partitioner.
Tests: unit(dev, release)
"
* haaawk-sharding_info: (32 commits)
dummy_sharder: rename dummy_sharding_info.* to dummy_sharder.*
sharding_info: rename the class to sharder
i_partitioner:remove embeded sharding_info
i_partitioner: remove unused get_sharding_info
schema: remove incorrect comment
schema: make it possible to set sharding_info per schema
i_partitioner: remove unused shard_count
multishard_writer: stop calling i_partitioner::shard_count
i_partitioner: remove sharding_ignore_msb
partitioner_test: test ranges and sharding_infos
i_partitioner: remove unused split_ranges_to_shards
i_partitioner: remove unused shard_of function
sstable-utils: use sharding_info::shard_of
create_token_range_from_keys: use sharding info for shard_of
multishard_mutation_query_test: use sharding info for shard_of
distribute_reader_and_consume_on_shards: use sharding_info::shard_of
multishard_mutation_query: use sharding_info::shard_of
dht::shard_of: use schema::get_sharding_info
i_partitioner: remove unused token_for_next_shard
split_range_to_single_shard: use sharding info instead of partitioner
...
Unfortunately, the boto3 library doen't allow us to check some of the
input error cases because it unnecessarily tests its input instead of
just passing it to Alternator and allowing Alternator to report the error.
In this patch we comment out a test case which used to work fine - i.e.,
the error was reported by Alternator - until recent changes to boto3
made it catch the problem without passing it to Alternator :-(
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200330190521.19526-2-nyh@scylladb.com>
One of the Alternator tests in test_tag.py checks the feature of creating
a table with a set of tags (as opposed to adding tags to an existing table).
This is a relatively new DynamoDB feature, only added in April 2019, so if
the botocore library is too old, it cannot test this feature, and we have to
skip the test.
Alternator developers should make an effort to keep the botocore library
up-to-date and test the latest DynamoDB features, but it is less important
if some test environments (like Jenkins) cannot verify this specific test
until its distro gets updated - it is more important that the fast majority
of the tests, which do not rely on very new features, get tested.
After this patch, if running on Fedora 30 with
python3-botocore-1.12.101-2.fc30.noarch installed, we get the following
skip message:
$ pytest-3 -rs test_tag.py
...
test_tag.py ..s..x [100%]
=================================================== short test summary info ===================================================
SKIP [1] /home/nyh/scylla/test/alternator/test_tag.py:114: Botocore version 1.12.136 or above required to run this test
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200330190521.19526-1-nyh@scylladb.com>
The learning stage of PAXOS protocol leaves behind an entry in
system.paxos table with the last learned value (which can be large). In
case not all participants learned it successfully next round on the same
key may complete the learning using this info. But if all nodes learned
the value the entry does not serve useful purpose any longer.
The patch adds another round, "prune", which is executed in background
(limited to 1000 simultaneous instances) and removes the entry in
case all nodes replied successfully to the "learn" round. It uses the
ballot's timestamp to do the deletion, so not to interfere with the
next round. Since deletion happens very close to previous writes it will
likely happen in memtable and will never reach sstable, so that reduces
memtable flush and compaction overhead.
Fixes#5779
Message-Id: <20200330154853.GA31074@scylladb.com>
Previously schema::get_sharding_info was obtaining
sharding_info from the partitioner but we want to remove
sharding_info from the partitioner so we need a place
in schema to store it there instead.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Previous patches have switched all the calls to
i_partitioner::shard_count to sharding_info::shard_count
and this function can now be removed.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Every place that has previously called this method is now
using sharding_info::sharding_ignore_msb and this function
can be removed.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Turn test_something_with_some_interesting_ranges_and_partitioners
into test_something_with_some_interesting_ranges_and_sharding_info.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Previous patches switched all the places that called
i_partitioner::shard_of to use sharding_info::shard_of
so i_partitioner::shard_of is no longer used and can
be removed.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Create sharding_info with the same parameters as
the partitioner and use it instead of the partitioner.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This patch replaces all the uses of i_partitioner:shard_of
with sharding_info::shard_of in read_context.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Previous patches have switched all the places that was
using i_partitioner::token_for_next_shard to
sharding_info::token_for_next_shard. Now the function
can be removed from i_partitioner.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
The function relies only on i_partitioner::shard_count
and i_partitioner::token_fon_next_shard. Both are really implemented
in sharding_info so the method can use them directly.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
After previous patches that switched some tests to use sharding_info
instead of i_partitioner, we now don't need with_partitioner_for_tests_only
and the function can be removed.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
dummy_partitioner was renamed to dummy_sharding_info in
the previous patch. This patch cleans up the names of
files. It's done in a separate patch to not obstruct
the diff of previous patch.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Previously this function was accessing sharding logic
through partitioner obtained from the schema.
While converting tests, dummy_partitioner is turned into
dummy_sharding_info.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
sync_schema is supposed to make sure that this node knows about all
schema changes known by "nodes" that were made prior to this call.
Currently, when a node is down, the sync is sliently skipped.
To fix, add a flag to migration_task::run_may_throw to indicate that it
should fail if a node is down.
Fixes#4791
The value that is stored in "in_progress_ballot" cell is the value of
promised ballot, so call the cell accordingly to avoid confusion
especially as we have a notion of "in progress" proposal in the code
which is not the same as in_progress_ballot here.
We can still do it without care about backwards compatibility since LWT
is still marked as experimental.
Fixes#6087.
Message-Id: <20200326095758.GA10219@scylladb.com>
partitioner_test contains test_partitioner_sharding function
which this patch renames to test_sharding and makes it
use sharding_info instead of the partitioner.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
repair does not use partitioner and only uses sharding logic.
This means it does not have to depend on i_partitioner and can
instead operate on sharding_info.
This has an important consequence of allowing the repair of
multiple tables having different partitioners at the same time.
All tables repaired together still have to use the same
sharding logic.
To achieve this the change:
1. Removes partitioner field from repair_info
2. repair_info has access to sharding_info through schema
objects of repaired tables
3. partitioner name is removed from shard_config
4. local and remote partitioners are removed from repair_meta.
Remote sharding_info is used instead.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This method is not implemented anywhere not to mention the usage.
It is the only resonable thing to remove it instead of keeping
an unused and unimplemented declaration.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
The class does not depend on partitioning logic but only uses
sharding logic. This means it is possible and desirable to limit its
dependency to only sharding_info.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
ring_position_range_vector_sharder does not depend on partitioning logic.
It only uses sharding logic so it is not necessary to store i_partitioner
in the class. Reference to sharding_info is enough.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
ring_position_range_sharder does not depend on partitioning at all.
It only uses sharding so it is enough for the class to take sharding_info
instead of a whole i_partitioner. This patch changes ring_position_range_sharder
class to contain const sharding_info& instead of const i_partitioner&.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
At the moment, we have a single sharding logic per node
but we want to be able to set it per table in the future.
To make it easy to change in the future sharding_info
will be managed inside schema and all the other code
will access it through schema::get_sharding_info function.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This patch creates a new class called sharding_info.
This new class will now be responsible for all
the sharding logic that before was a part of the partitioner.
In the end, sharding and partitioning logic will be fully
separated but this patch starts with just extracting sharding
logic to sharding_info and embedding it into i_partitioner class.
All sharding functions are still present in i_partitioner but now
they just delegate to the corresponding functions of the embedded
sharding_info object.
Following patches will gradually switch all uses of the following
i_partitioner member functions to their equivalents in sharding_info:
1. shard_of
2. token_for_next_shard
3. sharding_ignore_msb
4. shard_count
After that, sharding_info will be removed from i_partitioner and
the two classes will be totally independent.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
The user migration_manager::submit_migration_task needs to know if
migration_task::run_may_throw is successful or not.
Do not swallow exception.
Fixes#4791
Make the constructor out-of-line and clean up includes made redundant.
This removes an include of Seastar's heavy reactor.hh from a header.
Ref #1
Message-Id: <20200329173711.16949-1-avi@scylladb.com>
* seastar c7b6b84e5...06a8c8f6e (12):
> scheduling_group_specific: remove inclusion of reactor.hh
> future: Delete void_futurize_helper
> future: Delete unused do_void_futurize_helper instantiation
> core: remove io_queue queued requests metric
> future: Add assert to set_urgent_state
> future: Add a comment to set_urgent_state
> future: Use placement new instead of operator= in set_urgent_state
> file: use correct io_queue in dup()d files
> io_queue: fix miscalculation of sizes when I/O queue is not configured.
> merge: Add log levels to RPC loggers
> reactor: Replace a call to cpu_id with this_shard_id()
> reactor: Drop a few redundant calls to engine()
The column-family is already looked up as the first line in the method.
No need to repeat that lookup in the lambda passed to
`run_when_memory_available()`, we can just capture the reference to the
already obtained column-family object. These objects are safe to
reference, they don't just disappear in the middle of an operation.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200327140827.128647-1-bdenes@scylladb.com>
Consider 3 nodes in the cluster, n1, n2, n3 with gossip generation
number g1, g2, g3.
n1, n2, n3 running scylla version with commit
0a52ecb6df (gossip: Fix max generation
drift measure)
One year later, user wants the upgrade n1,n2,n3 to a new version
when n3 does a rolling restart with a new version, n3 will use a
generation number g3'. Because g3' - g2 > MAX_GENERATION_DIFFERENCE and
g3' - g1 > MAX_GENERATION_DIFFERENCE, so g1 and g2 will reject n3's
gossip update and mark g3 as down.
Such unnecessary marking of node down can cause availability issues.
For example:
DC1: n1, n2
DC2: n3, n4
When n3 and n4 restart, n1 and n2 will mark n3 and n4 as down, which
causes the whole DC2 to be unavailable.
To fix, we can start the node with a gossip generation within
MAX_GENERATION_DIFFERENCE difference for the new node.
Once all the nodes run the version with commit
0a52ecb6df, the option is no logger
needed.
Fixes#5164
Most of Scylla code runs with a user-supplied query timeout, expressed as
absolute clock (deadline). When injecting test sleeps into such code, we most
often want to not sleep beyond the user supplied deadline. Extend error
injection API to optionally accept a deadline, and, if it is provided,
sleep no more than up to the deadline. If current time is beyond deadline,
sleep injection is skipped altogether.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Message-Id: <20200326091600.1037717-2-alejo.sanchez@scylladb.com>
The previous patch made the LA format the default. We no longer need to
choose between writing the older KA format or LA, so the LA_SSTABLE
cluster feature has became unnecessary.
Unfortunately, we cannot completely remove this feature: Since commit
4f3ce42163 we cannot remove cluster features
because this node will refuse to join a cluster which already agreed on
features that it lacks - thinking it is an old node trying to join a
new cluster.
So the LA_SSTABLE feature flag remains, and we continue to advertise
that our node supports it. We just no longer care about what other
nodes advertised for it, so we can remove a bit of code that cared.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200324232607.4215-3-nyh@scylladb.com>
Over the years, Scylla updated the sstable format from the KA format to
the LA format, and most recently to the MC format. On a mixed cluster -
as occurs during a rolling upgrade - we want all the nodes, even new ones,
to write sstables in the format preferred by the old version. The thinking
is that if the upgrade fails, and we want to downgrade all nodes back to
the older version, we don't want to lose data because we already have
too-new sstables.
So the current code starts by selecting the oldest format we ever had - KA,
and only switching this choice to LA and MC after we verify that all the
nodes in the cluster support these newer formats.
But before an agreement is reached on the new format, sstables may already
be created in the antique KA format. This is usually harmless - we can
read this format just fine. However, the KA format has a problem that it is
unable to represent table names or keyspaces with the "-" character in them,
because this character is used to separate the keyspace and table names in
the file name. For CQL, a "-" is not allowed anyway in keyspace or table
names; But for Alternator, this character is allowed - and if a KA table
happens to be created by accident (before the LA or MC formats are chosen),
it cannot be read again during boot, and Scylla cannot reboot.
The solution that this patch takes is to change Scylla's default sstable
format to LA (and, as before, if the entire cluster agrees, the newer MC
format will be used). From now on, new KA tables will never be written.
But we still fully support *reading* the KA format - this is important in
case some very old sstables never underwent compaction.
The old code had, confusingly, two places where the default KA format
was chosen. This patch fixes is so the new default (LA) is specified in
only one place.
Fixes#6071.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200324232607.4215-2-nyh@scylladb.com>
* seastar 92c488706...c7b6b84e5 (6):
> semaphore: Use futurize_invoke instead of futurize_apply
> future: specify futurize::make_exception_future as noexcept
> future: Move ignore out of line
> future: Split then and then_impl to enable NRVO
> semaphore_units: allow getting the number of units held
> Merge "Split futurize::apply into invoke(...) and apply(tuple)" from Rafael
sync_schema is supposed to make sure that this node knows about all
schema changes known by "nodes" that were made prior to this call.
Currently, when a node is down, the sync is sliently skipped.
To fix, add a flag to migration_task::run_may_throw to indicate that it
should fail if a node is down.
Fixes#4791
Fixes#6073
The logic with pre/post image was tangled into looking at "rs"
and would cause pre-image info to be stored even if only post-image
data was enabled.
Now only generate keys (and rows for them) iff explicitly enabled.
And only generate pre-image key iff we have pre-image data.
Fixes#6070
When mutation splitting was added, non-atomic column assignments were broken
into two invocation of transform. This means the second (actual data assignment)
does not know about the tombstone in first one -> postimage is created as if
we were _adding_ to the collection, not replacing it.
While not pretty, we can handle this knowing that we always get
invoked in timestamp order -> tombstone first, then assign.
So we simply keep track of non-atomic columns deleted across calls
and filter out preimage data post this.
Added test cases for all non-atomics
Now that #3158 is fixed, we can move the consumer to its place after
the `compaction_mutation_state::start_new_page()` call. No need to keep
it as `std::unique_ptr<>`.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200310185147.207665-1-bdenes@scylladb.com>
"
This small series consists of several changes that aim to
reduce the number of shared_ptr's in cql3 code.
Also it contains a patch that makes CqlParser::query to return
std::unique_ptr<> instead of seastar::shared_ptr<>, which leads
to more understandable code and lays foundation for further
optimizations (e.g. possibly eliminating shared_ptr's in
`prepared_statement` and just moving raw statements in `prepare`
without copying them).
Tests: unit(dev, debug)
"
* 'feature/cql_cleanups_9' of https://github.com/ManManson/scylla:
cql3: return raw::parsed_statement as unique_ptr
cql3: de-pointerize arguments to some of CQL grammar rules and definitions.
cql3: make abstract_marker::make_in_receiver accept cref to column_specification
Fixes#5899
When terminating (closing) a segment, we write a trailing block
of zero so reader can have an empty region after last used chunk
as end marker. This is due to using recycled, pre-allocated
segments with potentially non-zero data extending over the point
where we are ending the segment (i.e. we are not fully filling
the segment due to a huge mutation or similar).
However, if we reach end of segment writing the final block
(typically many small mutations), the file will end naturally
after the data written, and any trailing zero block would in fact
just extend the file further. While this will only happen once per
segment recycled (independent on how many times it is recycled),
it is still both slightly breaking the disk usage contract and
also potentially causing some disk stalls due to metadata changes
(though of course very infrequent).
We should only write trailing zero if we are below the max_size
file size when terminating
Adds a small size check to commitlog test to verify size bounds.
(Which breaks without the patch)
v2:
- Fix test to take into account that files might be deleted
behind our backs.
v3:
- Fix test better, by doing verification _before_ segments are
queued for delete.
Message-Id: <20200226121601.15347-2-calle@scylladb.com>
Message-Id: <20200324100235.23982-1-calle@scylladb.com>
Change CQL parsing routine to return std::unique_ptr
instead of seastar::shared_ptr.
This can help reduce redundant shared_ptr copies even further.
Make some supplementary changes necessary for this transition:
* Remove enabled_shared_from_this base class from the following
classes: truncate_statement, authorization_statement,
authentication_statement: these were previously constructing
prepared_statement instance in `prepare` method using
`shared_from_this`.
Make `prepare` methods implementation of inheriting classes
mirror implementation from other statements (i.e.
create a shallow copy of the object when prepairing into
`prepared_statement`; this could be further refactored
to avoid copies as much as possible).
* Remove unused fields in create_role_statement which led to
error while using compiler-generated copy ctor (copying
uninitialied bool values via ctor).
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Make the following rules and definitions accept a reference
instead of shared_ptr's:
* cfamDefinition
* cfamColumns
* pkDef
* typeColumns
* ksName
* cfName
* idxName
* properties
* property
This will reduce a bit the number of countless shared_ptr copies
and moves all over the place in cql3 code.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
These methods just extract some info out of
column_specification, so no need have another copy of
shared_ptr since it's not stored anywhere inside.
Transform abstract_marker::in_raw::make_in_receiver as well
following the call chain.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Merged pull request https://github.com/scylladb/scylla/pull/6030 from
Piotr Dulikowski:
Adds CDC-related metrics.
Following counters are added, both for total and failed operations:
Total number of CDC operations that did/did not perform splitting,
Total number of CDC operations that touched a particular mutation part.
Total number of preimage selects.
Fixes#6002.
Tests: unit(dev, debug)
* 'cdc-metrics' of github.com:piodul/scylla:
storage_proxy: track CDC operations in LWT flow
storage_proxy: track CDC operations in logged batches
storage_proxy: track CDC operations in standard flow
storage_proxy: add cdc tracker hooks to write response handlers
storage_proxy: move "else if" remainder into "else" block
cdc: create an operation_result_tracker object
cdc: add an object for tracking progress of cdc mutations
cdc: count touched mutation parts in transformer::transform
cdc: track preimage selects in metrics
cdc: register metric counters
cdc: fix non-atomic updates in splitting
Compaction automatically adds gc grace period to expiry times already,
no need to add it when creating the tombstones. Remove the redundant
additions form the code. The direct impact is really minor as this is
only used in tests, but it might confuse readers who are looking at how
tombstones are created across the codebase.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200323120948.92104-1-bdenes@scylladb.com>
Adds a field to abstract_write_response_handler that points to the cdc
operation result tracker, and a function for registering the tracker in
the handlers that currently write to a CDC log table.
CDC metrics, apart from tracking "total" metrics for all performed CDC
operations, also track metrics for "failed" operations. Because the
result of the CDC operation depends on whether all CDC mutations were
written successfully by storage_proxy, checking for failure and
incrementing appropriate counters is deferred after all write response
handlers finish.
The `cdc::operation_result_tracker` object was created for that purpose.
It contains all the details needed to accurately update the metrics
based on what actually happened in the `augment_mutation_call` function,
and holds a flag which tells if any of write response handlers failed.
This object is supposed to be referenced by write response handlers for
CDC mutations created after the same `augment_mutation_call`. After all
write response handlers are destroyed, the destructor of
`operation_result_tracker` will update appropriate metrics.
Actual creating and attaching this object to write response handlers
will be done in subsequent commits.
Modifies the transformer::transform so that it also returns a set of
flags indicating what parts of the mutation (e.g. rows, tombstones,
collections, etc.) were processed during transforming.
This patch defines a CDC metrics object and registers all of its
counters.
storage_proxy is chosen as the owner of the metrics object. Because in
subsequent commits it will become possible for CDC metrics to be updated
after a write operation ends, and because the cdc_service has shorter
lifetime than storage_proxy, we could risk a use-after-free if we placed
this object inside cdc_service.
This patch fixes a bug in mutation splitting logic of CDC. In the part
that handles updates of non-atomic clustering columns, the column
definition was fetched from a static column of the same id instead of
the actual definition of the clustering column. It could cause the value
to be written to a wrong column.
Tests: unit(dev)
This implements support for triggering major compations through the REST
API. Please note that "split_output" is not supported and Glauber Costa
confirmed this this is fine:
"We don't support splits, nor do I think we should."
Signed-off-by: Ivan Prisyazhnyy <ivan@scylladb.com>
"
Make sure all headers compile on their own, without requiring any
additional includes externally.
Even though this requirement is not documented in our coding guides it
is still quasi enforced and we semi-regularly get and merge patches
adding missing includes to headers.
This patch-set fixes all headers and adds a `{mode}-headers` target that
can be used to verify each header. This target should be built by
promotion to ensure no new non-conforming code sneaks in.
Individual headers can be verified using the
`build/dev/path/to/header.hh.o` target, that is generated for every
header.
The majority of the headers was just missing `seastarx.hh`. I think we
should just include this via a compiler flag to remove the noise from
our code (in a followup).
"
* 'compiling-headers/v2' of https://github.com/denesb/scylla:
configure.py: add {mode}-headers phony target
treewide: add missing headers and/or forward declarations
test/boost/sstable_test.hh: move generic stuff to test/lib/sstable_utils.hh
sstables: size_tiered_backlog_tracker: move methods out-of-line
sstables: date_tiered_compaction_strategy.hh: move methods out-of-line
* seastar 3c498abcab...92c488706c (14):
> dpdk: restore including reactor.hh
> tests: distributed_test: add missing #include <mutex>
> reactor: un-static-ify make_pollfn()
> merge: Reduce inclusions of reactor.hh
A few #includes added to compensate for this
> sharded: delete move constructor
> future: Avoid a move constructor call
> future: Erase types a bit more in then_wrapped
> memory: Drop a never nullopt optional
> semaphore: specify get_units and with_semaphore as noexcept
> spinlock.hh: Add include for <cassert> header
> dpdk: Avoid a variable sized array
> future: Add an explicit promise member to continuation
> net: remove smart pointer wrappers around pollable_fd
> Merge "cleanup reactor file functions" from Benny
This patch fixes a bug in mutation splitting logic of CDC. In the part
that handles updates of non-atomic clustering columns, the schema for
serializing that column was looked up incorrectly in the table schema -
instead of a `regular_column`, a `static_column` was looked up.
Due to how the `column_at` function works, a correct schema was always
returned if the table had no static columns. Therefore, in order for
this bug to manifest, a table with a static column and a regular column
with non-atomic collection was needed.
Closes#3295
The error_injection class allows injecting custom handlers into normal control
flow at the pre-determined injection points.
This is especially useful in various testing scenarios:
* Throwing an exception at some rare and extreme corner-cases
* Injecting a delay to test for timeouts to be handled correctly
* More advanced uses with custom lambda as an injection handler
Injection points are defined by `inject` calls.
Enabling and disabling injections are done by the corresponding
`enable` and `disable` calls.
REST frontend APIs is provided for convenience.
Branch URL: https://github.com/alecco/scylla/tree/as_error_injection
Tests: unit {{dev}}, unit {{debug}}
* 'as_error_injection' of github.com:alecco/scylla:
api: add error injection to REST API
utils: add error injection
sstable_test.hh started as collection of utilities shared between the
various `_sstable_test.cc` files. Predictably other tests started using
it as well, among them some that are non boost unit tests. This poses a
problem as if we add the missing boost/test/unit_test.hpp include to
sstable_test.hh these tests will suddenly have missing symbols from
boost::test. To avoid linking boost::test into all these users, extract
utilities more widely used into sstable_utils.hh
We now have a utils file for SSTables. This is potentially useful for
other tests.
As a matter of fact, this function is repeated right now for the
resharding test. And to add insult to injury, the version in the
resharding test has the parameters shard and number of tokens flipped,
which although extremely confusing is the predictable outcome of
such repetition
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The following error was seen in
materialized_views_test.py:TestMaterializedViews.decommission_node_during_mv_insert_4_nodes_test
INFO [shard 0] repair - repair id 3 to sync data for
keyspace=ks, status=started repair/repair.cc:662:36: runtime error: member call
on null pointer of type 'const struct schema'
Aborting on shard 0.
The problem is in the test a keyspace was created without creating any
table. Since db19a76b1f(selective_token_range_sharder: stop calling
global_partitioner()), in get_partitioner_for_tables, we access nullptr
when no table is present.
schema_ptr last_s;
for (auto t: tables) {
// set last_s
}
last_s->get_partitione()
To fix:
1) Skip the repair in sync_data_using_repair if there is no table in the keyspace
2) Throw if no schema_ptr is found in get_partitioner_for_tables. Be defensive.
After:
INFO [shard 0] repair - decommission_with_repair: started with keyspace=ks, leaving_node=127.0.0.2, nr_ranges=744
INFO [shard 0] repair - repair id 3 to sync data for keyspace=ks, status=started
WARN [shard 0] repair - repair id 3 to sync data for keyspace=ks, no table in this keyspace
INFO [shard 0] repair - repair id 3 completed successfully
INFO [shard 0] repair - repair id 3 to sync data for keyspace=ks, status=succeeded
Tests: materialized_views_test.py:TestMaterializedViews.decommission_node_during_mv_insert_4_nodes_test
Fixes: #6022
repair: Ignore keyspace that is removed in sync_data_using_repair
When a keyspace is removed during node operations, we should not fail
the whole operation. Ignore the keyspace that is removed.
Fixes#5942
* asias-repair_fix_5942:
repair: Stop the nodes that have run repair_row_level_start
repair: Ignore keyspace that is removed in sync_data_using_repair
Seems like adduser in redhat variants and deiban variants are incompatible,
and there is no addgroup in redhat variants.
Since adduser in install.sh is implemented on debian variants, does not work on redhat compatible.
To fix this we need to use 'useradd' / 'groupadd' instead.
Fixes#6018
"
In debug mode some tests take veeery looong time to finish,
those tests are better to be started first. This set adds
this by marking such long tests in suite.yaml files.
Tests: unit(dev)
"
* 'br-split-unit-tests-sorting-2' of https://github.com/xemul/scylla:
test.py: Mark some tests as "run_first"
test.py: Generate list with short names
test.py: Rename "long" to "skip_in_debug_mode"
Primary key and clustering key column should not have a corresponding
"cdc$deleted_<name>" column in cdc log table, because it does not make
sense to delete such a column from a row.
Fixes: #6049
Tests: unit(dev)
This reverts commit 0b34d88957. According
to Rafael Avila de Espindola:
"I have bisected the recent failures [in commitlog_test] on next to this
patch."
This reverts commit 458ef4bb06. According
to Glauber Costa:
"It may give us the illusion that fixes something for a particular case
but this fix is wrong.
I am trying to help Raphael figure out why the backlog is wrong but
this patch is not the answer."
Simple REST API for error injection is implemented.
The API allow the following operations:
* injecting an error at given injection name
* listing injections
* disabling an injection
* disabling all injections
Currently the API enables/disables on all shards.
Closes#3295
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Error injection class is implemented in order to allow injecting
various errors (exceptions, stalls, etc.) in code for testing
purposes.
Error injection is enabled via compile flag
SCYLLA_ENABLE_ERROR_INJECTION
TODO: manage shard instances
Enable error injection in debug/dev/sanitize modes.
Unit tests for error injection class.
Closes#3295
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Running the Alternator tests is easy after you manually run Scylla, but
sometimes it's convenient to have a script which just does everything
automatically: start Scylla in a temporary directory, set it up properly
for the tests (especially the authentication), run all the tests, and remove
the temporary directory. This is what this alternator-tests/run script does.
This script can be run by Jenkins, for example, to check all the Alternator
tests. The script assumes some things (including cqlsh, pytest and the boto3
library) are already installed, and that Scylla has been compiled - by
default it takes the latest built build/*/scylla, but this can be overridden
by a command like
SCYLLA=build/release/scylla alternator-test/run
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200311091918.16170-1-nyh@scylladb.com>
Currently the initial vertice of the graph is resolved in both
`_traverse_object_graph_breadth_first()` and its caller
`_do_generate_object_graph()`. This is redundant, so remove the
resolving in the latter.
Currently, for edges, only the "from" offset is printed, that is the
offset of the reference in the originating object. Now that we also scan
the non-first word of objects for references to them, we can have
reference pointing to the non-first word of objects. To make these
apparent, also print the "to" offset on edges, that is the offset into
the target object where the reference point to. So now edges have tuple
labels: (from, to).
Merged patch series from Piotr Sarna:
This series hooks alernator to admission control, similarly to how
CQL server uses it. The estimated memory consumption is set to 2x
raw JSON request, since that seems to be the upper limit of
how much more memory rapidjson allocates during parsing.
Note, that since Seastar HTTP currently reads the whole contents
upfront, there's no easy way to apply admission control before reading
the request - that would involve some changes to our HTTP API.
Note 2: currently, admission control in CQL does not properly pass
memory consumption information for requests that are bounced
to another shard - that would require either transferring semaphore units
between shards or keeping a foreign pointer to the original units.
As a result, alternator also does not pass correct admission control
info between shards, and all places in code which do that are marked
with clear FIXMEs.
Fixes#5029
Piotr Sarna (5):
storage_service: add memory limiter semaphore getter
alternator: add service permit to callbacks
alternator: add memory limiter to alternator server
alternator: add addmission control stats entry
alternator: hook admission control to alternator server
alternator/executor.cc | 113 ++++++++++++++++++++++--------------
alternator/executor.hh | 32 +++++-----
alternator/rmw_operation.hh | 1 +
alternator/server.cc | 83 +++++++++++++++-----------
alternator/server.hh | 8 ++-
alternator/stats.cc | 2 +
alternator/stats.hh | 1 +
main.cc | 3 +-
service/storage_service.hh | 4 ++
9 files changed, 149 insertions(+), 98 deletions(-)
Before this patch, when db/view/view.hh was modified, 89 source files had to
be recompiled. After this patch, this number is down to 5.
Most of the irrelevant source files got view.hh by including database.hh,
which included view.hh just for the definition of statistics. So in this
patch we split the view statistics to a separate header file, view_stats.hh,
and database.hh only includes that. A few source files which included
only database.hh and also needed view.hh (for materialized-view related
functions) now need to include view.hh explicitly.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200319121031.540-1-nyh@scylladb.com>
When looking for references to an object in the graph, look for
references to any part of the object, using `scylla_find.find()`:s new
`value_range` parameter.
This way, the graph can be extended beyond objects that are members of
an intrusive containers, or just generally don't have any references to
their very first byte.
Allow the user to specify a value-range different than the size of the
object. This is useful if it is known that references to the object will
point to the first N bytes.
One of the most common use-cases of find is finding references to an
object. This works great for normal objects, however not for all of
them, a prominent example being objects that are members of an intrusive
collections. These objects will have pointers to them that don't point
to their first byte, instead they point to somewhere in the middle of
the object. To help find such references, find now supports searching
for a range of values. If the new `--value-range` option is used, it
will start searching for the value itself, and if no usages are found it
will increment it with the specified size-class, and search again. This
is repeated until some usages are found or the range is depleted.
`scylla_find.find()` now returns the offset to the value, of which
usages were found. Alternatively one can scan the entire value-range
using the `--find-all` option. When this is used, `scylla_find` will not
stop on the first offset for which references are found.
find_in_live() currently parses back the output of `scylla ptr`, to
return the address to the beginning of the object and the offset. All
its current callers do the call to `scylla ptr` again to obtain further
information about the object. To avoid this duplicated effort, return
`pointer_metadata` instances from `find_in_live()`, obtained via
`scylla_ptr.analyze()` which is the python API to `scylla ptr`.
The `result_callback` was a callback returned by `augment_mutation_call`
that was supposed to be used in the CDC postimage implementation.
Because CDC postimage was implemented without using this callback, and
currently a no-op function is always returned, this callback can safely
be removed.
Those tests take long time to finish, so it makes sense to start
them earlier than others.
The provided list of long tests consists of those running more
than 10 minutes in debug mode.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It returns a future, so converting an exception to an exceptional
future simplifies error handling in the caller.
Without this code like the one in
standard_role_manager::create_metadata_tables_if_missing has a
surprising behavior:
return when_all_succeed(
create_metadata_table_if_missing(...),
create_metadata_table_if_missing(...));
Since it might not wait for both futures. We could use the lambda
version of when_all_succeed, but changing
create_metadata_table_if_missing seems a nice API improvement.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200317002051.117832-4-espindola@scylladb.com>
View updates sent as part of the view building process should never
be ignored, but fd49fd7 introduced a bug which may cause exactly that:
the updates are mistakenly sent to background, so the view builder
will not receive negative feedback if an update failed, which will
in turn not cause a retry. Consequently, view building may report
that it "finished" building a view, while some of the updates were
lost. A simple fix is to restore previous behaviour - all updates
triggered by view building are now waited for.
Fixes#6038
Tests: unit(dev),
dtest: interrupt_build_process_with_resharding_low_to_half_test
The "long" test will mean that it is to be started first, not
skipped, so rename "long" to avoid additional confusion
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When qualifying columns to be fetched for filtering, we also check
if the target column is not used as an index - in which case there's
no need of fetching it. However, the check was incorrectly assuming
that any restriction is eligible for indexing, while it's currently
only true for EQ. The fix makes a more specific check and contains
many dynamic casts, but these will hopefully we gone once our
long planned "restrictions rewrite" is done.
This commit comes with a test.
Fixes#5708
Tests: unit(dev)
The intention of the code was to clear sharding metadata
chunked_vector so that it doesn't bloat memory.
The type of c is `chunked_vector*`. Assigning `{}`
clears the pointer while the intended behavior was to reset the
`chunked_vector` instance. The original instance is left unmodified
with all its reserved space.
Because of this, the previous fix had no effect because token ranges
are stored entirely inline and popping them doesn't realease memory.
Fixes#4951
Tests:
- sstable_mutation_test (dev)
- manual using scylla binary on customer data on top of 2019.1.5
Reviewed-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <1584559892-27653-1-git-send-email-tgrabiec@scylladb.com>
An exception thrown after the start of auth_service and before
init_server_without_the_messaging_service_part returns would cause the
sharded<auth_service> destructor to assert.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200317002051.117832-2-espindola@scylladb.com>
When investigating OOM related cores, a common thing to do is trying to
identify the objects in a particularly heavily populated size-class.
This command is meant to help with that, providing a way to list the
objects in any size-class, in a paginated way.
Traversing the objects of a pool is done through a
`small_object_iterator` object which is also exposed to python code, to
be used in custom scripts wanting to scan all objects belonging to a
pool.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200318085437.452906-1-bdenes@scylladb.com>
If delete_atomically() was called with a empty set for any reason,
it will fail to work because it relies on any of the sstables in
the set for getting the sstable directory.
This will be needed, in the future, when using sstable replacement
function only with new sstables.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200305144657.9440-1-raphaelsc@scylladb.com>
There's such an option, and it's not taken into account
on scylla start. There's a symmetrical start_rpc one, which
is, so make both act similarly.
The default value for the option is true, so default set-ups
will not get broken.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200310140518.29410-1-xemul@scylladb.com>
"
Since b783d40aa storage-proxy maintains separate coordinator stats per
scheduling group. This broke scylla_memory, which was still trying to
access the old global stats. This mini-series updates it to be able to
handle per-sg coordinator stats, while preserving backward compatibility
with older versions still using global stats.
"
* 'scylla-memory-per-sg-coordinator-stats/v1' of https://github.com/denesb/scylla:
scylla-gdb.py: scylla_memory: update w.r.t. per-sg coordinator stats
scylla-gdb.py: scylla_memory: move coordinator code to print_coordinator_stats()
"
The debug mode unit tests take ~half-an-hour to complete. Here's
the tests run-times top list
Test: Time (seconds):
... steady tail goes here ...
test/boost/user_function_test 496
test/boost/row_cache_test 502
test/boost/view_schema_test 932
test/boost/cql_query_test 997
test/boost/mutation_reader_test 1048
test/boost/sstable_mutation_test 1417
test/boost/secondary_index_test 1468
Splitting the spike (top-5) is the primary goal. However, the
distribution of test-cases in 3 of those tests is also _very_
non-uniform, so just cutting it into equal parts doesn't work.
For example, the test_index_with_paging from the slowest one
takes ~14 minutes on its own and is the slowest test-case out
there.
So the set does this:
- moves the champion test_index_with_paging into separate file
- detaches the most heavy parts from sstable_mutation_test and
mutation_reader_test into own tests too. The resulting split
is still non-uniform, but it's 4 tests that run notably less
than the 14 minutes record each
- splits the cql_query_test and view_schema_test into several
parts in a wildcard manner to run out of the 14 min threshold
- moves some shared code into lib/
As the result, the debug mode test run takes 14.5 minutes =)
which is almost 2 times faster than it was. The dev mode run
time is not affected noticeably.
Test: well, unit(debug) and unit(dev)
"
* 'br-split-unit-tests-3-next' of https://github.com/xemul/scylla:
test: Split view_schema_test
test: Split cql_query_test
test: Split mutation_reader_test
test: Split sstable_mutation_test
test: Split secondary_index test
Consider
1. Start n1, n2 in the cluster
2. Stop n2 and delete all data for n2
3. Start n2 to replace itself with replace_address_first_boot: n2
4. Kill n2 before n2 finishes the replace operation
5. Remove replace_address_first_boot: n2 from scylla.yaml of n2
6. Delete all data for n2
7. Start n2
At step 7, n2 will be allowed to bootstrap as a new node, because the
application state of n2 in the cluster is HIBERNATE which is not
rejected in the check of is_safe_for_bootstrap. As a result, n2 will
replace n2 with a different tokens and a different host_id, as if the
old n2 node was removed from the cluster silently.
Fixes#5172
Check that SELECT statement checks there is a partition key before
accessing it when determining the shard to execute the query on.
Essentially move the check for properly restricted partition key
from storage_proxy.cc to select_statement.cc, now that we access
it earlier in the call stack.
Keep the check in storage_proxy.cc since storage_proxy::query()
has other call sites (views), which today should never use
serial consistency for its queries, but this can change in the future.
Please note that Cassandra only partially enforce SERIAL consistency
and can silently downgrade SERIAL consistency to the default
non-serial one when doing unbounded SELECTS (
https://issues.apache.org/jira/browse/CASSANDRA-15641)
Fixes#6016
Detach *partition_key* and *clustering_key* ones into own files.
The resultint 2 tests run ~4 minutes each, the leftover ones
complete within 11 minutes. The same -- the goal to run out of
14 minutes is reached, further splitting needs more thinking
than just wildcarding.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This detaches *like_operator*, *group_by*, *functions*
and *large* cases into own files. The split is not
uniform -- the resulting 4 tests run less that 3 minutes
each, what's left in the origin runs ~11 minutes. But
since the goal was to get out of 14 minutes threshold
and this file contains 126 cases (the champion) so I
just did "wildcard" selection that worked.
It also required moving require_rows() helpers into a
local header.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Detach test_multishard_combining_reader_as_mutation_source into
individual file.
This particular test runs ~13 minutes. What's left in the origin
completes a bit faster.
The split also requires moving the reader_lifecycle_policy and
the dummy_partitioner into lib/
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Detach test_schema_changes and test_sstable_conforms_to_mutation_source
into individual files. These two take ~10 minutes each, what's left in
origin finishes within 4 minutes alltogether.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Detach test_index_with_paging into individual file.
This particular test-case is the longest one in the sute,
it takes ~14 minutes to run, further splitting of this
test is pointless (for now) and all subsequent splits in
this set just make the resulting times less than this.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The call for merge_schema_from in some cases is run in the
background and thus is not aborted/waited on shutdown. This
may result in use-after-free one of which is
merge_schema_from
-> read_schema_for_keyspace
-> db::system_keyspace::query
-> storage_proxy::query
-> query_partition_key_range_concurrent
in the latter function the proxy._token_metadata is accessed,
while the respective object can be already free (unlike the
storage_proxy itself that's still leaked on shutdown).
Related bug: #5903, #5999 (cannot reproduce though)
Tests: unit(dev), manual start-stop
dtest(consistency.TestConsistency, dev)
dtest(schema_management, dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Reviewed-by: Pekka Enberg <penberg@scylladb.com>
Message-Id: <20200316150348.31118-1-xemul@scylladb.com>
"
Allow adding compacting to any reader pipeline. The intended users are
streaming and repair, with the goal to prevent wasting transfer
bandwidth with data that is purgeable.
No current user in the tree.
Tests: unit(dev), mutation_reader_test.compacting_reader_*(debug)
"
* 'compacting-reader/v3' of https://github.com/denesb/scylla:
test: boost/mutation_reader_test: add unit test for compacting_reader
test: lib/flat_mutation_reader_assertions: be more lenient about empty mutations
test: lib/mutation_source_test: make data compaction friendly
test: random_mutation_generator: add generate_uncompactable mode
mutation_reader: introduce compacting_reader
When expecting a mutation that compacts to an empty one, allow it to be
not produced at all. After all, compaction normally doesn't even emits
empty partitions.
Currently the mutation source test suite may generate data that is
compactable. This poses a problem for the next patch, where we want to
use it to test `compacting_reader` a reader which compacts data as it
reads it. When the input is compactable, this will introduce artificial
differences, failing the tests.
To allow also testing such readers, make sure data is not compactable,
i.e. compacting it will not change it.
The goal of the mutation source test suite is not to exercise compaction
logic, so this will not take anything away from its value.
The random mutation generator currently generates data and tombstones
with random timestamps selected from a pre-determined range. This
results in mutations where tombstones often cover each other and data.
There is nothing wrong with this, as this is how real data is too.
However for certain tests this is problematic, as compacting the
mutations will result in a different mutations. To cater for these users
too, introduce a `generate_uncompactable` option. When set to `yes`, the
generated mutations will be uncompactable, i.e. no tombstone will cover
lower-level tombstones and no tombstone will cover data. The mutations
will not change after compacted.
Compacting reader compacts the output of another reader on-the-fly.
Performs compaction-type compaction (`compact_for_sstables::yes`).
It will be used in streaming and repair to eliminate purgeable data from
the stream, thus prevent wasting transfer bandwidth.
Merged pull request https://github.com/scylladb/scylla/pull/5996 from
Calle Wilund:
Fixes#4992
Implements post-image support by synthesizing it from
pre-image + delta.
Post-image data differs from the delta data in two ways:
1.) It merges non-atomics into an actual result value
2.) It contains all columns of the row, not just
those affected by the update.
For a non-atomic field, the post-image value of a column
is either the pre-image or the delta (maybe null)
Tested by adding post-image checks to pre-image test
and collection/udt tests
Fixes#4992
Implements post-image support by synthesizing it from
pre-image + delta.
Post-image data differs from the delta data in two ways:
1.) It merges non-atomics into an actual result value
2.) It contains _all_ columns of the row, not just
those affected by the update.
For a non-atomic field, the post-image value of a column
is either the pre-image or the delta (maybe null)
Tested by adding post-image checks to pre-image test
and collection/udt tests
"
This PR makes it possible to enable the usage of different partitioner for each table. If no table-specific partitioner is set for a given table then a default partitioner is used.
The PR is composed of the following parts:
- Introduction of schema::get_partitioner that still returns dht::global_partitioner
- Replacement of all the usage of dht::global_partitioner with schema::get_partitioner
- Making it possible to set table-specific partitioner in a schema_builder
- Remove all the places that were setting default partitioner except for main.cc (mostly tests)
- Move default partitioner from i_partitioner to schema.cc and hide it from the rest of the codebase
- Remove dht::global_partitioner
After this PR there's no such thing as global partitioner at all. There is only a default partitioner but it still has to be accessed through schema::get_partitioner.
There are some intermediate states in which i_partitioner is stored as shared_ptr in the schema but the final version keeps it by const&.
The PR does not enable per table partitioner end-to-end. Just the internals of the single node are covered. I still have to deal with:
- Making sure a table has the same partitioner on each node
- Allowing user to set up a table-specific partitioner on table
- Signal driver about what partitioner is used by a given table
- Persist partitioner info for each table that does not use default partitioner.
Fixes#5493
Tests: unit(dev, release, debug), dtest(byo)
"
* 'per_table_partitioner' of https://github.com/haaawk/scylla:
schema: drop optional from _partitioner field
make_multishard_combining_reader: stop taking partitioner
split_range_to_single_shard: stop taking partitioner as argument
tests: remove unused murmur3 includes
partitioner: move default_partitioner to schema.cc
partitioner: hide dht::default_partitioner
schema: include partitioner name in scylla tables mutation
schema: make it possible to set custom partitioner
scylla_tables: add partitioner column
schema_features: add PER_TABLE_PARTITIONERS feature
features: add PER_TABLE_PARTITIONERS feature
* seastar 47d929dd1...3c498abca (5):
> reactor: Use do_with to save stack space
> reactor: Extract code into a schedule_retry helper
> reactor: Move an io_event buffer out of the stack
> temporary_buffer: fix typo in argument type in comparison operators
> tests: tls_test: add missing include <iostream>
This adds a warning with a different limit in each mode. The limit is
picked as 1KiB lower than the value where no warning would be print.
This makes it easy to spot the worse offender. With that we can either
fix it or silence the warning once we are sure we can handle large
frames in that context.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200311205300.324383-1-espindola@scylladb.com>
The bug is that we failed to implement this part of the formula:
(T - C) * log4(T)
We were incorrectly implementing it as:
(T - C) * log4(T - C)
So it could result in a backlog being calculated as negative when it
should actually be positive, or backlog being lower than expected.
BTW, we do protect against negative backlog after commit 3e08bd17f0.
Given that STCS backlog tracker is inherited by TWCS and LCS trackers,
all compaction strategies are affected.
The formula to calculate the aggregate backlog is:
A = (T - C) * log4(T) - Sum(i = 0...N) { (Si - Ci)* log4(Si) }.
For example, negative backlog is calculated on a tested scenario where T
was 3129, C was 2337 and Sum(i = 0...N) { (Si - Ci)* log4(Si) } resulted
in 4222.53.
(T - C) * log4(T - C) = (3129 - 2337) * log4(3129 - 2337) = 3813.23
So backlog is negative because A = 3813.23 - 4222.53 = -409.302
But it should actually be calculated as follow:
(T - C) * log4(T) = (3129 - 2337) * log4(3129) = 4598.15
And the correct backlog is positive, as A = 4598.15 - 4223.53 = 375.621
Fixes#6021.
tests: unit(dev)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200315153711.23302-1-raphaelsc@scylladb.com>
When the user performed
alter ks.t with compaction = {...}
the values of most other options, which were not specified in the
statement, e.g. compression, were left unchanged. That wasn't true for
extension options however: for example, the "cdc" option was removed.
This commit fixes the behavior to keep the old values of extension
options not specified in the alter statement.
The function already takes schema so there's no need
for it to take partitioner. It can be obtained using
schema::get_partitioner
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Remove last usage of this global outside i_partitioner.cc
and hide it inside the compilation unit.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
There are two results of this patch:
1. New partitioner name column is persited on node's disk in scylla_tables
2. New partitioner name column is included into schema digest
This is achieved by including this new column in scylla tables mutation.
For that we:
1. Add partitioner name to the result of make_scylla_tables_mutation.
If table does not have a specific partitioner set and uses default
partitioner then we don't include the name of such default partitioner.
Only the name of custom partitioner is added if a table has one.
2. In create_table_from_mutations we check whether scylla tables mutation
has a partitioner name set. If so then we use it as a parameter for
schema_builder.
Note that previous patches have ensured that this new column will be included
into schema digest only after the whole cluster supports per table partitioners.
Before that, during rolling upgrade, new partitioner name column is hidden and
not shared with other nodes.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
schema_builder::with_partitioner can be used now to
set custom partitioner on a table.
If no such partitioner is set, global partitioner is
still used.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Following commits make it possible to set a specific
partitioner for a table. We want to persist that information
and include it into schema digest. For that a new column
in scylla_tables is needed. This commit adds such column.
We add the new column to scylla_tables because it's a Scylla
specific extension.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
With per table partitioners, partitioner name will be a part
of table schema. To allow rolling upgrade we need to perform
special logic that hides new partitioner name schema column
during the upgrade. This commit adds new schema feature that
controls this logic.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This new feature is required because we now allow
setting partitioner per table. This will influence
the digest of table schema so we must not include
partitioner name into the digest unless we know that
the whole cluster already supports per table partitioners.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Subscript operation `__getitem__()` was only added to re.match objects
in 3.6. To support previous versions, use `groups()` method to obtain
the desired group.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200313160025.319464-1-bdenes@scylladb.com>
While CQL does not allow creation of a materialized view with more than one
base regular column in the view's key, in Alternator we do allow this - both
partition and clustering key may be a base regular column. We had a bug in
the logic handling this case:
If the new base row is missing a value for *one* of the view key columns,
we shouldn't create a view row. Similarly, if the existing base row was
missing a value for *one* of the view key columns, a view row does not
exist and doesn't need to be deleted. This was done incorrectly, and made
decisions based on just one of the key columns, and the logic is now
fixed (and I think, simplified) in this patch.
With this patch, the Alternator test which previously failed because of
this problem now passes. The patch also includes new tests in the existing
C++ unit test test_view_with_two_regular_base_columns_in_key. This tests
was already supposed to be testing various cases of two-new-key-columns
updates, but missed the cases explained above. These new tests failed
badly before this patch - some of them had clean write errors, others
caused crashes. With this patch, they pass.
Fixes#6008.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200312162503.8944-1-nyh@scylladb.com>
"
It doesn't look like we will be able to switch to std::string just
yet, but when it is not too inconvenient we can try to reduce our
dependence so that attempting the switch again in the future is
easier.
"
* 'espindola/sstring-api' of https://github.com/espindola/scylla:
redis: Use scattered_message::append(std::string_view)
everywhere: Use uninitialized_string instead of sstring::initialized_later
compressor: Add an explicit cast to const sstring&
everywhere: Be more explicit that we don't want std::make_shared
cql3: Don't use sstring::reset
everywhere: Don't assume sstring::begin() and sstring::end() are pointers
make_directory_for_column_family() is used in a parallel_for_each() in
parse_system_tables(). Because parallel_for_each does not preempt
in the initial execution of its input function, and because each thread
allocates 128k for the stack, we end up allocating many hundreds of
megabytes if there are many tables.
This happens early during boot and will only cause problems if
there are 5,000 tables per gigabyte of shard memory, and unlikely
combination that will probably fail later, but still it is better to
avoid unnecessary large allocations.
This was developed in order to fix#6003, until it was discovered that
c020b4e5e2 ("logalloc: increase capacity of _regions vector
outside reclaim lock") is the real fix.
Message-Id: <20200313093603.1366502-1-avi@scylladb.com>
In test_services.cc there is
gms::feature_service test_feature_service;
And the feature_service constructor has
, _lwt_feature(*this, features::LWT)
But features::LWT is a global sstring constructed in another file.
Solve the problem by making the feature strings constexpr
std::string_view.
I found the issue while trying to benchmark the std::string switch.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Acked-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200309225749.36661-1-espindola@scylladb.com>
Since b783d40aa storage-proxy maintains separate coordinator stats per
scheduling group. This broke scylla_memory, which was still trying to
access the old global stats. Update it to print the new per-scheduling
group stats when they are available and the old global ones when not.
Scheduling groups for which all relevant metrics are 0 are omitted from
the printout to reduce noise.
In order to avoid the UI merge button which tends to
mess up commit authors, a simple script for pulling
a PR from GitHub is added.
Example usage:
$ git fetch; git checkout origin/next
$ ./scripts/pull_github_pr.sh 6007
Message-Id: <1fa79c8be47b5660fc24a81fc0ab381aa26d98af.1584014944.git.sarna@scylladb.com>
This patch adds a test, test_gsi.py::test_gsi_missing_attribute_3,
reproducing issue #6008. The issue is about a GSI with *two*
regular base columns becoming key columns in a view, and we have
a write failure when writing an item with one of these attributes
missing.
The test passes on DynamoDB, currently xfails on Alternator.
Refs #6008.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200312064131.16046-1-nyh@scylladb.com>
Recently, Materialized Views were modified (see issue #4365) so that local
view updates (when both base and view replicas are the same node) are
synchronous. In particular, when the view's partition key is the same as
the base table's, view writes are synchronous: A write now only returns
after CL copies of the view data have been written.
Alternator's LSI have exactly this case (same partition key as the base).
This makes strongly-consistent (CL=LOCAL_QUORUM) reads in Alternator work
correctly, so we update the documentation accordingly to no longer say
that we don't support this DynamoDB feature.
However unlike LSIs, for GSIs strongly-consistent reads are still not
supported, and should not be supported (they are also not supported by
DynamoDB). Such reads should generate an error. So this patch fixes this
too. A GSI test which tested that strongly consistent reads are forbidden,
which used to xfail, now passes so the patch removes the "xfail".
Finally, we can simplify the LSI tests by using consistent reads instead of
eventually-consistent reads with retries. Beyond simplifying the test, it's
also an opportunity to *use* strongly-consistent reads and make sure that
they work (while, as mentioned above, similar reads for GSIs are refused).
Fixes#5007
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200311170446.28611-1-nyh@scylladb.com>
* seastar 95f4277c16...664c911b4c (4):
> tls_test: Use uninitialized_string instead of initialized_later
> tls: Fix race and stale memory use in delayed shutdown
Fixes#5759 (maybe)
> tls: Re-enable TLS test and fix build+run
> tls: Set server name for client connection if available
Reclaim consults the _regions vector, so we don't want it moving around while
allocating more capacity. For that we take the reclaim lock. However, that
can cause a false-positive OOM during startup:
1. all memory is allocated to LSA as part of priming (2baa16b371)
2. the _regions vector is resized from 64k to 128k, requiring a segment
to be freed (plenty are free)
3. but reclaiming_lock is taken, so we cannot reclaim anything.
To fix, resize the _regions vector outside the lock.
Fixes#6003.
Message-Id: <20200311091217.1112081-1-avi@scylladb.com>
In theory the C++11 ABI should already have a size field but it does not
in the version of the C++ standard library shipped with scylla 2019.1.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200225162337.112582-1-bdenes@scylladb.com>
Currently if `startswith` is passed to `resolve()` it will
unconditionally try to match the resolved symbol name against it. This
will of course fail when the symbols fails to resolve and `name` is
`None`. Return early when this happens to prevent the unnecessary
prefix matching.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200310140918.88928-1-bdenes@scylladb.com>
The current method of obtaining the text range based on a known vptr
(`reactor::_backend`) was based on branch-2019.1, where
`reactor::_backend` is a value member. However in >=3.0
`reactor::_backend` is a `std::unique_ptr<>`. Adapt the code to work for
both.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200310135957.86261-1-bdenes@scylladb.com>
Merged patch series by Piotr Sarna:
This series makes view updates synchronous, as long as the update
is going to be applied locally.
With this feature, local secondary indexes and, more generally,
materialized views with partition keys same as in the base table
could enjoy more robust consistency.
This series comes with a cql test, not common for materialized
views, which usually require eventual consistency checks. With
synchronous updates however, the test can simply check view values
right after updating the base table.
Fixes#4365
Refs #5007
Tests: unit(dev), manually via inserting sleeps and debug messages,
to make sure that local view updates are actually waited for
Piotr Sarna (4):
db,view: drop default parameter for mutate_MV::allow_hints
db,view: move putting view updates to background to mutate_MV
db,view: perform local view updates synchronously
test: add a simple test for synchronous local view updates
With synchronous local view updates enabled, local materialized views
can be queried right after base table insertions, without the risk
of reading stale values.
Local view updates (updates applied to a local node,
without remote communication) are from now on performed
synchronously - which adds consistency guarantees, as a local
write failure will be returned to the client instead of being
silently ignored.
Currently, launching view updates as an asynchronous background job
is done via not waiting for mutate_MV() future in
table::generate_and_propagate_view_updates. That has a big downside,
since mutate_MV() handles *all* view updates for *all* views of a table,
so it's not possible to wait for each view independently.
Per-view granularity is required in order to implement synchronous
view updates of local views - because then we'll synchronously
wait for all views that write to a local node (due to having a matching
partition key with the base), while remote view updates will still
be sent asynchronously.
In order to do that, instead of not waiting for mutate_MV,
we do wait for it properly, but instead launch the asynchronous,
unwaited-for futures inside mutate_MV.
Effectively that means no changes for view updates so far - all updates
will be fired in the background. Later, another patch will introduce
a way to wait for selected updates to finish.
This is just a trivial wrapper over initialized_later when using
sstring, but also works when std::string is used.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Some difference on how exactly the operator== is declared for sstring
versus std::string requires this change if we convert from sstring to
std::string.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
If sstring is made an alias to std::string ADL causes std::make_shared
to be found. Explicitly ask for ::make_shared.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
If we switch to using std::string we have to handle begin and end
returning iterators.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
It may be not safe to move sharded services, so it will be prohibited in
the future seastar update. Remove all current cases where we do it.
Fixes#5814.
Message-Id: <20200301095423.GY434@scylladb.com>
If the feature service is stopped without enabling some features,
the latrer may end up with "broken promise" exception on futures
attached to the _pr promise. Fix this by switching the only user
of it onto 'listener' API and remove future-based one.
Tests: unit(debug), manual start-stop and aborted-start
"
The cql_configu is needed by storage_service to feed it to
thrift/transport servers. These servers, in turn, put the
config onto query_options. The final goal of this config
reference is the guts of query_processor (but currently it's
only used by restrictions)
This way is rather long and confusing. It seems more natural
to keep the cql_config on it's main "user" -- query processor.
This patch set does so. However, in order to push the config
into its current usage places a huge refactoring is needed --
most of the classes in cql3/statements and cql3/restrictions.
It's much more handy to contunue keeping it via query_options,
so the query_processor is equipped with the method to return
the reference on the config to those initializing query_options.
Tests: unit(debug)
"
* 'br-clean-client-services-from-cql-config-2' of https://github.com/xemul/scylla:
storage_service: Forget cql_config
transport: Forget cql_config
thrift: Forget cql_config
query_processor: Carry reference on cql_config
The is_log_for_some_table function incorrectly assumed that
database::find_schema would return a null pointer in case the queried
schema does not exist. This patch fixes that, and now this function
checks for existence of the schema using database::has_schema.
Tests: unit(dev)
It is ok to run repair_row_level_stop unconditionally. The node that
hasn't received the repair_row_level_start will simply return an error
that the repair_meta_id is not found. To avoid the unnecessary
repair_row_level_stop verb, we can stop the nodes have run
repair_row_level_start. This also makes the error message less
confusing.
For example:
Before:
INFO 2020-03-09 15:55:43,369 [shard 0] repair - repair id 1 on shard 0
failed: std::runtime_error (get_repair_meta: repair_meta_id 8 for
node 127.0.0.4 does not exist)
INFO 2020-03-09 15:55:43,369 [shard 0] repair - repair id 1
failed: std::runtime_error ({shard 0: std::runtime_error
(get_repair_meta: repair_meta_id 8 for node 127.0.0.4 does not
exist)})
WARN 2020-03-09 15:55:43,369 [shard 0] repair - repair id 1 to
sync data for keyspace=ks, status=failed, keyspace does not exist
any more, ignoring it: std::runtime_error ({shard 0:
std::runtime_error (get_repair_meta: repair_meta_id 8 for node
127.0.0.4 does not exist)})
After:
INFO 2020-03-09 16:09:09,217 [shard 0] repair - repair id 1 on shard 0 failed:
std::runtime_error (Failed to repair for keyspace=ks, cf=cf,
range=(9041860168177642466, 9044815446631222376])
INFO 2020-03-09 16:09:09,217 [shard 0] repair - repair id 1 failed:
std::runtime_error ({shard 0: std::runtime_error (Failed to repair
for keyspace=ks, cf=cf, range=(9041860168177642466,
9044815446631222376])})
WARN 2020-03-09 16:09:09,217 [shard 0] repair - repair id 1 to sync data
for keyspace=ks, status=failed, keyspace does not exist any more,
ignoring it: std::runtime_error ({shard 0: std::runtime_error
(Failed to repair for keyspace=ks, cf=cf,
range=(9041860168177642466, 9044815446631222376])})
Refs #5942
It needs the config purely to feed one into thrift/transport
server, since the latter two no longer needs one, neither does
the former.
As a nice side effect -- some tests no longer have to carry
the cql_config on board.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
* seastar affc3a5107...5eaec672a2 (12):
> test_thread_custom_stack_size_failure: Use a larger custom stack
> test_thread_custom_stack_size: Use a larger custom stack
> log: correct help message
> perftune.py: verify NIC existence
> Merge "Fix various memory issues in http" from Rafael
> build: Fix IN_LIST usage
> future: Disable -Wuninitialized on a particular memcpy
> build: use IN_LIST for shorter cmake
> build: check support of "-fstack-clash-protection" before using it
> configure.py: Add "--verbose" flag
> configure.py: Make "cmake" command line human-readable
> net: dynamically adjust buffer sizes for posix connected_socket read operations
SimpleStrategy creates a list of endpoints by iterating over the set of
all configured endpoints for the given token, until we reach keyspace
replication factor.
There is a trivial coding bug when we first add at least one endpoint
to the list, and then compare list size and replication factor.
If RF=0 this never yields true.
Fix by moving the RF check before at least one endpoint is added to the
list.
Cassandra never had this bug since it uses a less fancy while()
loop.
Fixes#5962
Message-Id: <20200306193729.130266-1-kostja@scylladb.com>
Fixes#5899
When terminating (closing) a segment, we write a trailing block
of zero so reader can have an empty region after last used chunk
as end marker. This is due to using recycled, pre-allocated
segments with potentially non-zero data extending over the point
where we are ending the segment (i.e. we are not fully filling
the segment due to a huge mutation or similar).
However, if we reach end of segment writing the final block
(typically many small mutations), the file will end naturally
after the data written, and any trailing zero block would in fact
just extend the file further. While this will only happen once per
segment recycled (independent on how many times it is recycled),
it is still both slightly breaking the disk usage contract and
also potentially causing some disk stalls due to metadata changes
(though of course very infrequent).
We should only write trailing zero if we are below the max_size
file size when terminating
Adds a small size check to commitlog test to verify size bounds.
(Which breaks without the patch)
Message-Id: <20200226121601.15347-2-calle@scylladb.com>
Introduce a test which checks how different CQL features (DML, LWT,
MV) work when no replicas are available (e.g. because
they are all in an unavailable data center).
Specifically the test checks that when we SELECT with IN clause
and there are no available replicas, there is no crash (#5935).
Message-Id: <20200306192521.73486-3-kostja@scylladb.com>
The list of all endpoints for a query can be empty if we have
replication_factor 0 or there are no live endpoints for this token.
Do not access all_replicas.front() in this case.
Fixes#5935.
Message-Id: <20200306192521.73486-2-kostja@scylladb.com>
If base mutation has at least one row tombstone, its preimage log
entry is constructed from all the base columns.
Fixes#5709
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
In case a static and a clustering row is written at the same time, but
a clustering row with given key was not present, the preimage query was
incorrectly configured and no rows were returned. This resulted in an
empty preimage, while a preimage for static row should be present.
This patch fixes this and now the static row is correctly written to cdc
log in the case above.
Tests: unit(dev)
This change disallows creating CDC log tables for already existing
CDC log tables. CDC logs nested in that way are not really useful
and do not work at the moment, therefore disallowing their creation
prevents confusion.
Fixes#5967
Tests: unit(dev)
* piodul/5967-disallow-nested-cdc-logs:
cdc: disallow creating nested CDC logs
cql_repl: register schema extensions
... schema::get_partitioner and make schema::get_partitioner
return const&' from Piotr
Partitioners returned from get_partitioner are shared
and not supposed to be changed so let's use the type system
to enforce that.
dht::global_partitioner() is deprecated and will be removed
as soon as custom partitioners are implemented so it's best
to replace it with schema::get_partitioner.
Tests: unit(dev)
* hawk/global_partitioner_cleanup:
schema: get_partitioner return const&
compaction_manager: stop calling dht::global_partitioner()
sstable_datafile_test: stop calling dht::global_partitioner()
Previously the tokens were stored as strings
because token could have been represented in multiple ways.
Now token representation is always int64_t so we can
store them as ints in cdc description as well.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This change disallows creating CDC log tables for already existing CDC
log tables. CDC logs nested in that way are not really useful and do not
work at the moment, therefore disallowing their creation prevents
confusion.
Alternator and CDC, apart from enabling their experimental features,
need to have their schema extensions registered. This patch adds missing
registration of schema extensions to cql_repl, so that cql tests written
with Alternator or CDC in mind will properly work.
Fixes#5902 by making the LIKE restriction keep a vector
of matchers and apply them all to the column value.
Tests: unit (dev)
* dekimir/multiple-likes:
cql3: Allow repeated LIKE on same column
cql3: Forbid calling LIKE::values()
cql3: Move LIKE::_last_pattern to matcher
In order to properly validate not only network topology strategy,
but also other strategies, the checks are moved straight to
validate_replication_factor().
Also, the test case is extended with a too long integer
and a check for SimpleStrategy replication factor.
Fixes#3801
Tests: unit(dev)
Message-Id: <e0c3c3c36c589e1d440c9708a6dce820c111b8da.1583483602.git.sarna@scylladb.com>
Check that XML output of a test is valid and warn otherwise.
The following tests currently produce a warning:
boost/multishard_mutation_query_test
Message-Id: <20200305213501.52279-2-kostja@scylladb.com>
In order to prevent users from creating a network topology
strategy instance with invalid inputs, it's not enough to use
std::stol() on the input: a string "3abc" still returns the number '3',
but will later confuse cqlsh and other drivers, when they ask for
topology strategy details.
The error message is now more human readable, since for incorrect
numeric inputs it used to return a rather cryptic message:
ServerError: stol()
This commit fixes the issue and comes with a simple test.
Fixes#3801
Tests: unit(dev)
Message-Id: <7aaae83d003738f047d28727430ca0a5cec6b9c6.1583478000.git.sarna@scylladb.com>
... and reading cdc metadata' from Piotr
Currently, information on what cdc options are enabled
in a table - cdc metadata in short - is stored in two places:
In cdc column of the system_schema.scylla_tables,
In a cdc schema extension.
The former is used as a source of truth, i.e. a node reads cdc metadata
from that column, while the latter is used for cosmetic purposes
(e.g. cqlsh displays info on cdc based on this extension)
and is only written, but never read by the node.
Introducing the cdc column to scylla_tables made the logic
of schema agreement more complicated. As a first step of removing
this column, this PR makes the cdc schema extension as the
"source of truth" - a node will from now on read cdc metadata
from that extension.
The cdc column will be deprecated and removed in subsequent releases,
but it is left for now and will still be written to in order not to break
the logic of schema agreement.
Acked-by: Nadav Har-El <nyh@scylladb.com>
Refs: #5737
Tests: unit(dev), 2-node cluster upgrade under write load to a cdc-enabled table
* piodul/5737-cdc-schema-extension:
schema: get cdc options from schema extensions
alter_table_statement: fix indentation
cf_prop_defs: initialize schema extensions externally
cf_prop_defs: move checking of cdc support to ::validate
cf_prop_defs: pass database& to ::validate, not db::extensions&
unit tests: register cdc extension before tests
cdc: construct cdc_options directly inside cdc_extension
db::extensions: add shorthands for add_schema_extension
Moves initialization of schema extensions outside of cf_prop_defs. This
allows to construct these extensions once, and use them several times in
cd_prop_defs' methods without caching or recalculating them several
times.
Changes cf_prop_defs::validate function to take database& as an argument
instead of db::extensions&. This change will allow us to move the check
which asserts that the cluster supports CDC from `apply_to_builder` to
`validate` method.
Instead of storing a raw map of options inside `cdc_extension`, the
extension now converts them into `cdc_options` directly on construction.
This removes the need to construct `cdc_options` object multiple times.
With #5950 we changed the representation of stream_id
in CDC Log from two int columns to a single blob column.
This PR cleans up stream_id representation internally.
Now stream_id is stored as blob both in-memory and in
internal CDC tables.
Tests: unit(dev)
* hawk/stream_id_representation:
cdc: store stream_ids as blobs in internal tables
cdc: improve do_update_streams_description
cdc: Fix generate_topology_description
cdc: add stream_id::operator<
cdc: change stream_id representation
Fixes#5891
Refs #5899
When creating segments with the o_dsync option active, we write max_size
zeros to disk, to ensure actual disk blocks are allocated.
However, if we recycle a segment, we should, when not actually creating
a new file, check the existing size on disk, and only zero any blocks
not already allocated (i.e. if recycled file was smaller than max_size,
due to segement truncation on close).
test: unit
Message-Id: <20200226121601.15347-1-calle@scylladb.com>
from Kamil.
If a batch update is performed with a sequence of changes with a single
timestamp, they will now show up in CDC with a single timeuuid
in the cdc$time column, distinguished by different cdc$batch_seq_no values.
Fixes#5953.
Tests: unit(dev)
* haaawk/splitbatch:
cdc: use a single timeuuid value for a batch of changes
cdc: replace `split` with `for_each_change`
Now it will show the full info about range being streamed, like
range_streamer - Rebuild with 127.0.0.2 for keyspace=ks2, streaming [72, 96) out of 248 ranges
The [x, y) range is semi-open one, the full streaming progress
then can be logged like
... streaming [0, 16) out of 36 ranges <- first send
... streaming [16, 24) out of 36 ranges
... streaming [24, 36) out of 36 ranges <- last send
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200304101505.5506-1-xemul@scylladb.com>
If a batch update is performed with a sequence of changes with a single
timestamp, they will now show up in CDC with a single timeuuid in the
`time` column, distinguished by different `batch_seq_no` values.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Presently lightweight transactions piggy back the old
row value on prepare round response. If one of the participants
did not provide the old value or the values from peers don't match,
we perform a full read round which will repair the Paxos table and the
base table, if necessary, at all participants.
Capture the fact that read optimization has failed in a metric.
Message-Id: <20200304192955.84208-2-kostja@scylladb.com>
`for_each_change` is like `split` but it doesn't return a vector of
mutations representing each change; instead, it takes as a parameter
a function which gets called on each mutation.
This reduced the memory usage and allows to preserve common context
when handling each change (will be useful in next commits).
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
The relocatable package requires a magic dynamic linker path for
"patchelf" to work correctly. Therefore, use the "get-dynamic-linker.sh"
script to unconditionally define a magic dynamic linker path to ensure
that building the relocatable package with ninja build ("ninja-build
build/<mode>/scylla-package.tar.gz") is always correct. Although the
path looks odd with a lot of leading slashes, it works outside
relocatable package too.
Message-Id: <20200305091919.6315-2-penberg@scylladb.com>
In preparation for moving dynamic linker flags to ninja build, move the
magic dynamic linker path generation to "reloc/get-dynamic-linker.sh"
script that configure.py can call.
Message-Id: <20200305084331.5339-1-penberg@scylladb.com>
In new CDC Log format stream_id is represented by a single
blob column so it makes sense to store it in the same form
everywhere - including internal CDC tables.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Use std::set::insert that takes range instead of
looping through elements and adding them one by one.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
In new CDC Log format we store only a single stream_id column.
This means generate_topology_description has to use appropriate
schema for generating tokens for stream_ids.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
New CDC Log format stores stream ids as blobs.
It makes sense to keep them internally in the same form.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Currently, writes to a static row in a base table are not reflected
at all in the corresponding cdc log. This patch causes such writes
to be properly logged.
Fixes: #5744
Tests: unit(dev)
* piodul/5744-handle-static-row-correctly-in-cdc:
cdc_test: add tests for handling static row
cdc: fix indentation in transformer::transform
cdc: handle static rows separately in transformer::transform
cdc: move process_cells higher (and fix captured variables)
cdc: reduce dependencies on captured variables in process_cells
cdc: fix preimage query for static rows
Until this patch, we used the default_smp_service_group() when bouncing
Alternator requests between shards (which is needed for LWT).
This patch creates a new smp_service_group for this purpose, which is
limited to 5000 concurrent requests (the same limit used for CQL's
bounce_request_smp_service_group). The purpose of this limit is to avoid
many shards admitting a huge number of requests and bouncing all of them
to the same shard who now can't "unadmit" these requests.
Fixes#5664.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200304170825.27226-1-nyh@scylladb.com>
We use boost test logging primarily to generate nice XML xunit
files used in Jenkins. These XML files can be bloated
with messages from BOOST_TEST_MESSAGE(), hundreds of megabytes
of build archives, on every build.
Let's use seastar logger for test logging instead, reserving
the use of boost log facilities for boost test markup information.
Now, if CDC is enabled, `paxos_response_handler::learn_decision()`
augments the base table mutation. The differences in logic between:
(1) `mutate_internal<std::vector<mutation>>()`
and
(2) `mutate_internal<std::vector<std::tuple<paxos::proposal, schema_ptr, ...>>>()`
make it necessary to separate "CDC mutations" from "base mutation"
and send them, respectively, to (1) and (2).
Gleb explained in #5869 why it became necessary to add CDC code to LWT
writes specifically, instead of doing it somewhere central that affects
all writes:
"All paths that do write goes through mutate_internally() eventually so it
would have been best to do augmentations there, but cdc chose to log only
certain writes and not others (unlike MV that does not care how write
happened) and mutate_internal have no idea which is which so I do not have
other choice but code duplication. ... paxos_response_handler::learn_decision
is probably the place to add cdc augmentation."
Fixes#5869
It is possible to produce an empty mutation using CQL. For example, the
following query:
DELETE FROM ks.tbl WHERE pk = 0 AND ck < 1 AND ck > 2;
will attempt to delete from an empty range of rows. This is translated
to the following mutation:
{ks.tbl {key: pk{000400000000}, token:-3485513579396041028}
{mutation_partition:
static: cont=1 {row: },
clustered: {}}}
Such mutation does not contain any timestamp, therefore it is difficult
to determine what timestamp was used while making the query. This is
problematic for CDC, because an entry in CDC log should be written with
the same timestamp as a part of the mutation.
Because an empty mutation does not modify the table in any way, we can
safely skip logging such mutations in CDC and still preserve the
ability to reconstruct the current state of the base table from full
CDC log.
Tests: unit(dev)
Before this patch, `transform` did not generate any log rows about
static row change. This commit fixes that - now, a log row is created if
a static row is changed, and this row is separate from the rows that
describe changes to the clustering rows.
This is a preparation for moving the lambda outside the for loop.
- `log_ck`, `pikey`, `pirow` are now passed as arguments,
- `value` is now a variable local to the lambda,
- `ttl` is now a variable local to the lambda that is returned.
Most test-methods log a message with their names upon entering them.
This helps in identifying the test-method a failure happened in in the
logs. Two methods were missing this log line, so add it.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200304155235.46170-1-bdenes@scylladb.com>
Regular compaction relies on compaction manager to run compaction jobs
until compaction strategy is satisfied. Resharding, on the other hand,
is an one-off operation which runs only once in compaction manager,
and leave the sstable set in such a way that the strategy is very
likely unsatisfied. We need to trigger regular compaction whenever
a resharding job replaces a shared sstable by an unshared sstable,
so that compaction will not fall way behind due to lots of new sstables
created by resharding process.
Fixes#5262.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200217144946.20338-1-raphaelsc@scylladb.com>
Merged patch series from Avi Kivity:
boost/multiprecision is a heavyweight library, pulling in 20,000 lines of code into
each header that depends on it. It is used by converting_mutation_partition_applier
and types.hh. While the former is easy to put out-of-line, the latter is not.
All we really need is to forward-declare boost::multiprecision::cpp_int, but that
is not easy - it is a template taking several parameters, among which are non-type
template parameters also defined in that header. So it's quite difficult to
disentangle, and fragile wrt boost changes.
This patchset introduces a wrapper type utils::multiprecision_int which _can_
be forward declared, and together with a few other small fixes, manages to
uninclude boost/multiprecision from most of the source files. The total reduction
in number of lines compiled over a full build is 324 * 23,227 or around 7.5
million.
Tests: unit (dev)
Ref #1https://github.com/avikivity/scylla uninclude-boost-multiprecision/v1
Avi Kivity (5):
converting_mutation_partition_applier: move to .cc file
utils: introduce multiprecision_int
tests: cdc_test: explicitly convert from cdc::operation to uint8_t
treewide: use utils::multiprecision_int for varint implementation
types: forward-declare multiprecision_int
configure.py | 2 +
concrete_types.hh | 2 +-
converting_mutation_partition_applier.hh | 163 ++-------------
types.hh | 12 +-
utils/big_decimal.hh | 3 +-
utils/multiprecision_int.hh | 256 +++++++++++++++++++++++
converting_mutation_partition_applier.cc | 188 +++++++++++++++++
cql3/functions/aggregate_fcts.cc | 10 +-
cql3/functions/castas_fcts.cc | 28 +--
cql3/type_json.cc | 2 +-
lua.cc | 38 ++--
mutation_partition_view.cc | 2 +
test/boost/cdc_test.cc | 6 +-
test/boost/cql_query_test.cc | 16 +-
test/boost/json_cql_query_test.cc | 12 +-
test/boost/types_test.cc | 58 ++---
test/boost/user_function_test.cc | 2 +-
test/lib/random_schema.cc | 14 +-
types.cc | 20 +-
utils/big_decimal.cc | 4 +-
utils/multiprecision_int.cc | 37 ++++
21 files changed, 627 insertions(+), 248 deletions(-)
create mode 100644 utils/multiprecision_int.hh
create mode 100644 converting_mutation_partition_applier.cc
create mode 100644 utils/multiprecision_int.cc
This reduces the number of translation units that depend on
boost/multiprecision from 354 to 30, and reduces the size of
database.i (as an example) from 406160 to 382933 (smaller
files will benefit more, relatively).
Ref #1
The goal is to forward-declare utils::multiprecision_int, something
beyond my capabilities for boost::multiprecision::cpp_int, to reduce
compile time bloat.
The patch is mostly search-and-replace, with a few casts added to
disambiguate conversions the compiler had trouble with.
After the varint data type starts using the new multiprecision_int type,
this code fails to compile. I expect that somehow the conversion from enum
class to cpp_int was allowed to succeed, and we ended up with a data_value
of type varint. The tests succeeded because the serialized representation
happened to be the same.
Previously we had stream_id_1 and stream_id_2 columns
of type long each. They were forming a partition key.
In a new format we want a single stream_id column that
forms a partition key. To be able to still store two
longs, the new column will have type blob and its value
will be concatenated bytes of two longs that
partition key is composed of.
We still want partition key to logically be two longs
because those two values will be used by a custom partitioner
later once we implement it.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
multiprecision_int is a wrapper around boost::multiprecision::cpp_int that adds
no functionality. The intent is to allow forward declration; cpp_int is so
complicated that just finding out what its true type is a difficult exercise, as
it depends on many internal declarations.
Because cpp_int uses expression templates, the implementation has to explicitly
cast to the desired type in many places, otherwise the C++ compile is presented
with too many choices, especially in conjunction with data_value (which can
convert from many different types too).
converting_mutation_partition_applier is a heavyweight class that is not
used in the hot path, so it can be safely out-of-lined. This moves
some includes to boost/multiprecision out of header files, where they
can infect a lot of code.
mutation_partition_view.cc's includes were adjusted to recover
missing dependencies.
The previous version errorneously used local db reference
which was propagated into another shard. This time carry
the sharded instance and use .local() as before.
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200303221729.31261-1-xemul@scylladb.com>
I found that a few variables in cql_test_env were wrapping sharded in
shared_ptr for no apparent reason. These patches convert them to plain
sharded<...>.
Currently, the Dockerfile installs the latest version of Scylla. Let's
add a VERSION argument to Dockerfile, which explicitly specifies the
version to ensure scripts, for example, always build the expected
version. If no VERSION is specified for "docker build", use the default
value of "666.development", which is the version number for latest
nightly.
"These two patches were made suspect of failing next promotion and
excluded from the original series."
* 'test.py.log' of https://github.com/kostja/scylla:
test.py: remove log output on success unless -s is specified
test.py: do not store entire log output in junit report.
The introduction of rsyslog had two errors in it.
Both errors are non fatal and the docker still works,
however, the system is left in a wrong state in which
supervisord marks rsyslogd service as failed (after several
failed retry attempts). Another bug in the configuration
causes rsyslog to output an error.
1) An inclusion command from a newer version was used
in rsyslogs main configuration file. This caused to rsyslog
to complain during startup but it didn't do much damage since
rsyslog converts every unrecognised command to a message command.
2) in the supervisord definition of the service, rsyslogd is ran
without the -n option which means it defaults to automatically
switch to the background. Supervisord interpret this as an unexpected
process termination and retries to start the process (unsuccessfully
because rsyslog protects itself from having multiple processes of
itself) and eventually marks it as down although it is fully up and
running.
This commit fixes both configuration problems.
Tests: Build and run docker and validate the errors are gone.
Fixes#5937
Currently, you have to build the relocatable package tarball with
./reloc/build_reloc.sh to be able to build an RPM out of it. You need to
do this because RPMS require SHA1 build-ids, but the build system does
not enforce that.
To prepare for adding RPM target to the ninja build, let's switch to
SHA1 build ID conditionally, because the performance difference between
xxhash and SHA1 is neglible. Rafael Avila de Espindola writes:
[...] the sha1 implementation in current lld is pretty fast. Linking
release scylla the times I get are
lld in fedora
fast 2.83739
sha1 3.51990
current lld
fast 2.6936
sha1 2.90250
And the sha1 implementation might get even faster:
https://bugs.llvm.org/show_bug.cgi?id=44138.
Message-Id: <20200303131806.22422-1-penberg@scylladb.com>
This reverts commit c6ddd21c50.
Uses database& instance across shards, which causes repair writer to
use the table object from the wrong shard.
Fixes#5907
"
This set cleans sstable_writer_config and surrounding sstables
code from using global storage_ and feature_ service-s and database
by moving the configuration logic onto sstables_manager (that
was supposed to do it since eebc3701a5).
Most of the complexity is hidden around sstable_writer_config
creation, this set makes the sstables_manager create this object
with an explicit call. All the rest are consequences of this change.
Tests: unit(debug), manual start-stop
"
* 'br-clean-sstables-manager-2' of https://github.com/xemul/scylla:
sstables: Move get_highest_supported_format
sstables: Remove global get_config() helper
sstables: Use manager's config() in .new_sstable_component_file()
sstable_writer_config: Extend with more db::config stuff
sstables_manager: Don't use global helper to generate writer config
sstable_writer_config: Sanitize out some features fields initialization
sstable_writer_config: Factor out some field initialization
sstables: Generate writer config via manager only
sstables: Keep reference on manager
test: Re-use existing global sstables_manager
table: Pass sstable_writer_config into write_memtable_to_sstable
Change rjson::get() to take std::string_view, instead of RapidJson's
version of that type, "StringRef". We already did the same change for
rjson::find() in a previous patch.
Not only is std::string_view more convenient for potential callers in Scylla,
this change also avoids a bug in FindMember() on StringRef where the length
is ignored (and instead, null-termination of the string is assumed).
This patch doesn't require any changes to callers, because we actually
had just a handful of remaining callers (most call sites switched to
rjson::find()), and all of them used string constants which could be
implicitly converted to StringRef or std::string_view just the same.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200303161019.1456-1-nyh@scylladb.com>
This patch adds a rjson::remove_member() wrapper to the RemoveMember
method, which takes a std::string_view. But beyond the convenience, this
actually works around a subtle bug in RemoveMember where, if given a
StringRef parameter, ignores its length (see upstream issue
https://github.com/Tencent/rapidjson/issues/1649).
In the one place we used RemoveMember, it forced us to copy the string
because it wasn't null-terminated. The solution proposed here involves
wrapping the string view in a GenericValue - which no longer needs to copy
the string, but still works around the bug.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200303143524.28300-1-nyh@scylladb.com>
Our rjson::find() convenience function used RapidJson's "StringRef" type,
which is almost exactly like std::string_view. If we switch to use
string_view as we do in this patch, a lot of call sites become much simpler.
Moreover, there was an even more important motivation for this patch:
the RapidJson FindMember() function we used in rjson::find() has a bug when
given a StringRef - although a StringRef contains a length, the FindMember()
code ignores it and expects the string to be null-terminated (see:
https://github.com/Tencent/rapidjson/issues/1649). In this patch, we wrap
the pointer and length of a std::string_view in an rjson::value, a code path
which bypasses the FindMember bug, and yet does not require copying the
string.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200303141814.26929-1-nyh@scylladb.com>
Merged pull request https://github.com/scylladb/scylla/pull/5940 from
Kamil Braun:
Add a bunch of new structs describing a change made
to a table, and an extract_changes function which takes a mutation and
returns the set of changes contained in this mutation, separated by
timestamp and ttl.
Add a split function which uses extract_changes to split a mutation into separate mutations, each describing a single change.
Static rows are put into separate changes now.
The pre_image_select function was fixed to select pre_image data always when
there is a static row/clustered row change, even if there were e.g. additional
range tombstones.
Fixes: #5719.
Tests: unit(dev)
When a node is overloaded requests usually start to queue up. Timeouts
are supposed to prevent queues from exploding and causing an OOM. One
prominent queue that tends to explode is the smp queue as it didn't
support timeouts and so requests would sit in the queue until the target
shard would process them. If the target shard is heavily overloaded
requests might accumulate faster then they are processed, surely leading
to an OOM.
To prevent this use the recently introduces timeout to
`seastar::smp::submit_to()` and derived APIs to time out write requests
sitting in the smp queue. We simply use the request's own timeout
for this purpose.
Fixes: #5055
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200303131658.741720-1-bdenes@scylladb.com>
If the mutation contains separate logical changes (e.g. with different
timestamps and/or ttls), it will be split into multiple mutations, each
passed into transform.
Previously we wouldn't retrieve the preimage if the mutation contained
something different than static/clustered row updates, e.g. if it
contained a partition deletion.
However, there are mutations created from batch statements which can
contain both a partition deletion and a set of row updates with a later
timestamp. We want to retrieve the preimage too in this case.
This commit introduces a bunch of new structs describing a change made
to a table, and an `extract_changes` function which takes a mutation and
returns the set of changes contained in this mutation, separated by
timestamp and ttl.
The function checks if there are multiple timestamps and/or ttls inside
a mutation, which means separate changes should be created for this
mutation in CDC.
Merged pull request https://github.com/scylladb/scylla/pull/5910 by
Calle Wilund:
Rename metadata and data columns according to new spec
Also use transformation methods for names in all code + tests
to make switching again easier
Break up data column tuple
Data column is now pure frozen original type.
If column is deleted (set to null), a metadata column cdc$deleted_ is set to true, to distinguish null column == not involved in row operation
For non-atomic collections, a cdc$deleted_elements_ column is added, and when removing elements from collection this is where they are shown.
For non-atomic assign, the "cdc$deleted_" is true, and is set to new value.
column_op removed.
The header sits in many other headers, but there's a handy
schema_fwd.hh that's tiny and contains needed declarations
for other headers. So replace shema.hh with schema_fwd.hh
in most of the headers (and remove completely from some).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200303102050.18462-1-xemul@scylladb.com>
According to "new" spec:
Data column is now pure frozen original type.
If column is deleted (set to null), a metadata column
cdc$deleted_<name> is set to true, to distinguish
null column == not involved in row operation
For non-atomic collections, a cdc$deleted_elements_<name>
column is added, and when removing elements from collection
this is where they are shown.
For non-atomic assign, the "cdc$deleted_<name>" is true,
and <name> is set to new value.
column_op removed.
`query_result_builder` is movable but if you actually try to move it
after it having consumed some fragments it will blow up in your face
when you try to use it again. This is because its `mutation_querier`
member received a reference to its `query::result::partition_writer`. Of
course the reference to the latter was invalidated on move so the former
accessed invalid memory. Since `query::result::partition_writer` wasn't
actually used for anything other, just move it into the
`mutation_querier`, making `query_result_builder` actually safe to move.
Fixes: #3158
Message-Id: <20190830142601.51488-1-bdenes@scylladb.com>
The function in question uses future-based .when_enabled() subscription
on cluster_supports_truncation_table feature. This method is considered
to be unsafe, so here's the patch that changes it onto feature::listener.
The completion of the migration is only awaited by a single test, so
this waiting mechanism is also slightly simplified.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Support LIKE operator condition on column expressions.
NOTE: following the existing code, the LIKE pattern value is converted
to raw bytes and passed straight as bytes_view to like_matcher
without type checking; it should be checked/sanitized by caller.
Refs: #5777
Branch URL: https://github.com/alecco/scylla/tree/as_like_condition_2
Tests: unit ({dev}), unit ({debug})
NOTE: fail for unrelated test test_null_value_tuple_floating_types_and_uuids
Extracting a certain element from a collection is a common task I have
to do while debugging cores. For certain collections (c-array,
std::array) this is trivial, for others it is easy enough (std::vector),
but for some (std::list) this is a tiresome work-intensive process.
This convenience function allows getting a reference to any element of
the supported container types, returning them for further use in the
interactive session.
Currently only `std::list` and `std::vector` are supported.
To be a generic convenience function for dereferencing all sorts of
smart pointers. For now `std::unique_ptr`, `seastar::lw_shared_ptr` and
`seastar::foreign_ptr` are supported.
`table` is not registered with the database, and hence will not be
waited on during shutdown.
Stop it explicitly to prevent any asynchronous operation on it racing
with shutdown.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200302142845.569638-1-bdenes@scylladb.com>
* Mark cql3::attributes::raw class as final
* Change every occurrence of ::shared_ptr<attributes::raw>
to std::unique_ptr<...>
* Mark all methods in cql3::attributes::raw as const
* Remove redundant "_attrs" ptr copy in insert_json_statement,
use one from raw::modification_statement
* Fix odd indentation in cql3/statements/update_statement.cc
Tests: unit-tests (dev, debug)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200301223708.99883-1-pa.solodovnikov@scylladb.com>
_get_next() was recursively calling itself with index - 1 if index was
> 0. When we reached the desired element we always tried to use
member_types[0] as the type, which is incorrect since member_types
contains all types and doesn't change in get().
Fix by replacing recursion with iteration so that we keep the original
index.
Reviewed-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <1582900804-18681-1-git-send-email-tgrabiec@scylladb.com>
When wrapping rapidjson routines with safer, yieldable code,
parsing information was lost, because the JSON reader was not
checked for parsing errors before further processing.
That resulted in nearly all parsing errors being reduced to
"Assertion failed: StackSize() != 1". After this patch,
all various errors (missing quotations, colons, object names,
etc.) are properly returned for the user.
Message-Id: <968ce2f7539bf33d3eb829f0ab431b788d291602.1583134221.git.sarna@scylladb.com>
When running the Alternator tests, we don't care about verifying the
pedigree of the SSL certificates - we actually know the ones we use
in our test setups are fake, and not signed by any respectable certificate
authority.
We already use "verify=False" in many requests to avoid the certificate
checking, but then we start getting scary-looking warning messages about
an "Unverified HTTPS request is being made.". There's a way to disable
these warnings, but we only did in some cases, and there were still some
tests that show these warnings. Let's do it once, in a way that affects
all tests.
Message-Id: <20200301175607.8841-1-nyh@scylladb.com>
When incoming mutation contains live row marker the `operation` is
described as "insert", not as an "update".
Also, I extended the test case "test_row_delete" with one insert,
which is expected to log different value of `operation` than update
or delete. Renamed the test case accordingly.
Test cases that relied on "update" being the same as "insert" are
updated accordingly (`test_pre_image_logging`, `test_cdc_across_shards`,
`test_add_columns`).
Fixes#5723
"
Timeouts defaulted to `db::no_timeout` are dangerous. They allow any
modifications to the code to drop timeouts and introduce a source of
unbounded request queue to the system.
This series removes the last such default timeouts from the code. No
problems were found, only test code had to be updated.
tests: unit(dev)
"
* 'no-default-timeouts/v1' of https://github.com/denesb/scylla:
database: database::query*(), database::apply*(): remove default timeouts
database: table::query(): remove default timeout
mutation_query: data_query(): remove default timeout
mutation_query: mutation_query(): remove default timeout
multishard_mutation_query: query_mutations_on_all_shards(): remove default timeout
reader_concurrency_semaphore: wait_admission(): remove default timeout
utils/logallog: run_when_memory_available(): remove default timeout
Both rapidjson library and DynamoDB induce enough corner cases
for incorrect JSON, that the simplest way out is to simply
conform back to ValidationException in all cases.
This commit comes with an updated test, which is now aware
of 3 possible outcomes for an incorrect JSON:
a ValidationException, a SerializationException and HTTP 404.
Message-Id: <5e39d2dc077f4ea5ce360035a4adcddaf3a342a0.1582876734.git.sarna@scylladb.com>
This reverts commit 65aadad9a6. It causes
crashes (due to the coredump test) during package install, since scylla_coredump_setup
is called from rpm postinstall. The test should be done only from scylla_setup (and
the user should be warned).
Fixes#5916.
"
As part of avoiding static initialization order problems I want to
switch a few global sstring to constexpr std::string_view. The
advantage being that a constexpr variable doesn't need runtime
initialization and therefore cannot be part of a static initialization
order problem.
In order to do the conversion I needed to convert a few APIs to use
std::string_view instead of sstring and const sstring&.
These patches are the simple cases that are also an improvement in
their own right.
"
* 'espindola/string_view' of https://github.com/espindola/scylla: (22 commits)
test: Pass a string_view to create_table's callback
Pass string_view to the schema constructor
cql3: Pass string_view to the column_specification constructor
Pass string_view to keyspace_metadata::new_keyspace
Pass string_view to the keyspace_metadata constructor
utils: Use std::string as keys in nonstatic_class_registry
utils: Pass a string_view to class_registry::to_qualified_class_name
auth: Return a string_view from authorizer::qualified_java_name
auth: Return a string_view from authenticator::qualified_java_name
utils: Pass string_view to is_class_name_qualified
test: Pass a string_view to create_keyspace
Pass string_view to no_such_column_family's constructor
perf_simple_query: Pass a string_view to make_counter_schema
Pass string_view to the schema_builder constructor
types: Add more data_value constructors
transport: Pass a string_view to cql_server::connection::make_autheticate
transport: Pass a string_view to cql_server::response::write_string
cql3: Pass std::string_view to query_processor::compute_id
cql3: Remove unused variable
cql3: Pass a string_view to cf_statement::prepare_keyspace
...
One of them uses global storage_proxy instance, but since
it is not used -- remove it not to encourage anybody to start
calling one.
Another call uses the db.find_keyspace to check if a keyspace
exists, while there's a nicer db.has_keyspace helper (which
doesn't throw exceptions) so use it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200228123644.13931-1-xemul@scylladb.com>
This moves sstring copies from the callers to the constructor
implementation.
While at it, move the implementation out-of-line.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The sstring::compare functions was never updated to work with
std::string_view. We could fix that, but it seems better to just
switch to std::string.
With a working compare function we can avoid copying the argument
passed to to_qualified_class_name when an entry is found in the map.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This gives more flexibility to the implementations as they now don't
need to construct a sstring.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This gives more flexibility to the implementations as they now don't
need to construct a sstring.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
With this we can construct a data_value from any string type. This
also avoids a few sstring copies.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This avoids a copy in the callers. While at it, also make this
function non-virtual since it is never overwritten.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This moves the string copy from the callers to the implementation of
to_internal_name.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
"
Reverse queries work by reading an entire partition into memory, then
start emitting its rows in reverse order. It is easy to see how this can
lead to disasters combined with large partitions. In fact a handful of
such reverse queries on large partitions is enough to bring a node down.
To prevent this, abort reverse queries, when we find out that the size
of the partition is larger than a limit. This might be annoying to users,
but I'm sure it is not as annoying as their nodes going down.
The limit is configurable via `max_memory_for_unlimited_query`
configuration option, which is 1MB by default. This limit is propagated
to each table, system tables having no limit. This limit is planned to
be used by other queries capable of consuming unlimited amount of
memory, like unpaged queries. Not in this series.
The proper solution would be to read the data in reverse (#1413), but
that is a major effort. In the meanwhile make sure the unsuspecting user
won't bring their nodes down with an innocent looking ordering
directive.
Note that for calculating the memory footprint of the
partition-in-question, only the clustering rows are used. This should be
fine, the 1MB limit is conservative enough that an eventual overshoot
caused by the omitted range tombstones and the static row would not make
a big difference.
Fixes: #5804
"
* 'limit-reverse-query-memory-consumption/v3' of https://github.com/denesb/scylla:
flat_mutation_reader: make_reversing_reader(): add memory limit
db/config: add config memory limit of otherwise unlimited queries
utils::updateable_value: add operator=(T)
flat_mutation_reader: expose reverse reader as a standalone reader
A static constructor was used to initialize update_row_query. That
constructor would call meta::roles_table::qualified_name() which would
access AUTH_KS which is also initialized by a static constructor in
another file, so the construction order is not guaranteed.
This change turns update_row_query into a function with a static local
variable in it. The static local is initialized at first use, fixing
the problem.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200227163916.19761-1-espindola@scylladb.com>
Merged patch series by Piotr Sarna:
This series makes json parsing yieldable in order
to prevent reactor stalls. It's done by:
1. Extracting the parsing stage out of alternator executor
2. Moving the parsing stage to a separate service,
which uses a static seastar thread (parallelism: 1)
3. Wrapping rjson parsing routines with a yieldable parser,
which takes advantage of running in a seastar thread
and occasionally performs maybe_yield()
Step 2 above is only used for JSON's big enough to potentially
create stalls - small requests will be parsed immediately,
without being redirected to a static thread.
Handling a PutItem operation with large JSONs
on my machine takes approximately:
1MB doc: ~30ms
3MB doc: ~90ms
12MB doc: ~350ms
out of which parsing itself is around:
1MB doc: ~7ms
3MB doc: ~20ms
12MB doc: ~80ms
(bonus: 400KiB doc: ~2ms)
; the document was a single object full of small items,
which triggers many allocations during parsing.
The above numbers were roughly the same before and after
the series, but the 12MB document did not cause reactor
stalls after the patch.
Note: writing the JSON can still be a source of stalls,
especially for large documents.
Note2: DynamoDB limits single value size to 400KiB,
but for batches it will be 16MiB total request size
Note3: If parallelism ever proves to be an issue,
it's easily increasable by spawning more static threads.
Refs: #5742
Tests: alternator(local)
manual
Piotr Sarna (12):
alternator: break lines in server callbacks
alternator: allow moving the request from rmw operation
alternator: move parsing in front of executor
alternator: convert parse to std::string_view
alternator: implement json parser inside the server
alternator: remove rjson::parse_raw
alternator: make rjson yieldable in thread context
alternator: fix returning raw JSON errors
alternator: change json errors class to SerializationException
alternator-test: rename large requests test to 'manual requests'
alternator-test: extract getting signed request helper
alternator-test: add tests for incorrect JSON documents
...ge_requests.py => test_manual_requests.py} | 53 +++--
alternator/executor.cc | 203 ++++++++----------
alternator/executor.hh | 33 +--
alternator/rjson.cc | 47 +++-
alternator/rjson.hh | 7 +-
alternator/rmw_operation.hh | 1 +
alternator/serialization.cc | 9 +-
alternator/server.cc | 111 ++++++++--
alternator/server.hh | 20 +-
9 files changed, 310 insertions(+), 174 deletions(-)
rename alternator-test/{test_large_requests.py => test_manual_requests.py} (70%)
When has_relevant_range_on_this_shard() found a relevant range, it will unnecessarily
iterate through the end. Verified manually that this could be thousands of pointless
iterations when streaming data to a node just added. The relevant code could be
simplified by de-futurizing it but I think it remains so to allow task scheduler
to preempt it if necessary.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200220224048.28804-2-raphaelsc@scylladb.com>
This test suite can then be the parent of tests which use custom,
potentially not validated input in order to test alternator
against data not easy to push via boto3 or Python, due to their
implementation details.
Previously, alternator server was not directly sharded - and instead
kept a helper http server control class, which stored sharded http
server inside. That design is confusing and makes it hard to expand
alternator server with new sharded attributes, so from now on
the alternator server is itself sharded<>.
Tests: alternator-test(local, smp==1&smp==4)
Fixes#5913
Message-Id: <b50e0e29610c0dfea61f3a1571f8ca3640356782.1582788575.git.sarna@scylladb.com>
In order to be consistent with DynamoDB - a parsing error on incorrect
JSON input is reported as SerializationException instead of
ValidationException.
A couple of places in executor code leaked raw JSON errors to the user
instead of formulating a proper ValidationException message.
These places are now fixed, and the next patch in this series will
act as a regression checker, since all JSON errors will be returned
as SerializationException, not ValidationException instances.
In order to fight reactor stalls, rjson parsing and writing
routines can now yield if they run in seastar thread context.
In order to run a yieldable version of the parser which needs
to be run in seastar thread context, use parse_yieldable()
instead of parse().
The json parser runs in a static thread which accepts and parses
documents. Documents smaller than a parsing threshold
(currently: 16KiB) will be parsed in place without yielding.
The assumption is that most alternator requests are small
and there's no need to parse them in a yieldable way,
which also induces overhead. For reference, parsing a 128KiB
document made of many small objects with rapidjson takes
around 0.5 millisecond, and a 16KiB document is parsed
in around 0.06ms - a value small enough not to disturb
Seastar's current value of 0.5ms task quota too much.
Parsing a request string into JSON happens as a first thing
in every request, so it can be performed before calling
any executor callbacks. The most important thing however,
is that making parsing a separate stage allows certain optimizations,
e.g. running all parsing in a single seastar thread, which allows
adding yields to rjson parsing later.
If the reversing requires more memory than the limit, the read is
aborted. All users are updated to get a meaningful limit, from the
respective table object, with the exception of tests of course.
We have a few kind of queries whose memory consumption is not limited at
all. One of these is reverse queries, which reads entire partitions into
memory, before reversing them. These partitions can be larger than
memory and thus such a query can single-handedly cause OOM.
This patch introduces a configuration for a memory limit for such
queries. This will serve as a hard limit and queries which attempt to
use more memory than this, will be aborted.
The limit is propagated to table objects, with the intention of keeping
system tables unlimited. These tables are usually small and initiators
of system queries are not prepared for failures.
Currently reverse reads just pass a flag to
`flat_mutation_reader::consume()` to make the read happen in reverse.
This is deceptively simple and streamlined -- while in fact behind the
scenes a reversing reader is created to wrap the reader in question to
reverse partitions, one-by-one.
This patch makes this apparent by exposing the reversing reader via
`make_reversing_reader()`. This now makes how reversing works more
apparent. It also allows for more configuration to be passed to the
reversing reader (in the next patches).
This change is forward compatible, as in time we plan to add reversing
support to the sstable layer, in which case the reversing reader will
go.
No reason to disallow this. We still forbid mixing LIKE and non-LIKE
relations on the same column.
Fixes#5902.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
The CMake build system in seastar.git exports the package to CMake
package registry. However, we don't use it when building from scylla.git
(we link to seastar directly) and get the following warning when
building with "dbuild" (that does not bind mount $HOME/.cmake):
CMake Warning at CMakeLists.txt:1180 (export):
Cannot create package registry file:
/home/penberg/.cmake/packages/Seastar/3b6ede62290636bbf1ab4f0e4e6a9e0b
No such file or directory
Let's just disable the package registry for our builds by setting the
CMAKE_EXPORT_NO_PACKAGE_REGISTRY CMake option as discussed here to make
the warning go away:
https://cmake.org/cmake/help/v3.4/variable/CMAKE_EXPORT_NO_PACKAGE_REGISTRY.html
Message-Id: <20200227092743.27320-1-penberg@scylladb.com>
To install scylla using install.sh easily, we need to run following things:
- add scylla user/group
- configure scylla.yaml
- run scylla_post_install.sh
But we don't want to run them when we build .rpm/.deb package,
we also need to add --packaging option to skip them.
Fixes#5830
Instead of keeping the LIKE pattern in a restriction object (as we
currently do), keep it in like_matcher. Also move the
pattern-idempotence check from the restriction to the matcher.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
"
Here is a simple introduction to the node operations scylla supports and
some of the issues.
- Replace operation
It is used to replace a dead node. The token ring does not change. It
pulls data from only one of the replicas which might not be the
latest copy.
- Rebuild operation
It is used to get all the data this node owns form other nodes. It
pulls data from only one of the replicas which might not be the
latest copy.
- Bootstrap operation
It is used to add a new node into the cluster. The token ring
changes. Do no suffer from the "not the latest replica” issue. New
node pulls data from existing nodes that are losing the token range.
Suffer from failed streaming. We split the ranges in 10 groups and we
stream one group at a time. Restream the group if failed, causing
unnecessary data transmission on wire.
Bootstrap is not resumable. Failure after 99.99% of data is streamed.
If we restart the node again, we need to stream all the data again
even if the node already has 99.99% of the data.
- Decommission operation
It is used to remove a live node form the cluster. Token ring
changes. Do not suffer “not the latest replica” issue. The leaving
node pushes data to existing nodes.
It suffers from resumable issue like bootstrap operation.
- Removenode operation
It is used to remove a dead node out of the cluster. Existing nodes
pulls data from other existing nodes for the new ranges it own. It
pulls from one of the replicas which might not be the latest copy.
To solve all the issues above. We could use repair based node operation.
The idea behind repair based node operations is simple: use repair to
sync data between replicas instead of streaming.
The benefits:
- Latest copy is guaranteed
- Resumable in nature
- No extra data is streamed on wire
E.g., rebuild twice, will not stream the same data twice
- Unified code path for all the node operations
- Free repair operation during bootstrap, replace operation and so on.
Fixes: #3003Fixes: #4208
Tests: update_cluster_layout_tests.py + replace_address_test.py + manual test
"
* 'repair_for_node_ops' of https://github.com/asias/scylla:
docs: Add doc for repair_based_node_ops
storage_service: Enable node repair based ops for bootstrap
storage_service: Enable node repair based ops for decommission
storage_service: Enable node repair based ops for replace
storage_service: Enable node repair based ops for removenode
storage_service: Enable node repair based ops for rebuild
storage_service: Use the same tokens as previous bootstrap
storage_service: Add is_repair_based_node_ops_enabled helper
config: Add enable_repair_based_node_ops
repair: Add replace_with_repair
repair: Add rebuild_with_repair
repair: Add do_rebuild_replace_with_repair
repair: Add removenode_with_repair
repair: Add decommission_with_repair
repair: Add do_decommission_removenode_with_repair
repair: Add bootstrap_with_repair
repair: Introduce sync_data_using_repair
repair: Propagate exception in tracker::run
* seastar 7a3b4b4e4e...affc3a5107 (6):
> Merge "Add the possibility to remove rules from routes" from Pavel
> stall_detector: expose correct clock type to use
> queue: add has_blocked_consumer() function
> Merge "core: reduce memory use for idle connections" from Avi
> testing: Enable abort_on_internal_error on tests
> core: Add a on_internal_error helper
When we test Alternator on its HTTPS port (i.e., pytest --https),
we don't want requests to verify the pedigree of the SSL certificate.
Our "dynamodb" fixture (conftest.py) takes care of this for most of
the tests, but a few tests create their own requests and need to pass the
"verify=False" option on their own. In some tests, we forgot to do
this, and this patch fixes three tests which failed with "pytest --https".
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200226142330.27846-1-nyh@scylladb.com>
Merged pull request https://github.com/scylladb/scylla/pull/5897
from Juliusz Stasiewicz:
Column operation now contains operation::row_delete (== 2)
after queries like delete from tbl where pk=x and ck=y;. Before
this patch row deletes were treated as updates, which was incorrect
because updates do not contain row tombstones (and row deletes do).
Refs #5709
Column `operation` now contains `operation::row_delete` (== 2)
after queries like `delete from tbl where pk=x AND ck=y;`. Before
this patch row deletes were treated as updates, which was incorrect
because updates do not contain row tombstones (and row deletes do).
Refs #5709
Merged patch series from Piotr Sarna:
Alternator shutdown routines were only registered in main.cc,
but it's not enough - other operations, like decommision,
also rely on shutting down client servers.
In order to remedy the situation, a notion of client shutdown
listeners is introduced to storage service.
A shutdown listener implements a callback used by the storage
service when client servers need to shut down, and at the same
time it does not force storage service to keep a reference
for the client service itself.
NOTE: the interface can also be used later to provide
proper shutdown routines for redis and any other future APIs.
Fixes#5886
Tests: alternator-test(local, including a shutdown during the run)
Piotr Sarna (4):
storage_service: make shutdown_client_servers() thread-only
storage_service: add client shutdown hook
main: make alternator shutdown hook-based
main: reduce scope of alternator services
main.cc | 18 +++++++++---------
service/storage_service.cc | 22 +++++++++++++++++-----
service/storage_service.hh | 15 ++++++++++++++-
3 files changed, 40 insertions(+), 15 deletions(-)
On some environment systemd-coredump does not work with symlink directory,
we can use bind-mount instead.
Also, it's better to check systemd-coredump is working by generating coredump.
Fixes#5753
With the new shutdown routines in place, alternator executor
and server do not need to be declared outside of the `if` clause
which conditionally sets up alternator.
In order to properly handle not only shutdown, but also
decommission, drain and similar operations, alternator
shutdown is now registered as a client shutdown hook,
which allows storage service to trigger its shutdown routines.
Fixes#5886
The shutdown hook interface can be used later by additional
client interfaces (e.g. alternator, redis) to register
shutdown routines for various operations: Scylla shutdown,
node decommission, drain, etc. It also decouples
the services themselves from being part of the storage
service, since it's huge enough as it is.
Until now, PutItem or UpdateItem could be used to insert almost any JSON
as an attribute's value - even those that do not match DynamoDB's typed
value specification.
Among other things, the new validation allows us to reject empty sets,
strings or byte arrays - which are (somewhat artificially) forbidden in
DynamoDB.
Also added tests for the empty sets, strings and byte arrays that should
be rejected.
Fixes#5896
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200225150525.4926-1-nyh@scylladb.com>
DynamoDB does not support empty sets. Operations which remove elements
from a set attribute should remove the attribute when the last item is
removed - not leave an empty set as it incorrectly does now.
Incidentally, the same patch fixes another bug - deleting elements from
a non-existent set attribute should be allowed (and do nothing), not fail
as it does now.
This patch also includes tests for both bugs.
Fixes#5895
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200225125343.31629-1-nyh@scylladb.com>
We have not yet implemented the DELETE-with-value and ADD operations in
UpdateItem's old-style "AttributeUpdates" parameter - see issue #5864
and issue #5893, respectively
This patch include comprehensive tests for both features. The new tests
pass on DynamoDB, but currently xfails on Alternator - until these
features will be implemented.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200225105546.25651-1-nyh@scylladb.com>
Currenly `get_text_range()` uses heuristics about which ELF section
actually contains the text for the main executable. It appears that this
fails from time-to-time and we have to adjust the heuristics.
We don't really have to guess however, a much better method of
determining the section hosting text is to find a vtable pointer and
locate the section it resides in. For this, we use the
`reactor::_backend` as a canary. When this is not available, we fall
back to the pre-existing heuristics.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200225164719.114500-1-bdenes@scylladb.com>
Fixes#5669
This implements non-atomic collection and UDT handling for
both cdc preimage + delta.
To be able to express deltas in a meaningful way (and reconstruct
using it), non-atomic values are represented somewhat
differently from regular values:
* maps - stored as is (frozen)
* sets - stored as is (frozen)
* lists - stored as map<timeuuid, value> (frozen)
this allows reconstructing the list, as otherwise
things like list[0] = value cannot be represented
in a meaningful way
* udt - stored as tuple<tuple<field0>, tuple<field1>...> (frozen)
UDTs are normally just tuples + metadata, but we need to
distinguish the case of outer tuple element == null, meaning
"no info/does not partake in mutation" from tuple element
being a tuple(null) (i.e. empty tuple), meaning "set field to
null"
* seastar 8b6bc659c7...7a3b4b4e4e (3):
> Merge "Add custom stack size to seastar threads" from Piotr
Ref #5742.
> expiring_fifo: Optimize memory usage for single-element lists
Ref #4235.
> Close connection, when reach to max retransmits
The global get_highest_supported_format helper and its declaration
are scattered all over the code, so clean this up and prepare the
ground for moving _sstables_format from the storage_service onto
the sstables_manager (not this set).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Finally, the thing is not used by anyone and can be removed.
This greatly relaxes the sstables -> storage_service dependency.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Applauded-by: Benny Halevy <bhalevy@scylladb.com>
This is the last place left that calls for global get_config(),
switch it onto _sst_manager.config().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The enable_sstable_key_validation and summary_bytes_cost are used
in sstables writing code, keeping them on sstable_writer_config
removes more calls to global get_config().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The main goal of this patch is to stop using get_config() glbal
when creating the sstable_writer_config instance.
Other than being global the existing get_config() is also confusing
as it effectively generates 3 (three) sorts of configs -- one for
scylla, when db config and features are ready, the other one for
tests, when no storage service is at hands, and the third one for
tests as well, when the storage service is created by test env
(likely intentionally, but maybe by coincidence the resulting config
is the same as for no-storage-service case).
With this patch it's now 100% clear which one is used when. Also
this makes half the work of removing get_config() helper.
The db::config and feature_service used to initialize the managers
are referenced by database that creates and keeps managers on,
so the references are safe.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Similar to previous patch -- initialize config fields from features
in configurator, not in default initializers.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The promoted_index_block_size is taken from db config in two places.
Factor this out and, at the same time, stop keeping it as std::optional.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The sstable_writer_config creation looks simple (just declare
the struct instance) but behind the scenes references storage
and feature services, messes with database config, etc.
This patch teaches the sstables_manager generate the writer
config and makes the rest of the code use it. For future
safety by-hands creation of the sstable_writer_config is
prohibited.
The manager is referenced through table-s and sstable-s, but
two existing sstables_managers live on database object, and
table-s and sstable-s both live shorter than the database,
this reference is save.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is needed for further patching. The sstables_manager outlives
all sstables objects, so it's safe.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The sstables_manager in scylla binary outlives the sstables objects
created by it, this makes it possible to add sstable->manager reference
and use it. In unit tests there are cases when sstables::test_env that
keeps manager in _mgr field is destroyed right after sstable creation
(e.g. -- in the boost/sstable_mutation_test.cc ka_sst() helper).
Fix this by chaning the _mgr being reference on the manager and
initialize it with already existing global manager. Few exceptions
from this rule that need to set own large data handler will create
the sstable_manager their own.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The latter creates the config by hands, but the plan is to
create it via sstables_manager. Callers of this helper are the
final frontiers where the manager will be safely accessible.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
- Bootstrap operation
It is used to add a new node into the cluster. The token ring changes.
Do not suffer from the "not the latest replica” issue. New node pulls
data from existing nodes that are losing the token range.
Suffer from failed streaming. We split the ranges in 10 groups and we
stream one group at a time. Restream the group if failed, causing
unnecessary data transmission on wire.
Bootstrap is not resumable. Failure after 99.99% of data is streamed.
If we restart the node again, we need to stream all the data again even
if the node already has 99.99% of the data.
Fixes: #3003Fixes: #4208
Tests: update_cluster_layout_tests.py + replace_address_test.py + manual test
- Decommission operation
It is used to remove a live node form the cluster. Token ring
changes. Do not suffer “not the latest replica” issue. The leaving
node pushes data to existing nodes.
Fixes: #3003Fixes: #4208
Tests: update_cluster_layout_tests.py + replace_address_test.py + manual test
- Replace operation
It is used to replace a dead node. The token ring does not change. It
pulls data from only one of the replicas which might not be the
latest copy.
Fixes: #3003Fixes: #4208
Tests: update_cluster_layout_tests.py + replace_address_test.py + manual test
This patch adds a warning of deprecation to DTCS. In a follow up step,
we will start requiring a flag for it to be enabled to make sure users
notice.
For now we'll just be nice and add a warning for the log watchers.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20200224164405.9656-1-glauber@scylladb.com>
Since we set 'eth0' as default NIC name, we get following error when running scylla_setup in non-interactive mode without --nic parameter:
$ sudo scylla_setup --setup-nic-and-disks --no-raid-setup --no-verify-package --no-io-setup
NIC eth0 doesn't exist.
It looks strange since user actually does not specified 'eth0', they might forget to specify --nic.
I think we should shows up usage, when eth0 is not available on the system.
Fixes#5828
Changes the name of storage_proxy::mutate_hint_from_scratch function to
another name, whose meaning is more clear: send_hint_to_all_replicas.
Tests: unit(dev)
It seems like *.service is conflicting on install time because the file
installed twice, both debian/*.service and debian/scylla-server.install.
We don't need to use *.install, so we can just drop the line.
Fixes#5640
Aggregate functions on counters do not exist. Until now counters
could, at best, fall back to blob->blob overloads, e.g.:
```
cqlsh> select max(cnt) from ks.tbl;
system.max(cnt)
----------------------
0x000000000000000a
(1 rows)
cqlsh> select sum(entities) from ks.tbl;
InvalidRequest: Error from server: code=2200 [Invalid query]
message="Invalid call to function sum, none of its type signatures match
[...]
```
Meanwhile, counters are compatible with bigints (aka. `long_type'),
so bigint overloads can be used on them (e.g. sum(bigint)->bigint).
This is achieved here by a special rule in overload resolution, which
makes `selector' perceive counters as an `EXACT_MATCH' to counter's
underlying type (`long_type', aka. bigint).
Until now, attempts to print counter update cell would end up
calling abort() because `atomic_cell_view::value()` has no
specialized visitor for `imr::pod<int64_t>::basic_view<is_mutable>`,
i.e. counter update IMR type. Such visitor is not easy to write
if we want to intercept counters only (and not all int64_t values).
Anyway, linearized byte representation of counter cell would not
be helpful without knowing if it consists of counter shards or
counter update (delta) - and this must be known upon `deserialize`.
This commit introduces simple approach: it determines cell type on
high level (from `atomic_cell_view`) and prints counter contents by
`counter_cell_view` or `atomic_cell_view::counter_update_value()`.
Fixes#5616
By default, `/usr/lib/rpm/find-debuginfo.sh` will temper with
the binary's build-id when stripping its debug info as it is passed
the `--build-id-seed <version>.<release>` option.
To prevent that we need to set the following macros as follows:
unset `_unique_build_ids`
set `_no_recompute_build_ids` to 1
Fixes#5881
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The official documentation language of Scylla is English, not French.
So correct the word "existant", which appeared several times throughout
Alternator's tests, to "existent".
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200221224221.31237-6-nyh@scylladb.com>
This patch completes the support for the ReturnValues parameter for
the UpdateItem operation. This parameter has five settings - NONE, ALL_OLD,
ALL_NEW, UPDATED_OLD and UPDATED_NEW. Before this patch we already
supported NONE and ALL_OLD - and this patch completes the support for the
three remaining modes: ALL_NEW, UPDATED_OLD and UPDATED_NEW.
The patch also continues to improve test_returnvalues.py with additional
corner cases discovered during the development. After this patch, only
one xfailing test remains - testing updates to nested document paths,
which we do not yet support (even without the ReturnValues parameter).
After this patch, the support of ReturnValues is complete - for all
operations (UpdateItem, PutItem and DeleteItem) and all of its possible
settings.
Fixes#5053
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200221224221.31237-5-nyh@scylladb.com>
The rjson::set_with_string_name() utility function copies the given
string into the JSON key. The existing implementation required that this
input string be an std::string&, but a std::string_view would be fine too,
and I want to use it in new code to avoid yet another unnecessary copy.
Adding the overloads also exposes a few places where things were
implicitly converted to std::string and now cause an ambiguity - and
clearing up this ambiguity also allowed me to find places where this
conversion was unnecessary.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200221224221.31237-4-nyh@scylladb.com>
UpdateItem operations usually need to add a row marker:
* An empty UpdateItem is supposed to create a new empty item (row).
Such an empty item needs to have a row marker.
* An UpdateItem to add an attribute x and then later an UpdateItem
to remove this attribute x should leave an empty item behind.
This means the first UpdateItem needed to add a row marker, so
it will be left behind after the second UpdateItem.
So the existing code always added a row marker in UpdateItem.
However, there is one case where we should NOT create the row marker:
When the UpdateItem operation only has attribute deletions, and nothing
else, and it is applied to a key with no pre-existing item, DynamoDB
does not create this item. So neither should we.
This patch includes a new test for this test_update_item_non_existent,
which passes on DynamoDB, failed on Alternator before this patch, and
passes after the patch.
Fixes#5862.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200221224221.31237-3-nyh@scylladb.com>
In issue #5698 I raised a theory that we might have a bug when
BatchWriteItem is given two writes to the *same* key but in two different
tables. The test added here verifies that this theory was wrong, and
this case already works correctly.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200221224221.31237-2-nyh@scylladb.com>
This series adds an option to the API that supports deleting
a specific table from a snapshot.
The implementation works in a similar way to the option
to specify specific keyspaces when deleting a snapshot.
The motivation is to allow reducing disk-space when using
the snapshot for backup. A dtest PR is sent to the dtest
repository.
Fixes#5658
Original PR #5805
Tests: (database_test) (dtest snapshot_test.py:TestSnapshot.test_cleaning_snapshot_by_cf)
* amnonh/delete_table_snapshot:
test/boost/database_test: adopt new clear_snapshot signature
api/storage_service: Support specifying a table when deleting a snapshot
storage_service: Add optional table name to clear snapshot
* amnonh/delete_table_snapshot:
test/boost/database_test: adopt new clear_snapshot signature
api/storage_service: Support specifying a table when deleting a snapshot
storage_service: Add optional table name to clear snapshot
The error message (silently) changed to "DB index is out of range" the
following commit:
c7a4e694ad
The new error message is part of Redis 4.0, released in 2017, so let's
switch Scylla to use the new one.
Message-Id: <20200211133946.746-1-penberg@scylladb.com>
The patch 759752947b explains why the .column_family method
of this statament implementation must be tuned to calculate
the column_family in some cases. However, to do this the global
storage_proxy is needed.
The proposal is to calculate the column_family in .validate
method, like it's done e.g. for function_statement-s, which
has storage_proxy reference at hands.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Get rid of numerous calls to get_local_stroage_proxy().get_db()
and use the storage proxy argument that's already avaliable in
most of them.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
- Removenode operation
It is used to remove a dead node out of the cluster. Existing nodes
pulls data from other existing nodes for the new ranges it own. It
pulls from one of the replicas which might not be the latest copy.
Fixes: #3003Fixes: #4208
Tests: update_cluster_layout_tests.py + replace_address_test.py + manual test
- Rebuild operation
It is used to get all the data this node owns form other nodes. It
pulls data from only one of the replicas which might not be the
latest copy.
Fixes: #3003Fixes: #4208
Tests: update_cluster_layout_tests.py + replace_address_test.py + manual test
With repair based node operations, we can resume previous failed
bootstrap. In order to do that, we need the bootstrap node uses the same
tokens as previous bootstrap.
Currently, we always use new tokens when we bootstrap, because we need
to stream all the ranges anyway. It does not matter if we use the same
tokens or not.
The rebuild and replace operations are similar because the token ring
does not change for both of them. Add a common helper to do rebuild and
replace with repair. It will be used by rebuild and replace operation
shortly.
It is used to sync data for node operations like bootstrap, decommission
and so on.
Unlike plain repair operation, the user of sync_data_with_repair() can
pass repair_neighbors object to specify the pre-calculated neighbors for
a range. If a mandatory neighbor is not available, the repair will fail
so that the upper layer can fail the node operation.
The original idea of prefixing alternator keyspace names with 'a#'
leveraged the fact that '#' is not a legal CQL character for keyspace
names. The idea is flawed though, since '#' proved to confuse
existing Scylla tools (e.g. nodetool).
Thus, the prefix is changed to more orthodox 'alternator_'.
It is possible to create such keyspaces with CQL as well, but then
the alternator CreateTable request would simply fail, because
the keyspace already exists, which is graceful enough.
Hiding alternator keyspaces and tables from CQL is another issue,
but there are other ways to distinguish them than a non-standard
prefix, e.g. tags.
Fixes#5883
The set_config registers lambdas that need db.local(), so
these routes must be registered after database is started.
Fixes: #5849
Tests: unit(dev), manual wget on API
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200219130654.24259-1-xemul@scylladb.com>
On Debian, we don't add xfsprogs/mdadm on package dependency, install on
scylla_raid_setup script instead.
Since xfsprogs/mdadm only needed for constructing RAID, we can move
dependencies to scylla_raid_setup too.
In order to decrease the developer's time spent on waiting
for boto3 to retry the request many times, the retry count
is configured to be 3.
Two major benefits:
- vastly decrease wait time when debugging a failing test
- for requests which are expected to fail, but return results
not compatible with boto3, execution time is decreased
Tests: alternator-test(local,remote)
Message-Id: <46a3a9344d9427df7ea55c855f32b8f0e39c9b79.1582285070.git.sarna@scylladb.com>
The nr_ranges_streamed denotes the number of ranges streamed
so far, but by the time the sending lambda is called this
counter is already incremented by the number of ranges to be
streamed in this call. And the variable is not used for
anything else but logging.
Fix this by swapping logging with incrementing.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200221101601.18779-1-xemul@scylladb.com>
This patch adds an alternative way to locate sstables by looking at
sstable sets in table objects:
scylla sstables -t
This may be useful for several things. One is to identify sstables
which are not attached to tables.
Another use case is to be able to use the command on older versions of
scylla which don't have sstable tracking.
Message-Id: <1582308099-24563-1-git-send-email-tgrabiec@scylladb.com>
I neither is used, we get the default behavior: only release is built
without stack guards.
With --disable-stack-guards all modes are built without stack guards.
With --enable-stack-guards all modes are built with stack guards.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200222012732.992380-1-espindola@scylladb.com>
The variable in question was used to check that the bootstrap mode
finishes correctly, but it was removed, becase this check was for
self-evident code and thus useless (dbca327b)
Later, the patch was reverted to keep track the bootstrap mode for
API is_cleanup_allowed call (a39c8d0e)
This patch is a reworked combination of both -- the variable is
kept for API sake, but in a much simpler manner.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200221101813.18945-1-xemul@scylladb.com>
The _scheduled_gossip_task timer needs token_metadata and thus should
be stopped before. However, this is not always the case.
The timer is armed in start_gossiping, which is called by storage_service
init_server_without_the_messaging_service_part, and is canceled inside
stop_gossiping, which in turn is called by drain_on_shutdown, which in
turn is registered too late.
If something fails between the internals of the init_server_... and
defered registration of drain_on_shutdown (lots of reasons) the timer is
not stopped and may run, thus accessing the freed token_metadata.
Bandaid this by scheduling stop_gossiping right after the gossiper
instances are created. This can be too early (before storage_service
starts gossiping) or too late (after drain_on_shutdown stops it), but
this function is re-entrable.
Fixes#5844
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200221085226.16494-1-xemul@scylladb.com>
* seastar 2b510220...cdda3051 (10):
> core: discard unused variable / function
> pollable_fd: use boost::intrusive_ptr rather than std::unique_ptr for lifecycle management
> build: check for pthread_setname_np()
> build: link against Threads::Threads
> future: Avoid recursion in do_for_each
> future: Expand description of parallel_for_each
> merge: Add content length limit to httpd
> tests/scheduling_group_test: verify current scheduling group is inherited as expected
> net: return future<> instead of subscription<>
> cmake: be more verbose when looking for libraries
Before this patch, we only supported the ReturnValues=NONE setting of the
PutItem, UpdateItem and DeleteItem operations.
This patch also adds full support for the ReturnValues=ALL_OLD option
in all three operation. This option directs Alternator to return the full
old (i.e., pre-modification) contents of the item.
We implement this as a RMW (read-modify-write) operation just as we do
other RMW operations - i.e., by default we use LWT, to ensure that we really
return the value of the item directly before the modification, the same
value that would have been used in a conditional expression if there was one.
NOTE: This implementation means one cannot use ReturnValues=ALL_OLD in
forbid_rmw write isolation mode. One may theorize that if we only need the
read-before-write for ReturnValues and not for a conditional expression,
it should have been enough to use a separate read (as we do in unsafe_rmw
isolation mode) before the write. But we don't have this "optimization" yet
and I'm not sure it's a valid optimization at all - see discussion in
a new issue #5851.
This patch completes the ReturnValues support for the PutItem and DeleteItem
operations. However, the third operation, UpdateItem, supports three more
ReturnValues modes: UPDATED_OLD, ALL_NEW and UPDATED_NEW. We do not yet
support those in this patch. If a user tries to use one of these three modes,
an informative error message will be returned. The three tests for these
three unimplemented settings continue to xfail, but the rest of the tests
in test_returnvalues.py (except one test of nested attribute paths) now
pass so their xfail flag is dropped.
Refs #5053
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200219135658.7158-1-nyh@scylladb.com>
Row cache needs to be invalidated whenever data in sstables
changes. Cleanup removes data from sstables which doesn't belong to
the node anymore, which means cache must be invalidated on cleanup.
Currently, stale data can be returned when a node re-owns ranges which
data are still stored in the node's row cache, because cleanup didn't
invalidate the cache."
Fixes#4446.
tests:
- unit tests (dev mode)
- dtests:
update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_decommission_node_2_test
cleanup_test.py
* Pass raw::select_statement::parameters as lw_shared_ptr
* Some more const cleanups here and there
* lists,maps,sets::equals now accept const-ref to *_type_impl
instead of shared_ptr
* Remove unused `get_column_for_condition` from modification_statement.hh
* More methods now accept const-refs instead of shared_ptr
Every call site where a shared_ptr was required as an argument
has been inspected to be sure that no dangling references are
possible.
Tests: unit(dev, debug)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200220153204.279940-1-pa.solodovnikov@scylladb.com>
Due to a bug the entire segment is written in one huge write of 32Mb.
The idea was to split it to writes of 128K, so fix it.
Fixes#5857
Message-Id: <20200220102939.30769-1-gleb@scylladb.com>
There may be other commitlog writes waiting for zeroing to complete, so
not using proper scheduling class causes priority inversion.
Fixes#5858.
Message-Id: <20200220102939.30769-2-gleb@scylladb.com>
Row cache needs to be invalidated whenever data in sstables changes. Cleanup removes
data from sstables which doesn't belong to the node anymore, which means cache must
be invalidated on cleanup.
Currently, stale data can be returned when a node re-owns ranges which data are still
stored in the node's row cache, because cleanup didn't invalidate the cache.
To prevent data that belongs to the node from being purged from the row cache, cleanup
will only invalidate the cache with a set of token ranges that will not overlap with
any of ranges owned by the node.
update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_decommission_node_2_test
now passes.
Fixes#4446.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This procedure will calculate ranges for cache invalidation by subtracting
all owned ranges from the sstables' partition ranges. That's done so as
to reduce the size of invalidated ranges.
Refs #4446.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
In order to avoid stack overflow issues represented by the attached
test case, rapidjson's parser now has a limit of nested level.
Previous iterations of this patch used iterative parsing
provided by rapidjson, but that solution has two main flaws:
1. While parsing can be done iteratively, printing the document
is based on a recursive algorithm, which makes the iteratively
parsed JSON still prone to stack overflow on reads.
Documents with depth 35k were already prone to that.
2. Even if reading the document would have been performed iteratively,
its destruction is stack-based as well - the chain of C++ destructors
is called. This error is sneaky, because it only shows with depths
around 100k with my local configuration, but it's just as dangerous.
Long story short, capping the depth of the object to an arguably large
value (39) was introduced to prevent stack overflows. Real life
objects are expected to rarely have depth of 10, so 39 sounds like
a safe value both for the clients and for the stack.
DynamoDB has a nesting limit of 32.
Fixes#5842
Tests: alternator-test(local,remote)
Message-Id: <b083bacf9df091cc97e4a9569aad415cf6560daa.1582194420.git.sarna@scylladb.com>
This patch causes inclusive and exclusive range deletes to be
distinguished in cdc log. Previously, operations `range_delete_start`
and `range_delete_end` were used for both inclusive and exclusive bounds
in range deletes. Now, old operations were renamed to
`range_delete_*_inclusive`, and for exclusive deletes, new operations
`range_delete_*_exclusive` are used.
Tests: unit(dev)
User reported an issue that after a node restart, the restarted node
is marked as DOWN by other nodes in the cluster while the node is up
and running normally.
Consier the following:
- n1, n2, n3 in the cluster
- n3 shutdown itself
- n3 send shutdown verb to n1 and n2
- n1 and n2 set n3 in SHUTDOWN status and force the heartbeat version to
INT_MAX
- n3 restarts
- n3 sends gossip shadow rounds to n1 and n2, in
storage_service::prepare_to_join,
- n3 receives response from n1, in gossiper::handle_ack_msg, since
_enabled = false and _in_shadow_round == false, n3 will apply the
application state in fiber1, filber 1 finishes faster filber 2, it
sets _in_shadow_round = false
- n3 receives response from n2, in gossiper::handle_ack_msg, since
_enabled = false and _in_shadow_round == false, n3 will apply the
application state in fiber2, filber 2 yields
- n3 finishes the shadow round and continues
- n3 resets gossip endpoint_state_map with
gossiper.reset_endpoint_state_map()
- n3 resumes fiber 2, apply application state about n3 into
endpoint_state_map, at this point endpoint_state_map contains
information including n3 itself from n2.
- n3 calls gossiper.start_gossiping(generation_number, app_states, ...)
with new generation number generated correctly in
storage_service::prepare_to_join, but in
maybe_initialize_local_state(generation_nbr), it will not set new
generation and heartbeat if the endpoint_state_map contains itself
- n3 continues with the old generation and heartbeat learned in fiber 2
- n3 continues the gossip loop, in gossiper::run,
hbs.update_heart_beat() the heartbeat is set to the number starting
from 0.
- n1 and n2 will not get update from n3 because they use the same
generation number but n1 and n2 has larger heartbeat version
- n1 and n2 will mark n3 as down even if n3 is alive.
To fix, always use the the new generation number.
Fixes: #5800
Backports: 3.0 3.1 3.2
Previously we required MODIFY permissions on all materialized views in
order to modify a table. This is wrong, because the views should be
synced to the table unconditionally. For the same reason,
users *shouldn't* be granted MODIFY on views, to prevent them manually
changing (and breaking) a view.
This patch removes an explicit permissions check in
modification_statement introduced by 65535b3. It also tests that a
user can indeed modify a table they are allowed to modify, regardless
of lacking permissions on the table's views and indices.
Fixes#5205.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
"
This series ensures the server more often than not initializes
raw_cql_statement, a variable responsible for holding the original
CQL query, and adds logging events to all places executing CQL,
and logs CQL text in them.
A prepared statement object is the third incarnation of
parser output in Scylla:
- first, we create a parsed_statement descendent.
This has ~20 call sites inside Cql.g
- then, we create a cql_statement descendent, at ~another 20 call sites
- finally, in ~5 call sites we create a prepared statement object,
wrapping cql_statement. Sometimes we use cql_statement object
without a prepared statement object (e.g. BATCHes).
Ideally we'd want to capture the CQL text right in the parser, but
due to complicated transformations above that would require
patching dozens of call sites.
This series moves raw_cql_statement from class prepared_statement
to its nested object, cql_statement, batches, and initializes this
variable in all major call sites. View prepared statements and
some internal DDL statements still skip setting it.
"
* 'query_processor_trace_cql_v2' of https://github.com/kostja/scylla:
query_processor: add CQL logging to all major execute call sites.
query_procesor: move raw_cql_statement to cql_statement
query_processor: set raw_cql_statement consistently
In the state of Alternator in docs/alternator/alternator.md, we said that
BatchWriteItem doesn't check for duplicate entries. That is not true -
we do - and we even have tests (test_batch_write_duplicate*) to verify that.
So drop that comment.
Refs #5698. (there is still a small bug in the duplicate checking, so still
leaving that issue open).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200219164107.14716-1-nyh@scylladb.com>
In commit 388b492040, which was only supposed
to move around code, we accidentally lost the line which does
_executor.local()._stats.total_operations++;
So after this commit this counter was always zero...
This patch returns the line incrementing this counter.
Arguably, this counter is not very important - a user can also calculate
this number by summing up all the counters in the scylla_alternator_operation
array (these are counters for individual types of operations). Nevertheless,
as long as we do export a "scylla_alternator_total_operations" metric,
we need to correctly calculate it and can't leave it zero :-)
Fixes#5836
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200219162820.14205-1-nyh@scylladb.com>
It will store the ranges to be invalidated in row cache on compaction
completion. Intended to be used by cleanup compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
compaction_completion_desc will eventually store more information that can be
customized by the compaction type.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This descriptor contain all information needed for table to be properly
updated on compaction completion. A new member will be added to it soon,
which will store ranges to be invalidated in row cache on behalf of
cleanup compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
... when clustering key is unavailable' from Benny
This series fixes null pointer dereference seen in #5794efd7efe cql3: generate_base_key_from_index_pk; support optional index_ck
7af1f9e cql3: do_execute_base_query: generate open-ended slice when clustering key is unavailable
7fe1a9e cql3: do_execute_base_query: fixup indentation
Fixes#5794
Branches: 3.3
Test: unit(dev) secondary_indexes_test:TestSecondaryIndexes.test_truncate_base(debug)
* bhalevy/fix-5794-generate_base_key_from_index_pk:
cql3: do_execute_base_query: fixup indentation
cql3: do_execute_base_query: generate open-ended slice when clustering key is unavailable
cql3: generate_base_key_from_index_pk; support optional index_ck
There has been recently discussed several problems when stopping
migration manager and features.
The first issue is with migration manager's schema pull sleeping
and potentially using freed migration manager instances.
Two others are with freeing database and migration manager before
features they wait for are enabled.
1. Only call base_ck = generate_base_key_from_index_pk<...
if the base schema has a clustering key.
2. Only call command->slice.set_range(*_schema, base_pk, ...
if the base schema has a clustering key,
otherwise just create an open ended range.
Proposed-by: Piotr Sarna <sarna@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When dropping a table, the table and its views are dropped
in parallel, this is not a problem as for itself but we
have mechanism to snapshot a deleted table before the
actual delete. When a secondary index is removed, in the
snapshot process it looks for it's schema for creating the
schema part of the snapshot but if the main table is already
gone it will not find it.
This commit serializes views and main table removals and
removes the views prior to the tables.
See discussion on #5713
Tests:
Unit tests (dev)
dtest - A test that failed on "can't find schema" error
Fixes#5614
* eliran/serialize_table_views_deletion:
Materialized Views: serialize tables and views creation
Materialized Views: drop materialized views before tables
Now the database keeps reference on feature service, so we
can listen on the feature in it directly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The maybe_schedule_schema_pull waits for schema_tables_v3 to
become available. This is unsafe in case migration manager
goes away before the feature is enabled.
Fix this by subscribing on feature with feature::listener and
waiting for condition variable in maybe_schedule_schema_pull.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The get_string_attribute() function used attribute_value->GetString()
to return an std::string. But this function does not actually return a
std::string - it returns a char*, which gets implicitly converted to
an std::string by looking for the first null character. This lookup is
unnecessary, because rjson already knows the length of the string, and
we can use it.
This patch is just a cleanup and a very small performance improvement -
I do not expect it fixes any bugs or changes anything functional, because
JSON strings anyway cannot contain verbatim nulls.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200219101159.26717-1-nyh@scylladb.com>
When called from indexed_table_select_statement::do_execute_base_query,
old_paging_state->get_clustering_key() may return un-engaged
optional<clustering_key>. Dereferencing it unconditionally crashes
scylla as seen in https://github.com/scylladb/scylla/issues/5794Fixes#5794
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The sleep is interrupted with the abort source, the "wait" part
is done with the existing _background_tasks gate. Also we need
to make sure the gate stays alive till the end of the function,
so make use of the async_sharded_service (migration manager is
already such).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This change serializes tables and views creation. The
changes purpose is to avoid future possible races due to
a view searching for its base table information while the
later haven't been created yet.
When dropping a table, the table and its views are dropped
in parallel, this is not a problem as for itself but we
have mechanism to snapshot a deleted table before the
actual delete. When a secondary index is removed, in the
snapshot process it looks for its schema for creating the
schema part of the snapshot but if the main table is already
gone it will not find it.
This commit serializes views and main table removals and
removes the views prior to the tables.
See discussion on https://github.com/scylladb/scylla/pull/5713
Tests:
Unit tests (dev)
dtest - A test that failed on "can't find schema" error
Fixes#5614
De-pointerize cql3 code APIs further: change some call sites
to pass `schema` as const-ref instead of `shared_ptr`.
Affected functions known to be expecting always non-null
pointer to schema and don't store or pass the pointer somewhere
else, assuming it's safe to give them just a reference.
Tests: unit(dev, debug)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200218142338.69824-1-pa.solodovnikov@scylladb.com>
This patch fixes a bug that appears because of an incorrect interaction
between counters and hinted handoff.
When a counter is updated on the leader, it sends mutations to other
replicas that contain all counter shards from the leader. If
consistency level is achieved but some replicas are unavailable, a hint
with mutation containing counter shards is stored.
When a hint's destination node is no longer its replica, it is
attempted to be sent to all its current replicas. Previously, if the
cluster did not have the feature HINTED_HANDOFF_SEPARATE_CONNECTION
enabled, storage_proxy::mutate function would be used for the purpose of
sending the hint. It was incorrect because that function treats
mutations for counter tables as mutations containing only a delta (by
how much to increase/decrease the counter). These two types of mutations
have different serialization format, so in this case a "shards" mutation
is reinterpreted as "delta" mutation, which can cause data corruption to
occur.
This patch fixes the case when HINTED_HANDOFF_SEPARATE_CONNECTION is
disabled, and uses storage_proxy::mutate_internal, which treats "shards"
mutation as regular mutations - which is the correct behavior.
Refs #5833.
Tests: unit(dev)
Refs #817
Truncation is potentially long. It has its own timeout in storage
proxy/rpc. This value should probably also be higher than default
timeout.
Message-Id: <20200218135926.26522-1-calle@scylladb.com>
The clear_snapshot method signature was modified and accept a table name
parameter.
This patch adds an empty table name to the clear_snapshot test so it
would compile and pass.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
There are cases when it is useful to delete specific table from a
snapshot.
An example is when a snapshot is used for backup. Backup can take a long
period of time, during that time, each of the tables can be deleted once
it was backup without waiting for the entire backup process to
completed.
This patch adds such an option to the database and to the storage_service
wrapping method that calls it.
If a table is specified a filter function is created that filter only
the column family with that given name.
This is similar to the filtering at the keyspace level.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch adds additional tests for the ReturnValues feature to make the
test even more comprehensive. As this feature is not yet implemented in
Alternator (see issue #5053), all tests XFAIL on Alternator - except two
tests for the trivial "NONE" mode which is already supported. As usual
all tests pass on DynamoDB.
This patch also splits the tests for the ReturnValues parameter in the
UpdateItem operation into multiple tests, each testing one of the different
modes which DynamoDB supports - NONE, ALL_OLD, UPDATED_OLD, ALL_NEW and
UPDATED_NEW. The separate tests will be useful if we implement this feature
incrementally - so the separate modes can be tested separately.
Refs #5053.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200218085618.5584-1-nyh@scylladb.com>
This patch adds a new document, docs/protocols.md, about all the different
protocols which Scylla supports - and the different ports which they use.
This includes Scylla's internal protocol, user-facing protocols (CQL, Thrift,
DynamoDB, Redis, JMX) and things inbetween (REST API, Prometheus).
I wrote this document after being frustrated that when I see a port number
(e.g., "7000") or a port option name (e.g., "storage_port") it's hard to
figure out what they actually are - or why they are given such strange
names. The intention is that this file can easily be searched for option
names, for more familiar names (e.g., "CQL"), and a reader can get the
whole story - including some pointers to relevant part of the code (this
part of the document can be improved further - in this version this only
exists for the internal protocol).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200217172049.25510-1-nyh@scylladb.com>
* seastar c7c249f67d...2b51022073 (8):
> dns_test: Test with seastar.io instead of www.google.com
> sharded: fix move constructor for peering_sharded_service
Fixes#5814.
> tests: Delete Seastar.dist
> reactor: distinguish structs from classes when befriending
> util/tuple_utils.hh: avoid redundant move
> io_request: do not include fmt/format.h
> reactor: cleanup write_some leftover
> posix: change the signature of accept/try_accept
"
This change set is comprised of several unrelated patches regarding
some cleanups in cql3 layer code.
Most of the changes are aimed at eliminating superfluous `shared_ptr`
usages. In places where it can be safely assumed that objects passed
to the function are considered non-null and constant, these places
were adjusted to use passing as const ref instead.
Other changes incude eliminating unused arguments at some functions
and replacing usages of `shared_ptr<service::pager::paging_state>`
to use `lw_shared_ptr` instead, since `pager::paging_state` is final.
Tests: unit(dev, debug)
"
* 'feature/cql_cleanups_4' of https://github.com/ManManson/scylla:
cql3: minor sweeps through the cql layer code to reduce shared_ptrs count
cql3: change some function signatures to accept const references
cql3: change signatures of several functions to return crefs instead of pointers
cql3: remove unused argument at functions::castas_functions::get
paging_state: switch from shared_ptr to lw_shared_ptr
When replaying a hint with a destination node that is no longer in the
cluster, it will be sent with cl=ALL to all its new replicas. Before
this patch, the MUTATION verb was used, which causes such hints to be
handled on the same connection and with the same priority as regular
writes. This can cause problems when a large number of hints is
orphaned and they are scheduled to be sent at once. Such situation
may happen when replacing a dead node - all nodes that accumulated hints
for the dead node will now send them with cl=ALL to their new replicas.
This patch changes the verb used to send such hints to HINT_MUTATION.
This verb is handled on a separate connection and with streaming
scheduling group, which gives them similar priority to non-orphaned
hints.
Refs: #4712
Tests: unit(dev)
" from Botond
Nodetool scrub rewrites all sstables, validating their data. If corrupt
data is found the scrub is aborted. If the skip-corrupted flag is set,
corrupt data is instead logged (just the keys) and skipped.
The scrubbing algorithm itself is fairly simple, especially that we
already have a mutation stream validator that we can use to validate the
data. However currently scrub is piggy-backed on top of cleanup
compaction. To implement this flag, we have to make scrub a separate
compaction type and propagate down the flag. This required some
massaging of the code:
* Add support for more than two (cleanup or not) compaction types.
* Allow passing custom options for each compaction type.
* Allow stopping a compaction without the manager retrying it later.
Additionally the validator itself needed some changes to allow different
ways to handle errors, as needed by the scrub.
Fixes: #5487
* https://github.com/denesb/nodetool-scrub-skip-corrupted/v7:
table: cleanup_sstables(): only short-circuit on actual cleanup
compaction: compaction_type: add Upgrade
compaction: introduce compaction_options
compaction: compaction_descriptor: use compaction options instead of
cleanup flag
compaction_manager: collect all cleanup related logic in
perform_cleanup()
sstables: compaction_stop_exception: add retry flag
mutation_fragment_stream_validator: split into low-level and
high-level API
compaction: introduce scrub_compaction
compaction_manager: scrub: don't piggy-back on upgrade_sstables()
test: sstable_datafile_test: add scrub unit test
CDC can get all it needs from a config and does not need
partitioner.
For base table specific operations CDC is using partitioner
from that table (obtained with schema::get_partitioner).
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
All the places that use partitioner have been switched
to not use global partitioner any more and we can stop
setting it in this test.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
random_schema already has a _schema field which in turn
has a get_partitioner() function. Store partitioner
in random_schema is redundant.
At the moment all uses of random_schema are based on
default partitioner so it is not necessary to set it
explicitly. If in the future we need random_schema to
work with other partitioners we will add the constructor
back and fix the creation of _schema to contain it. It's
not needed now though.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
parse functions now take const schema& which allows
them to reach a partitioner. It's safe to take schema
by const& because the only caller takes the schema
from an sstable object.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
and replace all calls to dht::global_partitioner().get_token
dht::get_token is better because it takes schema and uses it
to obtain partitioner instead of using a global partitioner.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
and replace all dht::global_partitioner().decorate_key
with dht::decorate_key
It is an improvement because dht::decorate_key takes schema
and uses it to obtain partitioner instead of using global
partitioner as it was before.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Take const schema& as a parameter of shard_of and
use it to obtain partitioner instead of calling
global_partitioner().
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
ring_position_exponential_sharder calls global_partitioner
in one constructor. Luckily the constructor is never used so
we can remove that constructor.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This requires a change in a repair that uses
selective_token_range_sharder.
Repair performs operation on a set of tables. We will have to
make sure that all of that tables use the same partitioner.
This is achieved by adding a check to a repair_info constructor.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Remove ring_position_range_sharder(nonwrapping_range<ring_position>)
which calls another constructor with partitioner obtained with
dht::global_partitioner().
Fix all the places the removed constructor was used and obtain
partitioner from schema instead.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
i_partitioner.hh is widely included while sharders are used
only in 6 places so there's no need to include them in
the whole codebase.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
The next patch is moving sharders to a separate header.
ring_position_exponential_vector_sharder is not used anywhere
so instead of just silently removing it with the move, this
commit is separated to make it clear the class is removed.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
The plan is to remove dht::global_partitioner()
and use schema::get_partitioner() instead.
This will allow a usage of per schema/table partitioner
instead of a single global partitioner everywhere.
Initially schema::get_partitioner will call
dht::global_partitioner. After all the calls
to dht::global_partitioner are switched to
schema::get_partitioner, the ability to set per schema
partitioner will be implemented.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
clustering_interval_set is a rarely used class, but one that requires
boost/icl, which is quite heavyweight. To speed up compilation, move
it to its own header and sprinkle #includes where needed.
Tests: unit (dev)
Message-Id: <20200214190507.1137532-1-avi@scylladb.com>
Merged patch series from Avi Kivity:
token_metadata is a heavyweight class with heavyweight includes
(boost/icl) it is a good candidate for the pimpl pattern, which
this series implements.
Tests: unit (dev)
https://github.com/avikivity/scylla token_metadata-pimplification/v1
Avi Kivity (6):
locator: token_metadata: use non-deduced return type for ring_range()
locator: token_metadata: pimplify
locator: token_metadata: make token_metadata_impl::tokens_iterator a
non-nested class
locator: token_metadata: pimplify tokens_iterator
locator: token_metadata: move implementation classes to .cc
locator: token_metadata: remove unused include "query-request.hh"
locator/token_metadata.hh | 783 +---------------
locator/token_metadata.cc | 1338 ++++++++++++++++++++++++++-
test/boost/sstable_datafile_test.cc | 1 +
3 files changed, 1332 insertions(+), 790 deletions(-)
Message-Id: <20200214184954.1130194-1-avi@scylladb.com>
Merged patch series from Piotr Sarna:
This miniseries implements graceful shutdown for alternator
by introducing two mechanisms:
- refusing to accept new requests during shutdown
by stopping the HTTP/HTTPS server(s)
- guarding pending requests with a gate, so that
when alternator server is stopped, no in-flight
alternator requests are being processed
Fixes#5781
Tests: manual(stopping Scylla in the middle of alternator-test
multiple times, used to crash every time
with local_is_initialized() assertion)
Piotr Sarna (3):
alternator: implement stopping alternator server
alternator: guard pending alternator requests with a gate
alternator: guard alternator-specific handlers with a gate
alternator/server.cc | 64 +++++++++++++++++++++++++++++++++++---------
alternator/server.hh | 4 +++
main.cc | 11 ++++++--
3 files changed, 64 insertions(+), 15 deletions(-)
Convert some more helper functions to accept const reference to
column_specification and column_identifier instead of shared_ptr.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
This patch continues the effort of reducing shared_ptr's count
in the different APIs throughout the cql3 code tree.
These functions now pass cref to column_specification instead of
shared_ptr:
* multiple variants of `validate_assignable_to`
* sets::value_spec_of
* lists::value_spec_of
* lists::index_spec_of
* lists::uuid_index_spec_of
* tuples::component_spec_of
* user_types::field_spec_of
These functions don't pass the shared_ptr around down the call
hierarchy, also obviously assuming that the column_specification
passed is always non-null.
So it's safe to assume that they don't borrow the ownership of
the pointer or knowingly prolongate lifetime of the object
pointed by.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
The following functions now accept const reference to
column_specification instead of shared_ptr:
* lists::index_spec_of
* lists::value_spec_of
* lists::uuid_index_spec_of
* sets::value_spec_of
Changed maps::value_spec_of and maps::key_spec_of signatures
to accept const ref instead of non-const ref to
column_specification.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Change the way `service::pager::paging_state` is passed around
from `shared_ptr` to `lw_shared_ptr`. It's safe since
`paging_state` is final.
Tests: unit(dev, debug)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Helper function for registering metrics for an endpoint,
register_metrics_for(ep) depends on an external state to be updated.
It checks if given metrics are added to a map, and if not, the metrics
are registered, but the mentioned map is expected to be updated
by the caller (e.g. get_ep_stat). This behaviour is error-prone,
because calling this function twice will result in an exception,
since registering metrics twice is not allowed.
Refs #5697
Message-Id: <5a9ddccf52861749dbda4204b5d098cc77bc51eb.1581855769.git.sarna@scylladb.com>
Alternator is able to serve more requests than its database operations,
e.g. a health check and returning the list of its nodes.
These operation, for safety, are no also guarded by the pending
requests gate.
In order to make sure that pending alternator requests are processed
during shutdown, a gate for each shard is introduced. On shutdown,
each gate will be closed and all in-progress operations will be waited upon.
Fixes#5781
Stopping Scylla with alternator enabled is not clean,
because the server does not stop accepting requests
on shutdown, which leads to use-after-free events.
The first step towards a cleaner solution is to implement
alternator_server::stop(), which stops the HTTP/HTTPS servers.
Refs #5781
The instructions in docs/alternator/getting-started.md on how to run
Alternator with docker are outdated and confusing, so this patch updates
them.
First, the instructions recommended the scylladb/scylla-nightly:alternator
tag, but we only ever created this tag once, and never updated it. Since
then, Alternator has been constantly improving, and we've caught up on
a lot of features, and people who want to test or evaluate Alternator
will most likely want to run the latest nightly build, with all the latest
Alternator features. So we update the instructions to request the latest
nightly build - and mention the need to explictly do "docker pull" (without
this step, you can find yourself running an antique nightly build, which
you downloaded months ago!). This instruction can be revisited once
Alternator is GAed and not improving quickly and we can then recommend to
run the latest stable Scylla - but I think we're not there yet.
Second, in recent builds, Alternator requires that the LWT feature is
enabled, and since LWT is still experimental, this means that one needs
to add "--experimental 1" to the "docker run" command. Without it, the
command line in getting-started.md will refuse to boot, complaining that
Alternator was enabled but LWT wasn't. So this patch adds the
"--experimental 1" in the relevant places in the text. Again, this
instruction can and should be revisited once LWT goes out of experimental
mode.
Fixes#5813
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200216113601.9535-1-nyh@scylladb.com>
This patch adds to Alternator's Query operation full support for the
KeyConditionExpression parameter - a newer syntax for specifying which
partition and which sort-key range are to be queried. The older syntax
for the same thing, "KeyConditions", was already supported by Alternator.
The patch also includes additional test cases for more corner cases
discovered during the development. After this patch, all 47 test cases
in test_key_condition_expression.py pass on Alternator (and, of course,
also on DynamoDB).
One interesting thing to note about this patch is that it does *not*
include a new parser for the KeyConditionExpression syntax. It turns out
that we need - to be fully compatible with DynamoDB - to use the
already existing parser for *ConditionExpression* syntax, and then forbid
certain things not allowed in KeyConditionExpression (you can see a lot
of examples in code comments and in the tests included in this patch).
Most importantly, allowing the full ConditionExpression syntax also
means we allow completely useless parentheses on key conditions, e.g.,
'((p=:p) AND (c=:c))'. While the KeyConditionExpression documentation
doesn't mention allowing these parentheses, DynamoDB does support them -
and it turns out that boto3 uses them when you use its condition builders,
as we do in one test case (test_query_key_condition_expression).
Fixes#5037.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200213192509.32685-4-nyh@scylladb.com>
We had a get_key_from_typed_value() utility function to decode a
JSON-encoded value with a known type (the JSON encoding is a map whose
key is the type, the value always a string because all possible key types -
string, bytes and number, are encoded as strings).
However, the function was less useful than it could have been - it was
missing one check for a malformed object (a check which only appeared in
one of its callers), it unnecessarily received the column's expected type
(all the callers passed it the given key column's type).
The cleaned up function will be more useful for the following patch
to support KeyConditionExpression, which wants to reuse it.
While at it, this patch also uses rjson::to_string_view(it->value)
instead of the less correct it->value.GetString() (the latter relies
on null-termination, which is actually true for JSON strings, but there
is no reason to rely on it).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200213192509.32685-3-nyh@scylladb.com>
conditions.cc contains a useful utility function for extracting (without
copying) a string_view from a rjson::value which is known to contain a
string. This function will be useful in more Alternator code, so let's
extract it to rjson.hh, with the name rjson::to_string_view()
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200213192509.32685-2-nyh@scylladb.com>
The table::flush_streaming_mutations is used in the days when streaming
data goes to memtable. After switching to the new streaming, data goes
to sstables directly in streaming, so the sstables generated in
table::flush_streaming_mutations will be empty.
It is unnecessary to invalidate the cache if no sstables are added. To
avoid unnecessary cache invalidating which pokes hole in the cache, skip
calling _cache.invalidate() if the sstables is empty.
The steps are:
- STREAM_MUTATION_DONE verb is sent when streaming is done with old or
new streaming
- table::flush_streaming_mutations is called in the verb handler
- cache is invalidated for the streaming ranges
In summary, this patch will avoid a lot of cache invalidation for
streaming.
Backports: 3.0 3.1 3.2
Fixes: #5769
* seastar 6d2ed8cdc...c7c249f67 (3):
> reactor: fix issue with hrtimer completions being lost
> Merge "refactor network and storage I/O handling in backend code" from Glauber
> reactor: don't call set_heap_profiling_enable() if not needed
Running cdc_test binary fails with a segmentation fault
when run with --smp 1, because test_cdc_across_shards
assumes shard count to be >=2. This patch skips the test case
when run with a single shard and produces a log warning.
Message-Id: <9b00537db9419d8b7c545ce0c3b05b8285351e7d.1581600854.git.sarna@scylladb.com>
There is a case in current PAXOS implementation where timeout is
returned because the code cannot guaranty whether the value is accepted
or not in case of a contention. The counter will help to correlate this
condition with failed requests.
Message-Id: <20200211160653.30317-2-gleb@scylladb.com>
This series implements keyspace-per-table approach for Alternator.
The changes are as follows:
- when a table is created, its keyspace is created first
- after table deletion, its keyspace is deleted as well;
works with views too, since these must be deleted
before the base table is dropped
- instead of SimpleStrategy, network topology is used
Keyspaces are created with a prefix not legal from CQL - 'a#'.
I validated that even though not reachable via CQL, keyspaces
created with # character work well and produce correct directories,
restarts work flawlessly too.
Fixes#5611
Refs #5596
Tests: alternator(local, remote)
Piotr Sarna (3):
alternator: switch to keyspace-per-table approach
alternator: move to NetworkTopologyStrategy
alternator-test: add test for recreating a table
Cells in CDC logs used to be created while completely neglecting TTLs
(the TTLs from cdc = {...'ttl':600}). This patch adds TTLs to all cells;
there are no row markers, so wee need not set TTL there.
Fixes#5688
* jul-stas/5688-set-ttl-in-cdc-log-table:
tests/cdc: added test for TTL on log table cells
cdc: set TTLs on CDC log cells
Merged patch series from Piotr Sarna:
This series addresses and fixes#5758 by providing less terse
configuration for write isolation. Before the patch,
suggested values for alternator write isolation policies was one of
'f', 'a', 'o', 'u', which are not really descriptive.
The code actually checks only the first character from the tag value,
but now the input is validated to allow only specific, expressive values:
* 'a', 'always', 'always_use_lwt' - always use LWT
* 'o', 'only_rmw_uses_lwt' - use LWT only for requests that require
read-before-write
* 'f', 'forbid', 'forbid_rmw' - forbid statements that need read-before-
write. Using such statements
(e.g. UpdateItem with ConditionExpression) will result in an error
* 'u', 'unsafe', 'unsafe_rmw' - (unsafe) perform read-modify-write without
any consistency guarantees
Using other values will result in an error.
This series comes with tests and docs updates.
Fixes#5758
Tests: alternator-test(local,remote)
Piotr Sarna (5):
alternator: move rmw_operation to a header
alternator: add validating write_isolation tag
alternator-test: add test for write isolation tag
alternator-test: mark write isolation tests scylla_only
docs: update write isolation documentation
alternator-test/test_condition_expression.py | 10 +-
alternator-test/test_tag.py | 9 +
alternator/executor.cc | 163 +++++++------------
alternator/rmw_operation.hh | 99 +++++++++++
docs/alternator/alternator.md | 8 +-
5 files changed, 173 insertions(+), 116 deletions(-)
Empty values (zero-sized string in serialized form) were not
handled properly in serialize routines for floating types and
uuids, which led to runtime exceptions and failing tests as
described in https://github.com/scylladb/scylla/issues/5782.
Also fix validation visitor to handle empty values properly.
There already was the code in place that took into
consideration zero-sized values. But it was trying to read
some bytes regardless of that (e.g. for timeuuid values),
even if there is none to read.
Tests: unit(dev, debug)
Fixes: #5782
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200213130021.31598-1-pa.solodovnikov@scylladb.com>
On start, test.py cleans up testlog directory.
The cleanup file search pattern was shell style, not python
glob style, which led to .log files being left around
between runs.
Message-Id: <20200212204047.22398-9-kostja@scylladb.com>
Always open the log file first, this will be necessary to append
output to it in case the test timed out or didn't start.
Message-Id: <20200212204047.22398-5-kostja@scylladb.com>
To be able to easily see what tests have failed as they run,
print failed tests on their own line even if --verbose switch is off.
Message-Id: <20200212204047.22398-4-kostja@scylladb.com>
test.py used a functional programming cookie pattern to
carry tabular console output state, convert this cookie
to an object.
In order to make console output more pretty we'll need to
add more state to it, and keeping this state in a tuple
would be too messy.
Message-Id: <20200212204047.22398-3-kostja@scylladb.com>
In order to pimplify token_metadata_impl::tokens_iterator, we must make it
a non-nested class, since eventually token_metadata_impl will be an incomplete
class for users and nested classes cannot be forward declared. So this patch
makes it a non-nested class. Two inline functions that referred to it were
moved out of class scope so they can see the definition.
No functional changes.
token_metadata is a heavyweight class, with heavyweight include
dependencies (icl, which has tens of thousands of lines in headers),
heavyweight methods, but it rarely used. So it is a classic candidate
for pimmplication.
This patch splits off the implementation into token_metadata_impl
and leaves token_metadata as a forwarding class. Actual movement
of the code is left to a later patch to ease review.
Notes:
- some constructors were made public due to limitations of std::make_unique
- a few token_metadata methods pass *this along to external functions, so we
now pass the holder object as "unpimplified_this" to support this.
Now that we have the necessary infrastructure to do actual scrubbing,
don't rely on `upgrade_sstables()` anymore behind the scenes, instead do
an actual scrub.
Also, use the skip-corrupted flag.
A specialized compaction subclass for executing a scrub compaction.
`scrub_compaction` supplies a specialized reader which will validate its
input and stop on the first error. If it is configured with
`skip_corrupted`, it will instead skip bad data, logging it.
The low-level validator allows fine-grained validation of different
aspects of monotonicity of a fragment stream. It doesn't do any error
handling. Since different aspects can be validated with different
functions, this allows callers to understand what exactly is invalid.
The high-level API is the previous fragment filter one. This is now
built on the low-level API.
This division allows for advanced use cases where the user of the
validator wants to do all error handling and wants to decide exactly
what monotonicity to validate. The motivating use-case is scrubbing
compaction, added in the next patches.
Write isolation tags now accept only a small set of valid values.
The test case ensures that all valid values are accepted
and that invalid values return an error.
In order to prevent users from using incorrect write isolation
configuration, a set of allowed values is introduced.
When tagging a resource (which is considered rare), a tag
will only be allowed if it belongs to the allowed set.
rmw_operation is a class with a public interface, including
a write_isolation enum and a fixed tag name for its configuration.
For convenience, it's moved to a header file, so that code
from executor.cc can use the definitions regardless of their
position in the source file - it prevents reordering functions
just to make sure that rmw_operation is defined before a function
that uses its attributes.
The first iteration of keyspace-per-table approach for alternator
revealed an issue with recreating a table after deleting it.
This test case was used as a regression check.
Imstead of SimpleStrategy, NetworkTopologyStrategy is used
for setting up the replication configuration for alternator tables.
Replication factor 3 is used along with a local datacenter,
unless alternator discovers that it's running on a test cluster with
less than 3 nodes - then, RF is reduced accordingly and emits a warning,
which was also the case for SimpleStrategy.
Instead of a monolith alternator keyspace, each table creates its own
keyspace, named in the following pattern: `a#TABLE_NAME`.
The `a#` prefix contains an illegal CQL character in order to ensure
that these keyspaces are never created via CQL.
raw_cql_statement is a member of prepared_statement which
is not set in its constructor because prepared_statement
constructor has too many call sites inside cql_statement
hierarchy.
cql_statement and prepared_statement dependency form a
cycle and long term it obviously should be fixed.
As a quick fix to query processor tracing, consistently
assign raw_cql_statement in all prepared_statement
usage sites.
The update generation path must track and apply all tombstones,
both from the existing base row (if read-before-write was needed)
and for the new row. One such path contained an error, because
it assumed that if the existing row is empty, then the update
can be simply generated from the new row. However, lack of the
existing row can also be the result of a partition/range tombstone.
If that's the case, it needs to be applied, because it's entirely
possible that this partition row also hides the new row.
Without taking the partition tombstone into account, creating
a future tombstone and inserting an out-of-order write before it
in the base table can result in ghost rows in the view table.
This patch comes with a test which was proven to fail before the
changes.
Branches 3.1,3.2,3.3
Fixes#5793
Tests: unit(dev)
Message-Id: <8d3b2abad31572668693ab585f37f4af5bb7577a.1581525398.git.sarna@scylladb.com>
These series were born when working on debugging (missing)
query processor trace-level logging, and trying to identify
all entry points into parsed_statement::prepare().
Unfortunately I was unable to easily merge prepared_statement
and cql_statement objects.
Rationale for individual patches is given in commit comments.
While Alternator doesn't yet support creating a table with streams
(i.e., CDC) turned on, we should only failed the creation if streams
were really turned on. If the StreamSpecification option exists, but
does *not* ask to turn on streams, we should not fail the creation -
and this patch fixes this.
This patch also adds two tests - one where StreamSpecification is
passed but does not ask to turn on streams (so table creation should
succeed), and another test which explicitly requests to turn on
streams. The second test still xfails on Alternator, and should continue
to do so until we implement streams (we do *not* want to silently
ignore a request to turn on streams).
Fixes#5796
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200212100546.16337-1-nyh@scylladb.com>
boost/regex has huge header dependencies amounting to tens of thousands of
lines. This are now replicated in 167 translation units.
This patch converts like_matcher to use the pointer-to-implementation
idiom, which reduces the number of translations including boost/regex
to 28.
Since regular expressions are relatively expensive, and like_matcher is
relatively rare, the extra memory usage and run time will be
negligible.
Message-Id: <20200211170152.809554-1-avi@scylladb.com>
cql3 has cql_statement, parsed_statement and prepared_statement
classes, which, largely, stand for the same thing. prepared was
an alias for prepared_statement which only required an extra
tag jump in IDE and carried no meaning.
All internal execution always uses query text as a key in the
cache of internal prepared statements. There is no need
to publish API for executing an internal prepared statement object.
The folded execute_internal() calls an internal prepare() and then
internal execute().
execute_internal(cache=true) does exactly that.
Rename an overloaded function process() to execute_direct().
Execute direct is a common term for executing a statement
that was not previously prepared. See, for example
SQLExecuteDirect in ODBC/SQL CLI specification,
mysql_stmt_execute_direct() in MySQL C API or EXECUTE DIRECT
in Postgres XC.
The list_snapshot API, uses http stream to stream the result to the
caller.
It needs to keep all objects and stream alive until the stream is closed.
This patch adds do_with to hold these objects during the lifetime of the
function.
Fixes#5752
Refs #4924
truncate_exception should, like its origin counterpart, set
error code to TRUNCATE_ERROR, not PROTOCOL_ERROR.
tests: unit + partial dtest
Message-Id: <20200212100920.14478-1-calle@scylladb.com>
* seastar 1c7bccc500...6d2ed8cdc6 (11):
> connect_test: keep socket alive until the end.
> Merge "Add timeout to smp::submit_to() and friends" from Botond
> reactor: use reference to addrlen in accept
> tests: stall_detector_test: use same clock as in test as in the detector
> reactor: fallback to epoll backend when fs.aio-max-nr is too small
> util: move read_sys_file_as() from iotune to seastar header, rename read_first_line_as()
> core/resources: fix cpuset error
> distributed_tests: increase sleep time further
> core: thread: Fix compilation error in comment
> reactor: specialize the pollable_fd_state
> build: Use with -fstack-clash-protection when using guard pages
"
Lots of code needs storage_service just to get token_metadata from.
This creates unwanted dependency loops and increases the use of
global storage_service instance.
This set keeps the sharded<locator::token_metadata> on main's stack
and carries the references where needed. This removes the dependency
on storage_service from:
- storage_proxy
- gossiper
- redis
- batchlog manager
and makes the database only need it for sstables_format (will fix
in one of the next sets).
Also, this set is the prerequisite for controlling the copying of
token_metadata instances (spotted two occurrences in bootstrap
code).
Tests: unit(dev), manual start-stop
"
* 'br-token-metadata-standalone-2' of https://github.com/xemul/scylla:
api: Keep and use reference on token_metadata
redis: Use proxy token_metadata
gossiper: Keep needed for failure_detection values on board
database: Use own token_metadata
batchlog: Use token_metadata from proxy
proxy: Use own token_metadata
gossiper: Use own token_metadata
tokens: Switch into standalone sharded instance
batchlog: Use in-config ring-delay
database: Have it in size_estimate_virtual_reader
storage_proxy: Pass token_metadata in some static helpers
storage_service: Move get_local_tokens wrapper
size_estimates_virtual_reader: Make get_local_ranges static
migration_manager: Refactor validation of new/updating ksm
storage_service: Tiny cleanup of excessive self-reference
Allow the thrower to communicate that it doesn't want the compaction to
be retried later. I know, using exceptions for control flow is *very*
bad, but this is the existing mechanism to stop a compaction and I don't
want to invent a new one for this.
Also massage the error messages a bit to take the value of this flag
into consideration.
"
client_state is used simultaneously by many requests running in parallel
while tracing state pointer is per request. Both those facts do not sit
well together and as a result sometimes tracing state is being overwritten
while still been used by active request which may cause incorrect trace
or even a crash.
"
Fixes#5700.
* 'gleb/tracing_fix_v1' of github.com:scylladb/seastar-dev:
client_state: drop the pointer to a tracing state from client_state
transport: pass tracing state explicitly instead of relying on it been in the client_state
alternator: pass tracing state explicitly instead of relying on it been in the client_state
Currently the call chain for a cleanup collection looks like this:
compaction_manager::perform_cleanup()
compaction_manager::rewrite_sstables()
table::cleanup_sstables()
...
`perform_cleanup()` is essentially empty, immediately deferring to
`rewrite_sstables()`. Cleanup related logic is scattered between the
latter two methods on the call chain. These methods however recently
started serving as generic methods for compactions that want to
rewrite each sstable one-by-one, collecting cleanup related ifs in
various places.
The reason is historic, we first had cleanup, then bolted others on top,
trying to share the underlying code as much as possible.
It is time this is cleaned up (pun intended). Make `perform_cleanup()`
the place where all cleanup related logic is, with the rest of the stack
made truly generic.
Instead of the restrictive `cleanup` boolean flag, which allows for choosing
between only two compaction types, use `compaction_options`, which in
addition to allowing any number of compaction types to be selected,
also allows seamlessly passing specific options to them.
Currently the compaction API is quite restrictive. It offers a generic
`compact_sstables()` and `reshard_sstables()` methods. The former is the
one used by all but resharding, however it only really supports two
modes: regular and cleanup. The latter is supported by a semi-hidden
`cleanup` flag in `compaction_description`. Actually there are two more
compaction types already which are piggy-backed on cleanup: upgrade and
scrub. The upper layers distinguish between actual cleanup and "fake"
cleanup by a `is_actual_cleanup` flag. The latter two "fake" cleanup
compactions cannot be distinguished even by the upper layers.
This is terribly confusing and hard to follow, in addition to being
restrictive.
This worked so far, because upgrade is served quite well by the cleanup
compaction type, turning off certain preparations by the above mentioned
`is_actual_cleanup` flag. Scrub is barely implemented and just an
upgrade behind the scenes.
This situation is however preventing really specializing each
compaction. Enter `compaction_options`. This variant in disguise is
designed to allow passing specific option to each compaction type, and
doubles as an enum allowing more than two low level compaction type.
This patch only adds the option class itself, propagating and handling
it will be done by the next patches.
Although we currently do support upgrade compaction, it is piggy-backed
on top of cleanup compaction. This is soon going to change, so in
preparation to that, add an `Upgrade` member to the `compaction_type`
enum.
Currently the cleanup call is short circuited if it is determined that
cleanup is not needed for the sstable to-be-cleaned-up. This is
undesired because actually not just cleanup uses this routine to rewrite
sstables, sstable-upgrade and sstable-scrub also uses it, and they want
to go on with the cleanup compaction sstables even if all data in it
belongs to the current node.
Fix: #5699
Merged pull request https://github.com/scylladb/scylla/pull/5755 from
Avi Kivity:
This series removes some #include dependencies around cql3. It results in
30k line (6.6%) reduction in the preprocessed size of database.i, mainly
due to elimination of boost::regex (which was brought in in turn by
like_matcher). This should result in fewer and faster recompiles.
commits:
tracing: remove #include of modification_statement.hh from table_helper
cql3: selection: remove now-unneeded include of statement_restrictions.hh
cql3: deinline result_set_builder::restrictions_filter constructor
view_info: remove include of select_statement.hh
cql3: selection: remove unnecessary include of selector_factories
cql3: query_processor: reduce #includes
Cells in CDC logs used to be created while completely neglecting
TTLs (the TTLs from `cdc = {...'ttl':600}`). This patch adds TTLs
to all cells; there are no row markers, so wee need not set TTL
there.
Fixes#5688
One of the logging options for Scylla is syslog, this method,
until today wasn't supported in the docker images that are
created with the Dockerfile in the repo.
This commit add rsyslog installation, configuration and
setup for Docker.
Tests: built and ran the docker and validated the existance
of the /dev/log socket.
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <20200210112448.210169-1-eliransin@scylladb.com>
This reverts commit ca55c6c15f.
Triggers the broken promise exception on aborted stop.
If the feature service is stopped without enabling some features,
the later may end up with "broken promise" exception on futures
attached to the _pr promise.
This assert, added by 060e3f8 is supposed to make sure the invariant of
the append() is respected, in order to prevent building an invalid row.
The assert however proved to be too harsh, as it converts any bug
causing out-of-order clustering rows into cluster unavailability.
Downgrade it to on_internal_error(). This will still prevent corrupt
data from spreading in the cluster, without the unavailability caused by
the assert.
Fixes: #5786
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200211083829.915031-1-bdenes@scylladb.com>
Setting TTL = -1 in cdc_options prevents any writes to CDC log.
But enabling CDC and having unwritable log table makes no sense.
Notably, normal writes USING TTL -1 are forbidden. This patch does
the same to TTLs in CDC options.
Fixes#5747
* jul-stas/5747-cdc-disallow-negative-ttl:
tests/cdc: added test for exception when TTL < 0
cdc: disallow negative TTL values in CDC
When registering callbacks for row-level repair verbs the
sched groups is assigned automatically with the help of
messaging_service::scheduling_group_for_verb. Thus the
the lambda will be called in the needed sched group, no
need for manual switch.
This removes the last occurence of global storage_service
usage from row-level repair.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The do_repair_start() emulates db.invoke_on_all and can
re-use the db.local() inside without the need to call for
global storage_service instance.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The caller of repair_writer.create_writer al ready
have the needed reference on database, no need to
get it from global storage_service instance.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This kills the second global reference on storage_service from
batchlog code and breaks the dependency loop between these two.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Way too many places in code needs storage_service just for token_metadata.
These references increase the amount of get(_local)?_storage_service()
calls and create loops in components dependencies. Keep the token_metadata
separately from storage_service and pass instances' references where
needed (for now -- only into the storage_service itself).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This wrapper just makes sure the system_keyspace::get_saved_tokens
reports non empty result. Move them close together.
As a side effect -- get rid of penultimate global storage_service
reference from size_estimates_virtual_reader (the last one will
be removed soon).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This patch introduces following modifications:
Disallows enabling cdc for table X when X_scylla_cdc_log already exists,
Restricts DROP permissions for X_scylla_cdc_log tables,
Restricts ALTER and DROP permissions for cdc_description and cdc_topology_description,
Disallows cdc option when creating materialized views.
Refs #4991.
Tests: unit(dev).
* piodul/4991-permissions-for-cdc-tables:
cdc: disallow CDC options for materialized views
cdc: restrict permissions on cdc_(topology_)description
cdc: restrict permissions on _scylla_cdc_log tables
cdc: refuse to enable cdc when table _scylla_cdc_log exists
we have compaction_manager.compactions metric for the number of active tasks,
but they don't account for tasks blocked waiting for an opportunity to run,
and they're the problematic ones.
Fixes#5254.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200210131929.30981-1-raphaelsc@scylladb.com>
There's the call of the same name in storage_service, so
make this one explicitly static for better readability.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The goal is to have token_metadata reference intide the
keyspace_metadata.validate method. This can be acheived
by doing the validation through the database reference
which is "at hands" in migration_manager.
While at it, merge the validation with exists/not-exists
checks done in the same places.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
client_state is shared between requests and tracing state is per
request. It is not safe to use the former as a container for the later
since a state can be overwritten prematurely by subsequent requests.
Since dpkg does not re-install conffiles when it removed by user,
currently we are missing dependencies.conf and sysconfdir.conf on rollback.
To prevent this, we need to stop running
'rm -rf /etc/systemd/system/scylla-server.service.d/' on 'remove'.
Fixes#5734
Multiple requests can use the same client_state simultaneously, so it is
not safe to use it as a container for a tracing state which is per request.
Currently next request may overwrite tracing state for previous one
causing, in a best case, wrong trace to be taken or crash if overwritten
pointer is freed prematurely.
Fixes#5700
Multiple requests can use the same client_state simultaneously, so it is
not safe to use it as a container for a tracing state which is per
request. This is not yet an issue for the alternator since it creates
new client_state object for each request, but first of all it should not
and second trace state will be dropped from the client_state, by later
patch.
In bash, 'A || B && C' will be problem because when A is true, then it will be
evaluates C, since && and || have the same precedence.
To avoid the issue we need make B && C in one statement.
Fixes#5764
"
There's a lot of code around that needs storage service purely to
get the specific feature value (cluster_supports_<something> calls).
This creates several circular dependencies, e.g. storage_service <->
migration_manager one and database <-> storage_servuce. Also features
sit on storage_service, but register themselfs on the feature_service
and the former subscribes on them back which also looks strange.
I propose to keep all the features on feature_service, this keeps the
latter intependent from other components, makes it possible to break
one of the mentioned circle dependencyand heavily relax the other.
Also the set helps us fighting the globals and, after it, the
feature_service can be safely stopped at the very last moment.
Tests: unit(dev), manual debug build start-stop
"
* 'br-features-to-service-5' of https://github.com/xemul/scylla:
gossiper: Avoid string merge-split for nothing
features: Stop on shutdown
storage_service: Remove helpers
storage_service: Prepare to switch from on-board feature helpers
cql3: Check feature in .validate
database: Use feature service
storage_proxy: Use feature service
migration_manager: Use feature service
start: Pass needed feature as argument into migrate_truncation_records
features: Unfriend storage_service
features: Simplify feature registration
features: Introduce known_feature_set
features: Move disabled features set from storage_service
features: Move schema_features helper
features: Move all features from storage_service to feature_service
storage_service: Use feature_config from _feature_service
features: Add feature_config
storage_service: Kill set_disabled_features
gms: Move features stuff into own .cc file
migration_manager: Move some fns into class
Allows caller to check/wait for a given user keyspace to finish
populating on boot.
Can be called at any time, though if called before population
starts, it will wait until it either starts and we can determine
that the keyspace does not need populating, or population finishes.
tests: unit
Message-Id: <20200203151712.10003-1-calle@scylladb.com>
The disk-error-handler is purely auxiliary thing that helps
propagating IO errors to the rest of the code. It well
deserves not sitting in the root namespace.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200207112443.18475-1-xemul@scylladb.com>
This patch affects the LWT queries with IF conditions of the
following form: `IF col in :value`, i.e. if the parameter
marker is used.
When executing a prepared query with a bound value
of `(None,)` (tuple with null, example for Python driver), it is
serialized not as NULL but as "empty" value (serialization
format differs in each case).
Therefore, Scylla deserializes the parameters in the request as
empty `data_value` instances, which are, in turn, translated
to non-empty `bytes_opt` with empty byte-string value later.
Account for this case too in the CAS condition evaluation code.
Example of a problem this patch aims to fix:
Suppose we have a table `tbl` with a boolean field `test` and
INSERT a row with NULL value for the `test` column.
Then the following update query fails to apply due to the
error in IF condition evaluation code (assume `v=(null)`):
`UPDATE tbl SET test=false WHERE key=0 IF test IN :v`
returns false in `[applied]` column, but is expected to succeed.
Tests: unit(debug, dev), dtest(prepared stmt LWT tests at https://github.com/scylladb/scylla-dtest/pull/1286)
Fixes: #5710
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200205102039.35851-1-pa.solodovnikov@scylladb.com>
query_processor is a central class, so reducing its includes
can reduce dependencies treewite. This patch removes includes
for parsed_statement, cf_statement, and untyped_result_set and
fixes up the rest of the tree to include what it lacks as a result
of these removals.
This patch adds comprehensive tests for KeyConditionExpression, the newer
DynamoDB API syntax for specifying the item range which is requested from
a Query (this syntax replaced the older KeyConditions syntax, which
Alternator already supports).
Before this patch, we had only a small test for KeyConditionExpression
in test_query.py. This patch replaces it by a large number of small
tests, testing the many sub-features of KeyConditionExpression -
its different operators, sort-key types, different failure modes, etc.
As usual, because we haven't yet implemented this feature in Alternator
(see issue #5037), all these tests pass on AWS, but xfail on Alternator.
Despite the new test file containing about 40 small tests, it finishes
very quickly because we use pytest's fixture feature to allow small
read-only tests to perform a query to a partition that is only written
once for many tests. So these small tests become extremely fast, and
there is no downside to having many small tests instead of lumping them
into fewer large tests checking many things.
Refs #5037.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200207134159.3283-1-nyh@scylladb.com>
This patch changes the way TTL is stored in the CDC log table. Instead
of including TTL of cell `X` in the third element of the tuple in column
`_X`, TTL is written to the previously unused column `ttl`. This is done
for cosmetic purposes.
This implementation works under assumption that there will be only one
TTL included in a mutation coming from a CQL write. This might not be
the case when writing a batch that modifies the same row twice, e.g.:
```
BATCH
INSERT INTO ks.t (pk, ck, v1) VALUES (1,2,3) USING TTL 10;
INSERT INTO ks.t (pk, ck, v2) VALUES (1,2,3) USING TTL 20;
END BATCH
```
In this case, this implementation will choose only one TTL value to be
written in the CDC log:
```
... | batch_seq_no | _ck | _pk | _v1 | _v2 | operation | ttl
...-+--------------+-----+-----+--------+--------+-----------+-----
... | 0 | 2 | 1 | (0, 3) | (0, 3) | 1 | 20
```
This behavior might be changed as a part of issue #5719, which considers
splitting a batch write mutation when it contains multiple writes to the
same row.
Refs #5689
Tests: unit(dev)
The view_update_generator acceps (and keeps) database and storage_proxy,
the latter is only needed to initialize the view_updating_consumer which,
in turn, only needs it to get database from (to find column family).
This can be relaxed by providing the database from _generator to _consumer
directly, without using the storage_proxy in between.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200207112427.18419-1-xemul@scylladb.com>
Two places in dht code have token_metadata _value_ arguments, but only read
tokens from them. Optimize it a bit by turning values into const references.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200207112408.18352-1-xemul@scylladb.com>
Truncation time is used on each LWT request now, so reading it from
the table is too heave operation to be on a fast path. It also requires
jumping to a shard that contains corresponding data. This patch caches
the data on the table object of each shard for easy access. The cache is
initialized during boot from system.truncated table and updated on each
truncation operation.
Message-Id: <20200206163838.5220-2-gleb@scylladb.com>
We made --localrpm option to automatically build rpms from sourcecode,
but we actually use the option to produce AMI using prebuilt rpm on our
CI.
To simplified the script, and to prevent accsidently start rpm build
in the script, drop rpm build part.
get_snapshot should use http stream to reduce memory allocation and
stalls.
This patch change the implementation so it would stream each of the
snapshot object instead of creating a single response and return it.
Fixes#5468
Depends on scylladb/seastar#723
On our build system we tries to build relocatable package multiple times on
same revision of the repository, it executes ./SCYLLA-VERSION-GEN for each time.
When the build job invoked at midnight and it did not finished until 12:00AM,
first build and last build has different SCYLLA-RELEASE-FILE, since it contains
current date.
To prevent it, skip updating SCYLLA-*-FILE when git hash unchanged.
Fixesscylladb/scylla-pkg#826
Currently reader_concurrency_semaphore::signal() can fail. This is
dangerous in two ways:
* It is called from constructors, so the exception can bring down the
node. This will convert an `std::bad_alloc` to a crash.
* Reads in the queue will be blocked until they either time-out, or
another `signal()` succeeds.
To solve this, wrap the `reader_permit` constructor, the only code that
can throw, with try-catch and forward the exception to the reader
admission promise. In practice this will result in the flushing of the
reader queue, when we fail to admit a read.
Fixes#5741
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200206154238.707031-1-bdenes@scylladb.com>
This patch is a bag of fixes/cleanups that were omitted from the reader
memory tracking series due to contributor error. It contains the
following changes:
* Get rid of unused `increase()` and `decrease()` methods.
* Make all constructors and assignment operators `noexcept`.
* Make move assignment operator safe w.r.t. self assignment.
* `reset()`: consume the new amount before releasing the old amount,
to prevent a transient window where new readers might be admitted.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200206143007.633069-1-bdenes@scylladb.com>
Most of the time schema does not have to be copied and sometimes it's not even used.
tests: unit(dev)
Closes#5739
* hawk/remove_schema_copies:
multishard_mutation_query_test: stop capturing unused schema
index_reader: avoid copying schema to lambda
Merged patch series from Piotr Sarna:
This series adds a way to confgure alternator write isolation policy
per-table with the use of tags.
Instead of hardcoded LWT_ALWAYS policy, it can now be set by tagging
a table with a tag of the following form:
{
'Key': 'system:write_isolation',
'Value': X
},
where X is one of the following implemented levels:
* 'f' - forbid RMW
* 'a' - always enforce RMW
* 'o' - only RMW writes will go through LWT
* 'u' - unsafe RMW (to be deprecated/eradicated)
By default, if no tag is found, alternator falls back to always applying
LWT to writes.
This series also contains fixes to the tagging interface - some minor
issue came up while implementing write isolation config on top of tags.
test: alternator-test(local,remote)
Piotr Sarna (6):
alternator: return tags for a table via const reference
alternator: fix overwriting tags
alternator: make _write_isolation a protected attribute
alternator: add configuring write isolation policy via tags
alternator-test: add testing different write isolation policies
docs: update alternator on write isolation
alternator-test/test_condition_expression.py | 63 ++++++++++++++
alternator-test/test_tag.py | 25 ++++++
alternator/executor.cc | 89 +++++++++++++-------
docs/alternator/alternator.md | 21 +++--
4 files changed, 162 insertions(+), 36 deletions(-)
Merged pull request https://github.com/scylladb/scylla/pull/5733 from
Piotr Jastrzębski:
In many places we use global_partitioner() to obtain parameters that are
available in config. This PR replaces number of global_partitioner() calls
with equivalent non-global ways.
tests: unit(dev)
* 'reduce_global_usage' of github.com:haaawk/scylla:
storage_service: reduce number of global_partitioner calls
cdc: remove partitioner from db_context
gossiper: stop calling global_partitioner()
system_keyspace: stop calling global_partitioner()
transport/server: stop calling global_partitioner()
thrift: stop calling global_partitioner()
partitioner: move cpu_sharding_algorithm_name to token-sharding.hh
Until now, write isolation policy was hardcoded to always enforcing LWT.
From now on, setting a tag via UpdateTags request or during table
creation will associate a policy with given table.
The tag key is 'system:write_isolation' and its value can be one of:
* 'f' - forbid RMW
* 'a' - always enforce RMW
* 'o' - only RMW writes will go through LWT
* 'u' - unsafe RMW (to be deprecated/eradicated)
Tagging a resource with a tag key that already exists should result
in overwriting the old value. It wasn't the case, so it's now fixed
and an appropriate test is added.
The signature of the helper function is changed, so that it's possible
to acquire a const reference of the tags, instead of being forced
to get a copy of the whole map (potentially large).
Here is a rebase of some of my already-reviewed Alternator patches -
the final piece of the fix to LWT timestamps (in BatchWriteItems),
The "/localnodes" request, and a couple of patches reducing the number
of times that the global storage_proxy is needed.
Also available in a github branch, git@github.com:nyh/scylla.git series1
* nyh/series1:
redis: remove redundant code
storage_proxy: make it into a peering sharded service
alternator: use simpler API for registering Alternator's HTTP URLs
alternator-test: test "/localnodes" feature
alternator: add public API for list of nodes in current DC
alternator: use LWT timestamp - in BatchWriteItems too
Replace global_partitioner().sharding_ignore_msb() call
with config::murmur3_partitioner_ignore_msb_bits()
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Sharding logic has been moved to token-sharding.hh some time ago.
This logic does not depend on partitioner any more so cpu_sharding_algorithm_name
can be safely moved to the header where rest of sharding logic lives.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
We consider globals like service::get_storage_proxy() a bad idea,
and would like to reduce their use as much as possible - and eventually,
eliminate it completely.
One easy case to fix case is when we already have a shard-local proxy,
but now we need the sharded object, to invoke_on() something on it.
In this patch, we turn storage_proxy into a peering_sharded_service.
This means that if you already have a storage_proxy, you can call
its container() function to get the sharded<storage_proxy>, without
needing to call the global service::get_storage_proxy().
We found a few such cases in storage_proxy itself, and in Alternator,
and fixed them to use container() instead of the global function.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We used the Seastar HTTP server's add() method to register URLs to
serve (so-called "routes"), but as suggested by Amnon, when we have
fixed URLs without parameters being path components, it's simpler
to use the put() method to do the same thing - and also results in
slightly less work at run-time to look up these routes.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This is a partial test for the "/localnodes" request, which is supposed to
return the list of live nodes in this DC. Because of the limitations of our
current alternator-test framework (which should work on any pre-existing
cluster), we don't know what to expects as a reply, but we just verify the
minimum: The request is understood, returns a JSON list, which contains
at least one item.
As "/localnodes" is a Scylla-only feature, this test is skipped when
running with "--aws".
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
If we want to balance the Alternator request load among the different nodes
(Refs #5030), the load balancer - whether it uses HTTP load balancing or
DNS - needs to be able to get an up-to-date list of live nodes to which it
can direct Alternator traffic. This list should include only the live nodes
in the same data center (geographical region) - it is expected that a
separate load balancer will be installed in each data center, and clients
from within this data center will reach this data center's load balancer.
There are multiple APIs in current Scylla to do something similar to what
we need, but as far as I know, none of them is exactly what we need or
convenient for Alternator installations: We don't want the load balancer
to use CQL, and the REST API http://localhost:10000/gossiper/endpoint/live/
doesn't do what we need (it doesn't restrict the list to one data center)
plus it's not open to connections outside the machine.
So in this patch, we implement a new HTTP request on the Alternator port -
"/localnodes", returning a JSON-formatted list of all live nodes in the
contacted node's data center:
$ curl http://localhost:8000/localnodes
["127.0.0.2","127.0.0.1","127.0.0.3"]
Like the existing health check HTTP request, this operation is public and
unauthenticated. We consider the security risk low - it allows an attacker
to enquire the list of Scylla nodes in this DC, but an attacker can achieve
the same thing by just scanning the addresses in this subnet using the health
check request (or even with ordinary DynamoDB API requests).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
A previous patch fixed Alternator's writes to use the timestamp provided
by LWT instead of the current timestamp. That patch fixed the PutItem,
DeleteItem and UpdateItem operations - and this patch fixes the remaining
write operation: BatchWriteItems. So,
Fixes#5653.
Unfortunatly, the requirements of both BatchWriteItems and LWT make the
resulting code - and this patch - somewhat inelegant. BatchWriteItems
requires that we prepare all the operations first - failing if any of them
has an error. Before this patch, the result of this preparation was an
array of mutations, which in a second step we wrote to the database.
But we can no longer use mutations for the result of the first step,
because creating a mutation requires knowing the timestamp, which we don't
know during the preparate phase - we will only know it during the later
LWT operation. So now we need to invent a new intermediate format between
the request and the mutation. This intermediate format is further
complicated by the need to be send it between shards (for LWT's shard
forwarding) so it cannot, for example, contain a reference to a schema.
The fact that different sub-operations need to be sent to different shards,
and that different sub-operations may write to different tables, further
complicate the book-keeping and gives us a bunch of funky-typed maps.
But eventually it all fits together.
After this patch, as before this patch, the same code (now called
put_or_delete_item), is used to implement both the PutItem and DeleteItem
stand-alone operation, and the BachWriteItems operation which includes a
whole list of these PutItem and DeleteItem operation.
This patch also includes two more tests in test_batch.py, which test two
more corner tests we haven't tested before: One tests the capability of
BatchWriteItems to write to more than one table. The other tests that
BatchWriteItems can write an empty item (it is not surprising that it does,
but we do have special code for this case, so we should test it).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
"
This series fixes an assertion when initialization fails after
creating a database. I don't know of a case where that currently
happens, but it is easy to cause that when writing a patch and the
produced assert is just confusing.
"
* 'espindola/dont-assert-on-init-error' of https://github.com/espindola/scylla:
db: Replace large_data_handler::_stopped with _running
db: Move nop_large_data_handler constructor out-of-line
db: Move large_data_handler::stop out-of-line
This tests fails when run on more than 1 core.
Tests: unit(dev)
* hawk/fix_memory_footprint:
memory_footprint_test: Make it clear it has to run with -c1
tests: move memory_footprint_test to perf/
"
After deprecating partitioners other than Murmur3 we can change the representation of tokens to int64_t. This will allow setting custom partitioner on each table. With this change partitioners become just converters from partition keys to tokens (int64_t). Following operations are no longer dependant on partitioner implementation:
- Tokens comparison
- Tokens serialization/deserialization to strings
- Tokens serialization/deserialization to bytes
- Sharding logic
- Random token generation
This change will be followed by a PR that enables per table partitioner and then another PR that introduces a special partitioner for CDC tables.
Tests: unit(dev)
Results of memory footprint test:
Differences:
in cache: 992 vs 984
in memtable: 750 vs 742
sizeof(cache_entry) = 112 vs 104
-- sizeof(decorated_key) = 36 vs 32
MASTER:
mutation footprint:
in cache: 992
in memtable: 750
in sstable: 351
frozen: 540
canonical: 827
query result: 342
sizeof(cache_entry) = 112
-- sizeof(decorated_key) = 36
-- sizeof(cache_link_type) = 32
-- sizeof(mutation_partition) = 96
-- -- sizeof(_static_row) = 8
-- -- sizeof(_rows) = 24
-- -- sizeof(_row_tombstones) = 40
sizeof(rows_entry) = 232
sizeof(lru_link_type) = 16
sizeof(deletable_row) = 168
sizeof(row) = 112
sizeof(atomic_cell_or_collection) = 8
THIS PATCHSET:
mutation footprint:
in cache: 984
in memtable: 742
in sstable: 351
frozen: 540
canonical: 827
query result: 342
sizeof(cache_entry) = 104
-- sizeof(decorated_key) = 32
-- sizeof(cache_link_type) = 32
-- sizeof(mutation_partition) = 96
-- -- sizeof(_static_row) = 8
-- -- sizeof(_rows) = 24
-- -- sizeof(_row_tombstones) = 40
sizeof(rows_entry) = 232
sizeof(lru_link_type) = 16
sizeof(deletable_row) = 168
sizeof(row) = 112
sizeof(atomic_cell_or_collection) = 8
"
* 'fixed_token_representation' of https://github.com/haaawk/scylla: (21 commits)
token: cast to int64_t not long in long_token
murmur3: move sharding logic to token and i_partitioner
partitioner: move shard_of_minimum_token to token
partitioner: remove token_to_bytes
partitioner: move get_token_validator to token
partitioner: merge tri_compare into dht::tri_compare
partitioner: move describe_ownership to token
partitioner: move from_bytes to token
partitioner: move from_string to token
partitioner: move to_sstring to token
partitioner: move get_random_token to token
partitioner: move midpoint function to token
token: remove token_view
sstables: use copy constructor for tokens
token: change _data to int64_t
partitioner: remove hash_large_token
token: change data to array<uint8_t, 8>
partitioner: Extract token to separate .hh and .cc files
partitioner: remove unused functions
Revert "dht/murmur3_partitioner: take private methods out of the class"
...
Since token representation is fixed now, all the partitioners
will share the sharding logic. It makes sense now to keep
the logic in common super class and separate header that's
included only in i_partitioner.cc.
shard_of and token_for_next_shard are now implemented in
i_partitioner. They would be non-virtual but we have to
keep them virtual because one test is overriding them
to enforce some specific sharding.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
i_partitioner::token_to_bytes is just a call to
token::data and does not depend on partitioner
at all. It is possible to convert token to bytes
without having access to partitioner.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Previously _data was stored as array of 8 bytes in
network byte order.
After this change it stores the same value in int64_t
in host byte order.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Now that token representation is always array<uint8_t, 8>,
hash<dht::token> will always pick
read_le<size_t>(reinterpret_cast<const char*>(b.data()))
and never call hash_large_token because the check
is always true b.size() == sizeof(size_t).
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
It is save to do such change because we support only
Murmur3Partitioner which uses only tokens that are
8 bytes long.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This patch conflicts with the following patches.
The final effect is equivalent and it's easier to revert this patch
and cleanly apply already reviewed patches.
This reverts commit f4f8593bac.
This is not just a direct flip to a variable with the negated Boolean
value. When created, a large_data_handler is not considered to be
running, the user has to call start() before it can be used.
The advantaged of doing this is that if initialization fails and a
database is destructed before the large_data_handler is started, the
assert
database::stop() {
assert(!_large_data_handler->running());
is not triggered.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
"
This patch series fixes#5711, enables UDF support in CQL tests and
and includes a few extra cleanups.
"
* 'espindola/lua-fixes' of https://github.com/espindola/scylla:
lua: Use a negative index for consistency
lua: Fix returning list<decimal>
lua: Fix returning list<varint>
lua: Use a lua_slice_state instead of a from_lua_visitor
test: Enable UDF in the cql repl
The DynamoDB API doesn't have the notion of client-supplied timestamps,
so the server is supposed to use its own current timestamp for write
operations.
However, for LWT writes, we should not use this node's current time:
Different nodes may slightly differ in their clocks, and LWT needs
a monotonically-increasing notion of time for the consistent operations.
LWT provides to the operation's apply() method the specific timestamp that
it should use in its returned mutation - and we should use this timestamp,
not the current timestamp.
In the optional write modes where LWT is not used, we continue to use
the current timestamp (api::new_timestamp()) as before.
This patch fixes the PutItem, UpdateItem and DeleteItem operations.
The BatchWriteItem operation is not yet fixed by this patch - fixing
it will require more elaborate code changes so will be done in a
separate patch.
Refs #5653.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200130122853.7658-1-nyh@scylladb.com>
Today, we use LWT for all PutItem, UpdateItem and GetItem operations.
We do this even for pure writes - writes which do not involve a read
before the write).
But BatchWriteItem also does pure writes - and it doesn't use LWT yet.
So this patch changes it so it does. As before we keep in the code -
not yet configurable by a user - also the option to do these unconditional
writes without LWT.
A BatchWriteItem may change multiple partitions (but a fairly low number -
DynamoDB allows each BatchWriteItem to only do 25 updates) and we start the
different LWT operations in parallel.
This patch collects multiple mutations to the same partition together to
be done with a single LWT operation, so we also add a test for this case,
were we have a batch of writes involving several items in each of several
partitions.
Fixes#5637
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200128160538.11775-1-nyh@scylladb.com>
This produce code that is just as fast as the previous implementation
and is quite a bit easier to read IMHO.
I benchmarked it by temporally adding:
BOOST_AUTO_TEST_CASE(bench_maybe_quote) {
std::string val(1 << 20, 'x');
using clk = std::chrono::steady_clock;
cql3::util::maybe_quote(val);
auto start = clk::now();
for (int i = 0; i < 1000; ++i) {
cql3::util::maybe_quote(val);
}
auto end = clk::now();
std::chrono::duration<double> duration = end - start;
std::cout << "delta = " << duration.count() << '\n';
}
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200203225140.180262-1-espindola@scylladb.com>
* seastar 65980a9b30...30185fd901 (12):
> sstring: resize: NulTerminate when downsizing
> reactor: make open_flags::dsync respect --unsafe-bypass-fsync
> json/json_elements: Use double quotes around element name
> Revert "reactor: make open_flags::dsync respect --unsafe-bypass-fsync"
> Merge "smp: reduce allocations in work_item::process" from Avi
> task: optimize destruction by making destructor non-virtual
> reactor: make open_flags::dsync respect --unsafe-bypass-fsync
> Revert "sstring: resize: NulTerminate when downsizing"
> sstring: resize: NulTerminate when downsizing
> tests: Rename unix domain socket test for consistency
> resource: downgrade cgroupsv2 message.
> Merge "Simplify the stream/subscription implementation" from Rafael
Merged pull request https://github.com/scylladb/scylla/pull/5485
by Kamil Braun:
This series introduces the notion of CDC generations: sets of CDC streams
used by the cluster to choose partition keys for CDC log writes.
Each CDC generation begins operating at a specific time point, called the
generation's timestamp (cdc_streams_timestamp in the code).
It continues being used by all nodes in the cluster to generate log writes
until superseded by a new generation.
Generations are chosen so that CDC log writes are colocated with their
corresponding base table writes, i.e. their partition keys (which are CDC
stream identifiers picked from the generation operating at time of making
the write) fall into the same vnode and shard as the corresponding base
table write partition keys. Currently this is probabilistic and not 100%
of log writes will be colocated - this will change in future commits,
after per-table partitioners are implemented.
CDC generations are a global property of the cluster -- they don't depend
on any particular table's configuration. Therefore the old "CDC stream
description tables", which were specific to each CDC-enabled table,
were removed and replaced by a new, global description table inside the
system_distributed keyspace.
A new generation is introduced and supersedes the previous one whenever
we insert new tokens into the token ring, which breaks the colocation
property of the previous generation. The new generation is chosen to
account for the new tokens and restore colocation. This happens when a
new node joins the cluster.
The joining node is responsible for creating and informing other nodes
about the new CDC generation. It does that by serializing it and inserting
into an internal distributed table ("CDC topology description table").
If it fails the insert, it fails the joining process. It then announces
the generation to other nodes through gossip using the generation's
timestamp, which is the partition key of the inserted distributed table
entry.
Nodes that learn about the new generation through gossip attempt to
retrieve it from the distributed table. This might fail - for example,
if the node is partitioned away from all replicas that hold this
generation's table entry. In that case the node might stop accepting
writes, since it knows that it should send log entries to a new generation
of streams, but it doesn't know what the generation is. The node will keep
trying to retrieve the data in the background until it succeeds or sees
that it is no longer necessary (e.g., because yet another generation
superseded this one). So we give up some availability to achieve safety.
However, this solution is not completely safe (might break consistency
properties): if a node learns about a new generation too late (if gossip
doesn't reach this node in time), the node might send writes to the wrong
(old) generation. In the future we will introduce a transaction-based
approach where we will always make sure that all nodes receive the new
generation before any of them starts using it (and if it's impossible
e.g. due to a network partition, we will fail the bootstrap attempt).
In practice, if the admin makes sure that the cluster works correctly
before bootstrapping a new node, and a network partition doesn't start
in the few seconds window where a new generation is announced, everything
will work as it should.
After the learning node retrieves the generation, it inserts it into an
in-memory data structure called "CDC metadata". This structure is then
used when performing writes to the CDC log -- given the timestamp of the
written mutation, the data structure will return the CDC generation
operating at this time point. CDC metadata might reject the query for
two reasons: if the timestamp belongs to an earlier generation, which
most probably doesn't have the colocation property anymore, or if it is
picked too far away into the future, where we don't know if the current
generation won't be superseded by a different one (so we don't yet know
the set of streams that this log write should be sent to). If the client
uses server-generated timestamps, the query will never be rejected.
Clients can also use client-generated timestamps, but they must make sure
that their clocks are not too desynchronized with the database --
otherwise some or all of their writes to CDC-enabled tables will be
rejected.
In the case of rolling upgrade, where we restart nodes that were
previously running without CDC, we act a bit differently - there is no
naturally selected joining node which must propose a new generation.
We have to select such a node using other means. For this we use a bully
approach: every node compares its host id with host ids of other nodes
and if it finds that it has the greatest host id, it becomes responsible
for creating the first generation.
This change also fixes the way of choosing values of the "time" column
of CDC log writes: the timeuuid is chosen in a way which preserves
ordering of corresponding base table mutations (the timestamp of this
timeuuid is equal to the base table mutation timestamp).
Warning: if you were running a previous CDC version (without topology
change support), make sure to disable CDC on all tables before performing
the upgrade. This will drop the log data -- backup it if needed.
TODO in future patchset: expire CDC generations. Currently, each inserted
CDC generation will stay in the distributed tables forever (until
manually removed by the administrator). When a generation is superseded,
it should become "expired", and 24 hours after expiration, it should be
removed. The distributed tables (cdc_topology_description and
cdc_description) both have an "expired" column which can be used for
this purpose.
Unit tests: dev, debug, release
dtests (dev): https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/907/
Now that we bounce lwt requests to the correct shard before calling into
storage_proxy the cross shard op accounting does not account for bounced
lwt statement. Fix that by increasing corresponding counter when
returning a "bounce" reply.
Message-Id: <20200203122011.GH26048@scylladb.com>
Merged patch series from Benny Halevy:
The function was reimplemented to solve the following issues.
The cutom implementation also improved its performance in
close to 19%
Using regex_match("[a-z][a-z0-9_]*") may cause stack overflow on long input strings
as found with the limits_test.py:TestLimits.max_key_length_test dtest.
std::regex_replace does not replace in-place so no doubling of
quotes was actually done.
Add unit test that reproduces the crash without this fix
and tests various string patterns for correctness.
Note that defining the regex with std::regex::optimize
still ended up with stack overflow.
Fixes#5671
* cql3::util::maybe_quote: avoid stack overflow and fix quote doubling
* cql3::util::maybe_quote: further optimize quote doubling
Merged patch series from Gleb Natapov:
Batch statement can also execute LWT and hence need to handle
bounce_to_shard result.
* transport: handle bounce_to_shard for batch statement
* transport: consolidate bounce_to_shard handling between all three verbs that handle it
Command line options are printed out, so if a user cuts-and-pastes a
command line they will get a run that is more similar to the one that
the test executed.
Message-Id: <20200202133209.209608-1-avi@scylladb.com>
In this case we know the size of the stack and both indexes refer to
the same position. Using a negative index is just more consistent with
the rest of the file and hopefully a bit less brittle to future
changes.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
We were accessing the wrong stack location if a decimal was not at top
of the stack.
Fixes: #5711
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
We were accessing the wrong stack location if a varint was not at the
top of the stack.
Refs: #5711
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
A few places were using a from_lua_visitor only to access the
lua_slice_state member variable.
This is just a simplification. No functionality changed.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
All three verbs that need to handle bounce_to_shard have almost
identical process_*() and process_*_on_shard() functions. Consolidate
them into one to reuse the code.
The caller of check_knows_remote_features merges a set of
features into a string, but the method in question ... splits
them back into the set. Avoid this unneeded step and clean
the respective storage service helpers.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are some places that get global storage_service instance
for individual features. In the next patch all these helpers
will be removed, so here's the preparation for it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's no local variable to get features from in the
create_view_statement constructor, but since the .validate
is always called after it, it looks safe to check for
needed feature in it (we have storage_proxy there).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Keep local feature_service reference on database. This relaxes the
circular storage_service <-> database reference, but not removes it
completely.
This needs some args tossing in apply_to_builder, but it's
rather straightforward, so comes in the same patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Keep reference on local feature service from storage_proxy
and use it in places that have (local) storage_proxy at hands.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This unties migration_manager from storage_service thus breaking
the circular dependency between these two.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The storage service no longer needs to mess with feature
config. It only needs two features to register onself in,
but this can be solved by respective cluster_supports_foo
helpers.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now features are registered into a map of vectors, but
it looks like the vector is always 1-item long and is
used to keep pointer on feature, instead of the feature
itself.
Switch it into map of reference_wrapper-s.
Before this patch we could register more than one
feature under the same name, now we can't. But this
seems to be OK, as we don't actually do this. To catch
violations of this restriction there's an assert() in the
feature_service::register_feature.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two masks -- supported and known. They differ in
unbounded_range_tombstones one which is set depending on the
sstables format in use.
Since the feature_service doesn't know anything about sstables
format, the logic is reverted -- the feature service reports
back the known mask (all features) and storage_service clears
the unbounded_range_tombstones if the sst format is low -- but
is (hopefully) left intact.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
And leave some temporary storage_service->feature links. The plan
is to make every subsystem that needs storage_service for features
stop doing so and switch on the feature_service.
The feature_service is the service w/o any dependencies, it will be
freed last, thus making the service dependency tree be a tree,
not a graph with loops.
While at it -- make all const-s not have _FEATURE suffix (now there
are both options) and const-qualify cluster_supports_lwt().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Some features take db::config to find out whether to be enabled
or disabled. This creates unwanted dependency between database and
features, so split the features configuration explicitly. Also
this will make the "this is for testing env only" logic cleaner
and simpler to understand.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The _disabled_features is configured by tests via storage_service
constructor, so the helper in question is effectively useless.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
If we are a seed node (but not the only one) or we set
auto_bootstrap=off, it might happen due to misconfiguration or a network
partition that we don't know other nodes' tokens at the end of the
join_token_ring function, when we go into the NORMAL status, finishing
the joining process.
CDC however requires that we know other nodes' tokens at this point:
we need them to correctly create a new CDC generation.
This commit adds a check which prevents the node from starting if that's
not the case. If the check fails, the node first tries waiting a bit until
it learns about the tokens or timeouts.
"
The fix itself is fairly simple, but looking at the code I found that
our code base was not cleanly distinguishing null and empty values and
was treating null and missing values differently, but that distinction
was dead since a null is represented as a dead cell.
"
* 'espindola/lua-fix-null-v6' of https://github.com/espindola/scylla:
lua: Handle nil returns correctly
types: Return bytes_opt from data_value::serialize
query-result-set: Assert that we don't have null values
types: Fix comparison of empty and null data_values
Revert "tests: Handle null and not present values differently"
query-result-set: Avoid a copy during construction
types: Move operator== for data_value out-of-line
"
In a few places, the only use we had for a subscription was calling
done(). With this series we now call done() early and store the
future<> instead.
"
* 'espindola/stream-cleanup' of https://github.com/espindola/scylla:
sstable_test: Store a future<> instead of a subscription
commitlog: Store a future instead of a subscription in db::commitlog::segment_manager::list_descriptors::helper
lister: Store a future<> instead of a subscription
With this change we always rebuild seastar/libseastar_testing.a for
the same reason we always rebuild seastar/libseastar.a: We have no
idea what its dependencies are, we have to recurse to seastar to find
out.
The other missing dependency is that we have to rebuild build.ninja
when seastar/CMakeLists.txt changes. A change in
seastar/CMakeLists.txt can cause seastar.pc to change which can change
the command lines used.
That is incomplete as change other seastar files can have the same
impact, but it is better than nothing.
It is not sufficient to put a dependency in the seastar.pc file as
that file will be modified when cmake is run and the scylla ninja
process doesn't see the CMakeLists.txt to seastar.pc edge.
Fixes: #5687
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200201001126.458992-1-espindola@scylladb.com>
This unregistration doesn't happen currently, but doesn't seem to
cause any problems in general, as on stop gossiper is stopped and
nothing from it hits the store_service.
However (!) if an exception pops up between the storage_service
is subscribed on gossiper and the drain_on_shutdown defer action
is set up then we _may_ get into the following situation:
- main's stuff gets unrolled back
- gossiper is not stopped (drain_on_shutdown defer is not set up)
- migration manager is stopped (with deferred action in main)
- a nitification comes from gossiper
-> storage_service::on_change might want to pull schema with
the help of local migration manager
-> assert(local_is_initialized) strikes
Fix this by registering storage_service to gossiper a bit earlier
(both are already initialized y that time) and setting up unregister
defer right afterwards.
Test: unit(dev), manual start-stop
Bug: #5628
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200130190343.25656-1-xemul@scylladb.com>
We use eventually() in tests to wait for eventually consistent data
to become consistent. However, we see spurious failures indicating
that we wait too little.
Increasing the timeout has a negative side effect in that tests that
fail will now take longer to do so. However, this negative side effect
is negligible to false-positive failures, since they throw away large
test efforts and sometimes require a person to investigate the problem,
only to conclude it is a false positive.
This patch therefore makes eventually() more patient, by a factor of
32.
Fixes#4707.
Message-Id: <20200130162745.45569-1-avi@scylladb.com>
No-one seems to invoke this method. Instead, clients invoke
restriction::values (note singular "restriction"). Most subclasses of
restrictions also inherit from restriction, so values() still exists
in their public interface.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
"
Benny pointed out that we could avoid a branch inside a loop is the
old serialization code. That got me looking at the logic and I found
that it would also produce an unnecessary 0xff prefix for some
negative numbers.
This patch series fixes the serialization and optimizes it. It now
does no extra copies for positives numbers and only one extra copy for
negative numbers, which I think is optimal since cpp_int uses sign
magnitude and we want the 2 complement representation.
"
* 'espindola/serialize_varint-improvements-v2' of https://github.com/espindola/scylla:
types: Use a fancy iterator to avoid a temporary buffer
types: Use export_bits to serialize cpp_int
types: Avoid a branch in a loop
types: Fix encoding of negative varint
types: Replace "num.sign() < 0" with "num < 0"
By using a fancy iterator we can avoid calling export_bits with a
temporary buffer before copying the result to the output.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
We would sometimes produce an unnecessary extra 0xff prefix byte.
The new encoding matches what cassandra does.
This was both a efficiency and correctness issue, as using varint in a
key could produce different tokens.
Fixes#5656
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The only use we had for the subscription was calling done, may as well
call it early and store the future<>.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The only use we had for the subscription was calling done, may as well
call it early and store the future<>.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The only use we had for the subscription was calling done, may as well
call it early and store the future<>.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Treat writes to local.paxos as user memory, as the number of writes is
dependent on the amount of user data written with LWT.
Fixes#5682
Message-Id: <20200130150048.GW26048@scylladb.com>
This set implements support for per scheduling group statistics in
storage proxy and tables view statistics (although tables view per
scheduling group stats are not actively applied in this series).
Having those statistics per scheduling group can help in finding operations
that are performed outside their context, another advantage is that
it lays the land for supporting per service level statistics for the
workload prioritization enterprise feature.
At some point there was a thought to add those stats per role but
for now it is not feasible at the moment:
1. The number of roles/user is unbounded so it is dangerous to
hold stats (in memory) for all of them.
2. We will need a proper design of how to deal with the hierarchical
nature of roles in the stats.
Besides these reasons and regardless, it is beneficial to look on
resource related stats per scheduling group, looking at resources
per user or role will not necessarily give insights since resources
are divided per sg and not role, so it can lead to false conclusions
if more than one role is attached to the same service level.
Tests:
unit tests (Dev, Debug)
validating the stats with monitor
* es/per_sg_stats/v6:
storage proxy: migrate to per scheduling group statistics
internalize storage proxy statistics metric registration
This commit builds on top of the introduced per scheduling group
statistics template and employs it for achieving a per scheduling
group statistics in storage_proxy.
Some of the statistics also had meaning as a global - per
shard one. Those are the ones for determining if to
throttle the write request. This was handled by creating a
global stats struct that will hold those stats and by changing
the stat update to also include the global one.
One point that complicated it is an already existing aggregation
over the per shard stats that now became a per scheduling group
per shard stats, converting the aggregation to a two-dimensional
aggregation.
One thing this commit doesn't handle is validating that an individual
statistic didn't "cross a scheduling group boundary", such validation
is possible but it can easily be added in the future. There is a
subtlety to doing so since if the operation did cross to other
scheduling group two connected statistics can lose balance
for example written bytes and completed write transactions.
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
The storage proxy statistics structure did not contain
a method for registering the statistics for metric
groups, instead, each user had to register some
of the metrics by itself. There is no real reason
for separating the metrics registration from
the statistics data. There is even less justification
for doing this only for part of the stats as is
the case for those statistics.
This commit internalize the metrics registration
in the storage_proxy stats structures.
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Avoid string copies when doubling quotes in the string
by counting them when scanning the input string and
reserving the required space when making the result std::string.
This showed a performance improvement of ~1.8% when
running the maybe_quote unit test in tight loop
(w/ the shorter strings only)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
gcc 10 tightened its C++ includes to no longer provide ssize_t,
so we must get it from a C header instead.
Message-Id: <20200129205912.21139-1-avi@scylladb.com>
gcc 10 requires a semicolon after every compound requirement,
as per the standard. Add missing semicolons where necessary.
Message-Id: <20200129205805.20928-1-avi@scylladb.com>
To increase modularity, making it easier to find what is where and
maintain.
The 'log' module (cdc/log.{hh,cc}) is responsible for updating CDC log
tables when base table writes are performed.
The 'generation' module (cdc/generation.{hh,cc}) handles stream
generation changes in response to topology change events.
cdc/metadata.{hh,cc} contains a helper class which holds the currently
used generation of streams. It is used by both aforementioned modules:
'log' queries it, while 'generation' updates it.
Snitch forms a class hierarchy which get_shard_count and
get_ignore_msb_bits ignore (their returned values only depend on the
gossiper's state).
Besides, these functions just don't belong there.
Snitch has nothing to do with shard_count or ignore_msb_bits.
Change the CDC code to use the global CDC stream generations.
The per-base-table CDC description table was removed. The code instead
uses cdc::metadata which is updated on gossip events.
The per-table description tables were replaced by a global description
table to be used by clients when searching for streams.
When a node learns that another node joins the cluster (or begins
the joining process, i.e. bootstrap), it will read the CDC generation
timestamp proposed by that node, use it to retrieve the generation from the
distributed generations table, and save it in its local generation queue
to be used for writing to the CDC log when its local clock crosses
the generation's timestamp.
The CDC generation is saved in the queue before tokens are saved in
token_metadata. This is important so that when the node becomes
a coordinator of a write, it will already have all the necessary
information required to generate a corresponding CDC log mutation.
After joining, nodes should keep gossiping their proposed stream
generation timestamps forever, until they learn about a newer timestamp,
in which case they'll start gossiping the new timestamp.
There is one case where a node won't gossip such any generation timestamp:
if it's upgrading from a non-CDC version.
In this situation we make one of the nodes begin the first generation.
The class stores a queue of CDC generations to be used for choosing
streams when writing to the CDC log.
This data structure will be updated on some gossip events (when a new node
joins the cluster and proposes a new generation of CDC streams).
In future commits this will be used by nodes learning about other nodes
entering NORMAL status. The joining node proposes a new generation of streams,
whose timestamp is gossiped by the node.
Generate a new generation of streams during bootstrap,
insert it into an internal distributed table for other nodes to read
and save its timestamp in the system.local table.
When restarting, read the generation timestamp from the system.local table.
Gossip the generation timestamp.
This will be used to persist CDC streams generation timestamp
proposed by a joining node in case the node crashes or restarts,
similarly to the way tokens are persisted.
The get_saved_cdc_streams_timestamp method retrieves the generation
timestamp from the system table. It will be used by a restarting
node.
The update_cdc_streams_timestamp method saves CDC stream
generation timestamp of the calling node in the system table.
A joining node will persist the timestamp before it proposes it to other
nodes.
The cdc_topology_description table will be used internally
by nodes to send new CDC stream generations to other nodes.
The cdc_description table is a user-facing table,
used to inform users about new sets of CDC streams.
Regenerate sstables and digests for schema_change_test.
We don't need to protect this change by a schema feature:
when a node creates these tables, it announces them
to all other nodes. If schema agreement happens before
this migration, all nodes will use a digest calculated
without these tables. If it happens after, then all nodes
will eventually know about these tables and use a digest
calculated with these tables.
cdc::topology_description describes a mapping of tokens to CDC streams.
The cdc::generate_topology_description function is given:
1. a set of tokens which split the token ring into token ranges (vnodes),
2. information on how each token range is distributed among its owning
node's shards
and tries to generate a set of CDC stream identifiers such that for each
shard and vnode pair there exists a stream whose token falls into this
vnode and is owned by this shard.
It then builds a cdc::topology_description which maps tokens to these
found stream identifiers, such that if token T is owned by shard S in
vnode V, it gets mapped to the stream identifier generated for (S, V).
This is a class that will be used for storing information
required to perform CDC operations, i.e. assignment of token ranges to
CDC streams.
It is serializable to bytes and will be stored
in such a form in a distributed table accessible
by all nodes.
Allows calculating the shard of the given token using custom values of
shard_count and sharding_ignore_msb (instead of the ones used by the
particular partitioner instance).
The function was reimplemented to solve the following issues.
The cutom implementation also improved its performance in
close to 19%
Using regex_match("[a-z][a-z0-9_]*") may cause stack overflow on long input strings
as found with the limits_test.py:TestLimits.max_key_length_test dtest.
std::regex_replace does not replace in-place so no doubling of
quotes was actually done.
Add unit test that reproduces the crash without this fix
and tests various string patterns for correctness.
Note that defining the regex with std::regex::optimize
still ended up with stack overflow.
Fixes#5671
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
test.py returns -1 on failure; exit() translates that to 255, which git
bisect interprets as a special exit code requiring manual intervention.
Change to return the more traditional 1 on failure, which git bisect
can interpret as a normal failure condition.
Message-Id: <20200130084950.4186598-1-avi@scylladb.com>
With this patch lua nil values are mapped to CQL null values instead
of producing an error.
Fixes#5667
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Since a data_value can contain a null value, returning bytes from
serialize() was losing information as it was mapping null to empty.
This also introduces a serialize_nonnull that still returns bytes, but
results in an internal error if called with a null value.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Null values are represented with dead cells and never included in a
result_set. To enforce that, this adds a non_null_data_value that
wraps a data_value and whose constructor calls on_internal_error if a
null data_value is passed.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Before this patch a null data_value would compare equal to any
data_value that serialized to an empty byte sequence.
With this patch null only compares equal to null.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This reverts commit 2ebd1463b2.
The test introduced by that commit was wrong, and in fact depended on
a bug in operator== for data_value. A followup patch fixes operator==,
so this reverts the broken commit first.
The reason it was broken was that it created a live cell with a null
data_value. In reality, null values are represented with dead cells.
For example, the sstable produced by
CREATE TABLE my_table (key int PRIMARY KEY, v1 int, v2 int) with compression = {'sstable_compression': ''};
INSERT INTO my_table (key, v1, v2) VALUES (1, 42, null);
Is
00 04 key_length
00 00 00 01 key
7f ff ff ff local_deletion_time
80 00 00 00 00 00 00 00 marked_for_delete_at
24 HAS_ALL_COLUMNS | HAS_TIMESTAMP
09 row_body_size
12 prev_unfiltered_size
00 delta_timestamp
08 USE_ROW_TIMESTAMP_MASK
00 00 00 2a value
0d USE_ROW_TIMESTAMP_MASK | HAS_EMPTY_VALUE_MASK | IS_DELETED_MASK
00 deletion time
01 END_OF_PARTITION
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Merged patch series from Piotr Sarna:
This series adds the following to alternator:
- TagResource request
- UntagResource request
- ListTagsOfResource request
- Honoring "Tags" parameter in CreateTable
It also provides more tests for above features and extended docs.
Tagging is backed by a schema extension, which is in turn backed
by entries in system_schema.tables.extensions map.
Tags are considered part of the schema, and in particular
they are updated via an equivalent of:
ALTER TABLE table WITH scylla_tags = {'key1':'v1', 'key2':'v2'}
Each tag change is therefore a schema change, which also means
that editing tags for the same table on different nodes may be
subject to races, until the schema agreement issues are resolved
in Scylla.
Fixes#5066
Tests: alternator-test(local, remote)
Piotr Sarna (6):
alternator,main: add tags schema extension
alternator: add creating values from string views
alternator: implement tagging
alternator: allow tagging on table creation
docs: add entries for alternator tags and arn
alternator-test: make test tables case sensitive
alternator-test/test_tag.py | 63 ++++++++++-
alternator-test/util.py | 2 +-
alternator/executor.cc | 191 ++++++++++++++++++++++++++++++++--
alternator/executor.hh | 3 +
alternator/rjson.cc | 4 +
alternator/rjson.hh | 1 +
alternator/server.cc | 3 +
alternator/tags_extension.hh | 52 +++++++++
docs/alternator/alternator.md | 14 ++-
main.cc | 5 +
10 files changed, 325 insertions(+), 13 deletions(-)
create mode 100644 alternator/tags_extension.hh
Attempting to apply timed-out writes is a wasted effort. The coordinator
have already given up on the write and reported it as failed to the
client. Any cycles spent on this write is a waste at this point.
We currently only check the timeout if the write is blocked on memory,
otherwise, if the system is not under pressure, we will happily apply
timed out writes. If the system is under pressure we will make it worse
by wasting cycles on processing a timed out write.
Prevent this by checking the timeout as early as possible in
`database::apply()` and `database::apply_counter_update()`.
This patch doesn't solve all our problems related to timed out writes.
They can still sit and accumulate in various queues without expiring, a
prominent example being the smp queues. It is however a good first step
towards reducing wasted effort spent on them.
Refs: #5055
Ref #5251
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200129093007.550250-1-bdenes@scylladb.com>
After 546556b71b we can have mixed writes into commitlog,
some do flush immediately some do not. If non flushing write races with
flushing one and becomes responsible for writing back its buffer into a
file flush will be skipped which will cause assert in batch_cycle() to
trigger since flush position will not be advanced. Fix that by checking
that flush was skipped and in this case flush explicitly our file
position.
Fixes#5670
Message-Id: <20200128145103.GI26048@scylladb.com>
During table creation, it's now possible to provide a 'Tags' parameter,
which will add tags to a newly created table. Note that creating a table
and tagging it is not atomic, so in case of failure it's possible to end
up with a created table, but without appropriate tags.
This commit comes with a test.
Message-Id: <00c2e202e9075d2c61e4ee5ba322ff4d5dbe718c.1579618972.git.sarna@scylladb.com>
A schema extension is introduced for alternator - tags.
This schema extension can be used to store arbitrary tags for a table,
in the form of a map<text, text>.
Updating tags for a table is equivalent to the following CQL query:
ALTER TABLE table WITH scylla_tags = {'key1':'v1', 'key2':'v2'}
The extension, as all other extensions, is backed by the entry
in the system_schema.tables table.
Add "const" attributes to `assignment_testable::test_assignment`
and `term::raw::prepare` methods. These should have been marked as
"const" even before the change but for some reason were missing
these qualifiers.
Mark other supplementary methods with "const" attributes as
necessary.
Tests: unit(dev, debug)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200127213215.494000-1-pa.solodovnikov@scylladb.com>
"
Throw an error in case we hit an invalid time UUID
rather than hitting an assert.
Fixes#5552
(Ref #5588 that was dequeued and fixed here)
Test: UUID_test, cql_query_test(debug)
"
* 'validate-time-uuid' of https://github.com/bhalevy/scylla:
cql3: abstract_function_selector: provide assignment_testable_source_context
test: cql_query_test: add time uuid validation tests
cql3: time_uuid_fcts: validate timestamp arg
cql3: make_max_timeuuid_fct: delete outdated FIXME comment
cql3: time_uuid_fcts: validate time UUID
test: UUID_test: add tests for time uuid
utils: UUID: create_time assert nanos_since validity
utils/UUID_gen: make_nanos_since
utils: UUID: assert UUID.is_timestamp
"
This PR makes named_value respect allowed_values and then use it to transition away from old deprecated RandomPartitioner and ByteOrderedPartitioner. Then it removes the code that's no longer used.
We want to remove deprecated partitioners because, on one hand, they lead to performance problems and hot nodes. Moreover, we're planning to unify the token representation which would allow per table partitioner support. That, in turn, is a feature helpful in multiple efforts like CDC, materialized views, secondary indexes and multi-tenancy.
tests: unit(dev)
"
* 'remove_deprecated_partitioners' of https://github.com/haaawk/scylla:
partitioners: remove random_partitioner
partitioners: Make it impossible to use RandomPartitioner
partitioners: remove byte_ordered_partitioner
partitioners: Make it impossible to use ByteOrderedPartitioner
partitioners: Remove leftovers of OrderPreservingPartitioner
i_partitioner.cc: stop including byte_ordered_partitioner.hh
i_partitioner.cc: stop including random_partitioner.hh
config: use allowed_values to verify named_value input
config: add operator<< for seed_provider_type
Since we now default to lld if present, and since lld is a faster
linker than either ld or gold, it makes sense to install it
as a dependency and to make it available as part of the frozen
toolchain.
"
Grab the lowest hanging fruits.
This patch-set makes three important changes:
* Consume the memory for I/O operations on tracked files, *before* they
are forwarded to the underlying file.
* Track memory consumed by buffers created for parsing in
`continuous_data_consumer`. As this is the basis for the data, index
and promoted index parsers, all three are covered now in this regard.
* Track the index file.
The remaining, not-so-low handing fruits in order of
gain/cost(performance) ratio:
* Track in-memory index lists.
* Track in-memory promoted index blocks.
* Track reader buffer memory.
Note that this ordering might change based on the workload and other
environmental factors.
Also included in this series is an infrastructure refactoring to make
tracking memory easier and involve including lighter headers, as well as
a manual test designed to allow testing and experimenting with the
effects of changes to the accuracy of the tracking of reader memory
consumption.
Refs: #4176
Refs: #2778
Tests: unit(dev), manual(sstable_scan_footprint_test)
The latter was run as:
build/dev/test/manual/sstable_scan_footprint_test -c1 -m2G --reads=4000
--read-concurrency=1 --logger-log-level test=trace --collect-stats
--stats-period-ms=20
This will trickle reads until the semaphore blocks, then wait until the
wait queue drains before sending new reads. This way we are not testing
the effectiveness of the pre-admission estimation (which is terribly
optimistic) and instead check that with slowly ramping up read load the
semaphore will block on memory preventing OOM.
This now runs to completion without a single `std::bad_alloc`. The read
concurrency semaphore allows between 15-30 reads, and is always blocked
on memory.
"
* 'more-accurate-reader-resource-tracking/v1' of ssh://github.com/denesb/scylla:
test/manual/sstable_scan_footprint_test: improve memory consumption diagnostics
tests/manual/sstable_scan_footprint_test: use the semaphore to determine read rate
tests/manual: Add test measuring memory demand of concurrent sstable reads
index_reader: make the index file tracked
sstables/continuous_data_consumer: track buffers used for parsing
reader_concurrency_semaphore: tracking_file_impl: consume memory speculatively
reader_concurrency_semaphore: bye reader_resource_tracker
treewide: replace reader_resource_tracer with reader_permit
reader_permit: expose make_tracked_temporary_buffer()
reader_permit: introduce make_tracked_file()
reader_permit: introduce memory_units
reader_concurrency_semaphore: mv reader_resources and reader_permit to reader_permit.hh
reader_concurrency_semaphore: reader_permit: make it a value type
reader_concurrency_semaphore: s/resources/reader_resources/
reader_concurrency_semaphore::reader_permit: move methods out-of-line
In Scylla all query processing activity should run under the
"statement" scheduling group. The scheduling group is
important for maintaining the balance between background and
foreground tasks in Scylla.
Testing: In order to test the correctness of the patch.
First, the following assert was inserted before any call
to one of the executor functions in the http route:
assert(current_scheduling_group().name() == "statement"
Then all alternator tests ran and passed.
The second stage was to change the name so the assert
will fail:
assert(current_scheduling_group().name() == "no-statement"
And ran the tests again - validating that Scylla coredumps.
The asserts were then removed.
Fixes#5008
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <20200127154341.10020-1-eliransin@scylladb.com>
"
There seem to be two problems with handling snapshot API -- one
on start and the other one on stop. Here's the set that addresses
both.
The fix moved snapshot API registration later in time that required
Amnon's ACK. Now we have it :) so -- the rebase and resend.
Tests: unit(dev), start-stop
"
* 'br-snapshot-bugs-2' of https://github.com/xemul/scylla:
snapshot: Pass requests through gate
api: Register snapshot API later
api: Unwrap wrap_ks_cf
"
With these changes and a binutils compiled with
--enable-deterministic-archives, the only difference I get in the
build directory if I build scylla twice from scratch are:
* The various CMakeError.log because they have temporary file names.
* The various CMakeOutput.log for the same reason.
* .ninja_log and .ninja_deps. I am not sure what the contents are.
"
* 'espindola/fix-determinism' of https://github.com/espindola/scylla:
build: remove timestamps from then antlr output
build: Make the output of idl-compiler deterministic
This depends on the patch
mk: avoid combining -r and -export-dynamic linker options
being added to dpdk.
I benchmarked this on top of my patches to get a reproducible build. I
first compiled with ccache, deleted the build directory and recompiled
so that all the "gcc -c" invocations were served by ccache. The times
of the second "ninja release" invocations were:
lld:
ninja release 155.68s user 71.89s system 2077% cpu 10.953 total
gold:
ninja release 953.79s user 254.71s system 2533% cpu 47.699 total
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200127171516.26268-1-espindola@scylladb.com>
> memory: add scoped_heap_profiling
> build: add switch to enable heap profiling support
> io_tester: do not abort on end of test
> resource: clean up cgroups version determination.
> prometheus: Silence a bogus gcc warning in http server
> Update dpdk submodule
> resource: Support cgroups v2
> net: Don't use variable length arrays
> core/memory.hh: document set_heap_profiling_enabled()
> Revert "net: Don't use variable length arrays"
> cmake: fix pkgconfig boost deps
> thread: Avoid confusing comment by switching value
> net: posix-stack: fix allocator in ap listening sockets
> net: posix-stack: fix passing allocator to new sockets
> stall_detector: Add a counter for stall detector report
> Merge "Don't use variable length arrays" from Rafael
> treewide: fix minor issues reported by clang
> thread: Call mprotect in make_stack
> thread: Always allocate stack with aligned_alloc
> build: Make SEASTAR_THREAD_STACK_GUARDS private
> thread: Move code out of a header
Merged patch series from Konstantin Osipov:
This series sets cql_repl core count to 1 and adds LWT
unit tests.
test.py: invoke cql_repl with smp=1
lwt: add lightweight transactions unit tests
Merged patch series from Piotr Sarna:
In order to minimize the usage of throws and catches in code paths
that are potentially hot, these paths instead return appropriate errors
directly.
The server layer is still able to catch and translate errors,
but the preferred way is to return api_error directly in places
that may be performance-sensitive.
Tests: alternator-test(local)
Fixes#5472
Piotr Sarna (3):
alternator: change request return type to variant<value, error>
alternator: elide throwing in condition checks
alternator: replace top-level throws with returns in executor
alternator/executor.hh | 28 ++++----
alternator/server.hh | 4 +-
alternator/executor.cc | 141 +++++++++++++++++++++--------------------
alternator/server.cc | 44 ++++++++-----
4 files changed, 117 insertions(+), 100 deletions(-)
Exclude cql_repl from the list of tests, since it's not a test.
Build it as a separate app. Do not strip, so that any CQL test
failure is easy to debug without a rebuild.
All test-related targets are converted from lists to sets to avoid
quadratic lookup cost in the check inside the loop which creates the
ninja file.
Currently --ami does not check instance types, creates invalid
io_properties.yaml on unsupported instance types.
It actually won't occur on AMI startup, since scylla_ami_setup only
invoke scylla_io_setup --ami when the instance is supported, so we don't
get the issue on startup, but we still get when we run scylla_io_setup
manually.
It's better to check instance type on scylla_io_setup, too.
Refs #5438
Conditional updates inform the user that the condition is not met
by returning an error. An initial implementation was based on rethrowing
these errors, but returning them directly is considered better
for performance.
This will prevent contention in case of parallel updates of the same row
by the same coordinator. The patch does it by introducing a new per key
lock map and taking it before running PAXOS protocol (either for write
of for read).
Message-Id: <20200117101228.GA14816@scylladb.com>
In order to minimize the use of exceptions during normal operations,
each request handler is now able to return either a proper JSON value,
or an instance of api_error, which indicates that something went wrong,
but without having to throw, catch and rethrow C++ exceptions.
This is especially important for conditional updates, since it's
expected to be common to return ConditionalCheckFailedException.
Message-Id: <d8996a0a270eb0d9db8fdcfb7046930b96781e69.1579515640.git.sarna@scylladb.com>
Docker restricts the number of processes in a container to some
limit it calculates. This limit turns out to be too low on large
machines, since we run multiple links in parallel, and each link
runs many threads.
Remove the limit by specifying --pids-limit -1. Since dbuild is
meant to provide a build environment, not a security barrier,
this is okay (the container is still restricted by host limits).
I checked that --pids-limit is supported by old versions of
docker and by podman.
Fixes#5651.
Message-Id: <20200127090807.3528561-1-avi@scylladb.com>
"
Final set of changes for full scan metrics.
- allow filtering
- full scan (Note: non-system tables only)
- full scan without BYPASS CACHE option
- tests for all metrics (bypass cache, allow filtering, full scan)
- works with prepared statements (tested, too)
"
* 'as_full_scan_metrics' of https://github.com/alecco/scylla:
Range scan query counter
Counter of queries doing full scan.
ALLOW FILTERING query counter
This test is all about tracking measured memory consumption vs. real
memory consumption. To make this easier add additional diagnostics:
* enable seastar heap profiler for the duration of the reads (seastar
has to be compiled with `-DSEASTAR_HEAPPROF`).
* Add a stats collector, which periodically collects stats such as
non-LSA free/used memory, LSA free/used memory and memory tracked by
the reader concurrency semaphore. These stats are written to a `.csv`
file, allowing importing them into a spreadsheet and processing them.
Currently the test fires the configured amount of reads at once. This is
somewhat restricting in the number of testable scenarios. For example,
it doesn't allow one to see if the semaphore correctly tracks the memory
consumption of existing reads, by firing new reads after a while.
Replace this algorithm by one which fires reads with a configured
concurrency, then waits for the semaphore's queue (if any) to drain,
before firing new reads. The test can now be configured with the total
amount of reads to fire, and with the read-concurrency, i.e. the number
of reads to fire at once in each iteration.
This allows for much greater flexibility in the different test
scenarios. The previous behaviour can still be achieved by configuring
a concurrency of 100.
This patch also adds better error handling. Reads are aborted on the
first error and errors are caught and not allowed to bubble up past the
test's main function and are logged instead.
Extensive logging is also added to be able to monitor the system while
the test is running.
Allow manual experimentation with the effectiveness of the accuracy of
the tracking of the resource consumption of readers, and hence the
system's ability to prevent overload and the dreaded `std::bad_alloc`.
This patch was originally developed by
Tomasz Grabiec <tgrabiec@scylladb.com>, I only adapted it to compile and
link on current master.
Based on heap profiling, buffers used for storing half-parsed fields are
a major contributor to the overall memory consumption of reads. This
memory was completely "under the radar" before. Track it by using
tracked `temporary_buffer` instances everywhere in
`continuous_data_consumer`. As `continuous_data_consumer` is the basis
for parsing all index and data files, adding the tracing here
automatically covers all data, index and promoted index parsing.
I'm almost convinced that there is a better place to store the `permit`
then the three places now, but so far I was unable to completely
decipher the our data/index file parsing class hierarchy.
Consume the memory before even submitting the I/O to the underlying
`file` object. This is in line with the underlying `file` object
allocating the buffer before it forwards the I/O request to the kernel.
This extends the "visibility" over the memory consumed by I/O greatly,
as it turns out buffers spend most time alive waiting for the I/O to
complete and are parsed shortly afterwards.
The former was never really more than a reader_permit with one
additional method. Currently using it doesn't even save one from any
includes. Now that readers will be using reader_permit we would have to
pass down both to mutation_source. Instead get rid of
reader_resource_tracker and just use reader_permit. Instead of making it
a last and optional parameter that is easy to ignore, make it a
first class parameter, right after schema, to signify that permits are
now a prominent part of the reader API.
This -- mostly mechanical -- patch essentially refactors mutation_source
to ask for the reader_permit instead of reader_resource_tracking and
updates all usage sites.
Previously `tracking_file_impl::make_tracked_buf()`. In the next patches
we plan on using this outside `tracking_file_impl`, so make it public
and templatize on the char type.
Similar to `seastar::semaphore_units`, this allows consuming and
releasing memory via an RAII object. In addition to that, it also allows
tracking changing values. This feature was designed to be used for
tracking the ever changing memory consumption of the buffers of
`flat_mutation_reader`:s.
This is now the only supported way of consuming memory from a permit.
In the next patches we will replace `reader_resource_tracker` and have
code use the `reader_permit` directly. In subsequent patches, the
`reader_permit` will get even more usages as we attempt to make the
tracking of reader resource more accurate by tracking more parts of it.
So the grand plan is that the current `reader_concurrency_semaphore.hh`
is split into two headers:
* `reader_concurrency_semaphore.hh` - containing the semaphore proper.
* `reader_permit.hh` - a very lightweight header, to be used by
components which only want to track various parts of the resource
consumption of reads.
Currently `reader_permit` is passed around as
`lw_shared_ptr<reader_permit>`, which is clunky to write and use and is
also an unnecessary leak of details on how permit ownership is managed.
Make `reader_permit` a simple value type, making it a little bit easier
and safer to use.
In the next patches we will get rid of `reader_resource_tracker` and
instead have code use the permit instance directly, so this small
improvement in usability will go a long way towards preventing eye sore.
In preparation for making the reader_permit a top-level class, and
moving it to another file. It is also good practice to define
non-performance critical methods out-of-line to reduce header bloat.
These unit tests cover all CQL aspects of lightweight transactions,
such as grammar, null semantics, batch semantics, result set
format, and so on.
For now, comment out unicode tests: test output depends
on libjsoncpp version in use.
When the scylla process is stopped no code waits for
current snapshot operations to finish. Also, the API
server is not stopped either, so new snapshot requests
can creep into.
In seastar there's a useful abstraction to address both.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In storage_service's snapshot code there are checks for
_operation_mode being _not_ JOINING to proceed. The intention
is apparently to allow for snapshots only after the cluster
join. However, here's how the start-up code looks like
- _operation_mode = STARTING in storage_service::constructor
- snapshot API registered in api::set_server_storage_service
- _operation_mode = JOINING in storage_service::join_token_ring
So in between steps 2 and 3 snapshots can be taken.
Although there's a quick and simple fix for that (check for the
_operation_mode to be not STARTING either) I think it's better
to register the snapshot API later instead. This will help
greatly to de-bload the storage_service, in particular -- to
incapsulate the _operation_mode properly.
Note, though the check for _operation_mode is made only for
taking snapshot, I move all snapshot ops registration to the
later phase.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is preparation for the next patch -- the lambda in
question (and the used type) will be needed in two
functions, so make the lambda a "real" function.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Make sure that the timestamp argument does not overflow
60 bits when converted to units of 100 nanos since
epoch, like with writetime() that returns microseconds since epoch
in contrast to other time functions like
unixtimestampof that return millis since epoch.
Fixes#5552
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Safely convert millis to "nanos_since" (number of 100
nanseconds since START_EPOCH) while type casting to uint64_t
to avoid possible int overflow.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We need to add '~' to handle rcX version correctly on Debian variants
(merged at ae33e9f), but when we moved to relocated package we mistakenly
dropped the code, so add the code again.
Fixes#5641
The method view_info::partition_ranges() is unused.
Also drop the now-dead _partition_ranges data member.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
This is a fourth iteration of the patch series adding LWT usage
(instead of the old naive - and wrong - read before write) to
Alternator, as well as full support for the ConditionExpression
syntax for conditional updates.
Changes in v4:
* Rebased to most recent master
* Replaced 3 booleans which had 2^3 = 8 theoretical combinations,
by just 4 options in enum write_isolation:
FORBID_RMW, LWT_ALWAYS, LWT_RMW_ONLY, UNSAFE_RMW
The four options are described in details comments.
* Fix reversed assertion in FORBID_RMW case.
* Two new metrics: write_using_lwt and shard_bounce_for_lwt.
* Fail boot if alternator is enabled, but LWT isn't.
* Add information about enabling LWT in docs/alternator/alternator.md
* nyh/v4-lwt:
alternator: add support for ConditionExpression
alternator: reimplement read-modify-write operations using LWT
alternator: make "executor" a peering_sharded_service
Implements a counter of executions of SELECT queries with ALLOW FILTERING option.
In scope of #5209
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Previous patch makes it impossible to configure Scylla
with RandomPartitioner so this code is effectively dead now.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
RandomPartitioner has been deprecated for 2.5 year.
Now we drop the support for it. There are two reasons for this.
First, this partitioner can lead to uneven distribution of partitions
among the nodes in the cluster which leads to hot nodes.
Second, we're planning to unify the representation of tokens and
fix it as int64_t. RandomPartitioner does not comply with this.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Previous patch makes it impossible to configure Scylla
with ByteOrderedPartitioner so this code is effectively dead now.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
ByteOrderedPartitioner has been deprecated for 2.5 year.
Now we drop the support for it. There are two reasons for this.
First, this partitioner can lead to uneven distribution of partitions
among the nodes in the cluster which leads to hot nodes.
Second, we're planning to unify the representation of tokens and
fix it as int64_t. ByteOrderPartitioner does not comply with this.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
OrderPreservingPartitioner seems to be long gone and not supported
so remove all the places it's still mentioned.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Even though we configure the set of accepted values for
some config flags, named_value ignore them.
This patch implements the checks that verify flag is
not set to the value that's not on the list.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This patch adds support for the ConditionExpression parameter of the
item-writing operations in Alternator: PutItem, UpdateItem and DeleteItem.
We already supported conditional updates/put/delete using the "Expected"
parameter. The ConditionExpression parameter implemented here provides a
very similar feature, using a different - and also newer and more powerful -
syntax.
The implementation here reuses much of our existing expression-parsing
infrastructure. Unsurprisingly, ConditionExpression's syntax has much in
common with UpdateExpression which we already support) and also many of the
comparison functions already implemented for "Expected". However, it's still
quite a bit of new code, because of the many different comparisons, functions,
and syntax variations we need to support.
This patch also expands alternator-test/test_condition_expression.py with
a few additional corner cases discovered during the development of this
patch.
Almost all of the tests for this feature (35 out of 39) now pass.
Two tests still fail because we don't yet support nested attributes (this
is a missing feature across Alternator), and two tests fail because of minor
ideosyncracies in DynamoDB's error path that we chose not to duplicate
yet (but still remember the difference in the form of an xfailing test).
Fixes#5035
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In this patch, we re-implement the three read-modify-write operations -
PutItem, UpdateItem, DeleteItem. All three operations may need to read the
item before writing it to support conditional updates (the "Expected"
parameter) and UpdateItem may also need the previous item's value for
its update expression (e.g., a user may ask to "set a=a+1" or "set a=b").
Before this patch, the implementation of RMW operations simply did a read,
and then a write - without any attempt to protect concurrent operations.
In this patch, Scylla's LWT mechanism (storage_proxy::cas()) is used
instead, to ensure that concurrent update operations are correctly
isolated even if they are conditional. This means that Alternator now
requires the experimental LWT feature to be enabled (and refuses to
boot if it isn't).
The version presented here is configured to always use LWT for *every*
write, regardless of whether it has a condition or not. So it will
will significantly slow down write-only workloads like YCSB. But the code
in this patch actually includes three other modes, which can be chosen by
setting an enum constant in the code. In the future we will want to let the
user configure this mode, globally, per table or per attribute.
Note that read requests are NOT modified, and work exactly as they did
before: i.e., strongly-consistent reads are done using a normal
CL=LOCAL_QUORUM read - not via LWT. I believe this is good enough given
Dynamo's guarantees, and critical for our read performance.
Also note that patch doesn't yet fix the BatchWriteItem operation.
Although BatchWriteItem does not support any RMW operations - just pure
writes - we may still need to do those pure writes using LWT. This
should be fixed in a follow-up patch.
Unfortunately, this patch involves a large amount of code movement and
reorganization, because:
1. The cas operation requires each operation to be made into an object,
with a separate apply() function, forcing a lot of code to move.
2. Moreover, we need to do this for three different operations (PutItem,
UpdateItem, DeleteItem) so to avoid massive code duplication, I had
to move some common code.
3. The cas operation also forced us to change some of the utility functions'
APIs.
The end result is that this patch focuses more on a compact and
understandable *end result* than it does on an easy to understand *patch*,
so reviewers - sorry about that.
All alternator-test/ tests pass with this patch (and also with all of the
different optional modes enabled). However, other than that, I did not yet
do any real isolation tests (are concurrent operations really isolated
correctly? or is LWT just faking it? :-) ), performance tests or stress
tests - and I'll definitely need to do those as well.
Fixes#5054
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Alternator uses a sharded<executor> for handling execution of Alternator
requests on different shards. In this patch we make executor a subclass of
peering_sharded_service, to allow one of these executors to run an exector
method on a different shard: Any one of the shard-local executor instances
can call container() to get the full sharded<executor>.
We will need this capability later, when we need to bounce requests between
shards because of requirements of the storage_proxy::cas (LWT) code.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Following patch will start checking allowed_values
in named_value and print errors for wrong values.
This will require all the types used with named_value
to have operator<< implemented. seed_provider_type
is one such type.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
The output of antrl always has the timestamp of when it was
created. This expands the existing sed hack to remove that too.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
If at any point during the topological sort we had more than one node
with zero dependencies, the order they were printed was not
deterministic.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
"
This series refactors the code used by migration_notifier and gossiper
into an atomic_vector type.
"
* 'espindola/gossiper_atomic_vector' of https://github.com/espindola/scylla:
gossiper: Store subscribers in an atomic_vector
load_broadcaster: Unregister from load_broadcaster::stop_broadcasting
repair: add row_level::stop()
locator: Return future from i_endpoint_snitch::reload_gossiper_state
service: Refactor code into a atomic_vector class
migration_manager: Fix typo
load_meter: Use a shared_ptr to store a load_broadcaster
The new guarantees are a bit better IMHO:
Once a subscriber is removed, it is never notified. This was not true
in the old code since it would iterate over a copy that would still
have that subscriber.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This templates the code for listener_vector, renames it to
atomic_vector and moves it to the utils directory.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
load_broadcaster::stop_broadcasting uses shared_from_this(). Since
that is the only reference that the produced shared_ptr knows of, it
is deleted immediately. Fix that by also using a shared_ptr in
load_meter.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
* seastar afc46681...147d50b1 (6):
> perftune.py: Use safe_load() for fix arbitrary code execution
Fixes#5630
> clang: current_exception_as_future must be in namespaced
> tests: add an expected failures version of thread fixture
> Enable stack guards in Dev builds
> net: posix: Introduce load_balancing_algorithm::fixed
> stream: Move _next from subscription to stream
`parsed_statement::get_bound_variables` is assumed to always
return a nonnull pointer to `variable_specifications` instance.
In this case using a pointer is superfluous and can be safely
replaced by a plain reference.
Also add a default ctor and a utility method `set_bound_variables`
to the `variable_specifications` class to actually reset the
contents of the class instance.
Tests: unit(dev, debug)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200120195839.164296-1-pa.solodovnikov@scylladb.com>
"
As a followup to 0bde590
This series implements suggestions from @avikivity and @espindola
It simplifies the template definitions for accumulator_for,
adds some debug logging for the overflow values,
and adds unit tests for float and double sum overflow.
Test: unit(dev),
paging_test:TestPagingWithIndexingAndAggregation.test_filter_{indexed,non_indexed,pk}_column(dev)
"
* 'simplify-sum-overflow' of https://github.com/bhalevy/scylla:
test: cql_query_test: test float/double sum overflow
cql3: aggregate_fcts: simplify accumulator_for template definitions
It is the only place where get_changed_ranges_for_leaving is not running
inside a thread. Preparing patch to futurize get_changed_ranges_for_leaving.
Refs: #5495
A mistake in handling legacy checks for special 'idx_token' column
resulted in not recognizing materialized views backing secondary
indexes properly. The mistake is really a typo, but with bad
consequences - instead of checking the view schema for being an index,
we asked for the base schema, which is definitely not an index of
itself.
Branches 3.1,3.2 (asap)
Fixes#5621Fixes#4744
Before this patch the iterations over migration_notifier::_listeners
could race with listeners being added and removed.
The addition side is not modified, since it is common to add a
listener during construction and it would require a fairly big
refactoring. Instead, the iteration is modified to use indexes instead
of iterators so that it is still valid if another listener is added
concurrently.
For removal we use a rw lock, since removing an element invalidates
indexes too. There are only a few places that needed refactoring to
handle unregister_listener returning a future<>, so this is probably
OK.
Fixes#5541.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200120192819.136305-1-espindola@scylladb.com>
"
This series fixes all abandoned failed futures in cql_query_test and
starts running it with --fail-on-abandoned-failed-futures to avoid
regressions.
"
* 'espindola/fix-abandoned-failed-futures' of https://github.com/espindola/scylla:
cql_query_test: Avoid new abandoned failed futures
cql_query_test: Explicitly ignore a failed future
cql_query_test: Remove duplicated do_with_cql_env_thread
cql_query_test: Fix cql and values in test_int_sum_with_cast
Now that cql_query_test has no abandoned failed futures, run it with
--fail-on-abandoned-failed-futures to avoid regressions.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This test is not running because of the double
do_with_cql_env_thread. Fix it before we remove the extra
do_with_cql_env_thread.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
* seastar 3f3e117de3...afc46681e5 (7):
> json: add move assignment to json_return_type
> net: do not check if an unsigned variabe is less than 0
> stack: add virtual destructor definition for class w/ virtual functions
> future,json: add ":" at end of concept definition
> Fixing a bug in the handling of abort_accept()
> install-dependencies.sh: improve arch detect
> metrics: Avoid a copy during unregistration
We have numerous tests that rely on the seastar alloc failure injection
infrastructure to test the exception safety of different components.
These tests are essentially useless when the said infrastructure is
not enabled, which is currently the case for all build modes, allowing
bugs to sneak in undetected.
Enable the allocation failure injection infrastructure for the dev and
debug modes. Sanitize is excluded as it produces some (suspected false
positive) failures and is not run in gating either currently.
Tests: unit(dev, debug)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200117104747.748866-1-bdenes@scylladb.com>
Merged patch series from Piotr Sarna:
This miniseries adds two simple prerequisites for implementing tagging:
1. A table is able to generate its Arn identifier
2. Simple tests for TagResource, UntagResource, ListTagsOfResource
In general, tags should be stored in table metadata - either by
expanding the schema of an existing schema table, e.g. scylla_tables,
or by providing another meta-table - e.g. system_schema.alternator_tables,
which stores alternator-specific metadata, like tags.
Refs #5066
Tests: alternator-test(local, remote)
Piotr Sarna (2):
alternator: add Arn support for tables
alternator-test: add basic tests for tags
alternator-test/test_describe_table.py | 1 -
alternator-test/test_tag.py | 88 ++++++++++++++++++++++++++
alternator/executor.cc | 5 ++
3 files changed, 93 insertions(+), 1 deletion(-)
create mode 100644 alternator-test/test_tag.py
Several API-s, e.g. TagResource, UntagResource and ListTagsOfResource
rely on identifying tables by their "Arn". According to the docs,
an Arn should uniquely identify a resource, so it's implemented as:
arn:KEYSPACE_NAME:TABLE_NAME
which is a minimal set of information that uniquely identifies a table
in Scylla. The `arn:` prefix is needed for compatibility purposes.
This commit adds a simple function for generating the Arn string,
and also includes it in DescribeTable result under the TableArn attribute.
Refs #5066
Add a name parameter to the validator, so that the validator can be
identified in log messages. Schema identity information is added to the
name automatically. This should help pinpoint the problematic place
where validation failed.
Although at the moment we have a single validator, it still benefits
from having a name, as we can now include in it the name of the sstable
being written and hence trace the source of the bad data.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200117150616.895878-1-bdenes@scylladb.com>
Update Packer to 1.5.1.
Needed to rename clean_ami_name -> clean_resource_name on scylla.json, since
the variable name had been changed.
Also fixed checksum verification code, trimmed unwanted extra strings
from sha256sum output.
Since we need to run relocate_python_scripts.py on install time,
python script may not able to run on various different environment.
So convert the script to bash script, merge it into install.sh.
When a node does not have gossip STATUS application_state, we currently
use an empty string to present such state in get_gossip_status.
It is better to use an explicit "UNKNOWN" to present it. It makes the
log easier to understand when the status is unknown.
Before:
'gossip - InetAddress n2 is now UP, status ='
After:
'gossip - InetAddress n2 is now UP, status = UNKNOWN'
This patch is safe because the STATUS_UNKNOWN is never sent over the
cluster. So the presentation is only internal to the node.
Fixes#5520
The atomic_cell pretty printers use a mix of commas and semicolons.
This change makes them use commas everywhere, for consistency.
Message-Id: <20200116133327.2610280-1-avi@scylladb.com>
Merged pull request https://github.com/scylladb/scylla/pull/5567
from Calle Wilund:
Fixes#5314
Instead of tying CDC handling into cql statement objects, this patch set
moves it to storage proxy, i.e. shared code for mutating stuff. This means
we automatically handle cdc for code paths outside cql (i.e. alternator).
It also adds api handling (though initially inefficient) for batch statements.
CDC is tied into storage proxy by giving the former a ref to the latter (per
shard). Initially this is not a constructor parameter, because right now we
have chicken and egg issues here. Hopefully, Pavels refactoring of migration
manager and notifications will untie these and this relationship can become
nicer.
The actual augmentation can (as stated above) be made much more efficient.
Hopefully, the stream management refactoring will deal with expensive stream
lookup, and eventually, we can maybe coalesce pre-image selects for batches.
However, that is left as an exercise for when deemed needed.
The augmentation API has an optional return value for a "post-image handler"
to be used iff returned after mutation call is finished (and successful).
It is not yet actually invoked from storage_proxy, but it is at least in the
call chain.
The set make dependencies between mm and other services cleaner,
in particular, after the set:
- the query processor no longer needs migration manager
(which doesn't need query processor either)
- the database no longer needs migration manager, thus the mutual
dependency between these two is dropped, only migration manager
-> database is left
- the migration manager -> storage_service dependency is relaxed,
one more patchset will be needed to remove it, thus dropping one
more mutual dependency between them, only the storage_service
-> migration manager will be left
- the migration manager is stopped on drain, but several more
services need it on stop, thus causing use after free problems,
in particular there's a caught bug when view builder crashes
when unregistering from notifier list on stop. Fixed.
Tests: unit(dev)
Fixes: #5404
Enabling asan enables a few cleanup optimizations in gcc. The net
result is that using
-fsanitize=address -fno-sanitize-address-use-after-scope
Produces code that uses a lot less stack than if the file is compiled
with just -O0.
This patch adds -O1 in addition to
-fno-sanitize-address-use-after-scope to protect the unfortunate
developer that decides to build in dev mode with --cflags='-O0 -g'.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200116012318.361732-2-espindola@scylladb.com>
It is sometimes convenient to build with flags that don't match any
existing mode.
Recently I was tracking a bug that would not reproduce with debug, but
reproduced with dev, so I tried debugging the result of
./configure.py --cflags="-O0 -g"
While the binary had debug info, it still had optimizations because
configure.py put the mode flags after the user flags (-O0 -O1). This
patch flips the order (-O1 -O0) so that the flags passed in the
command line win.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200116012318.361732-1-espindola@scylladb.com>
CQL transport code relies on an exception's C++ type to create correct
reply, but in lwt we converted some mutation_timeout exceptions to more
generic request_timeout while forwarding them which broke the protocol.
Do not drop type information.
Fixes#5598.
Message-Id: <20200115180313.GQ9084@scylladb.com>
Murmur3 is the default partitioner.
ByteOrder and Random are the deprecated ones
and should be mentioned in the description.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Merged pull request https://github.com/scylladb/scylla/pull/5294 from
Amnon Heiman:
To use a snapshot we need a schema file that is similar to the result of
running cql DESCRIBE command.
The DESCRIBE is implemented in the cql driver so the functionality needs
to be re-implemented inside scylla.
This series adds a describe method to the schema file and use it when doing
a snapshot.
There are different approach of how to handle materialize views and
secondary indexes.
This implementation creates each schema.cql file in its own relevant
directory, so the schema for materializing view, for example, will be
placed in the snapshot directory of the table of that view.
Fixes#4192
This commit makes most sleeps in gossip.cc abortable. It is now possible
to quickly shut down a node during startup, most notably during the
phase while it waits for gossip to settle.
This reduces network traffic and eliminates time for installation when
building packages from the frozen toolchain, as well as isolating the
build from updates to those dependencies which may cause breakage.
This patch set adds support for CQL tests to test.py,
as well as many other improvements:
* --name is now a positional argument
* test output is preserved in testlog/${mode}
* concise output format
* better color support
* arbitrary number of test suites
* per-suite yaml-based configuration
* options --jenkins and --xunit are removed and xml
files are generated for all runs
A simple driver is written in C++ to read CQL for
standard input, execute in embedded mode and produce output.
The patch is checked with BYO.
Reviewed-by: Dejan Mircevski <dejan@scylladb.com>
* 'test.py' of github.com:/scylladb/scylla-dev: (39 commits)
test.py: introduce BoostTest and virtualize custom boost arguments
test.py: sort tests within a suite, and sort suites
test.py: add a basic CQL test
test.py: add CQL .reject files to gitignore
test.py: print a colored unidiff in case of test failure
test.py: add CqlTestSuite to run CQL tests
test.py: initial import of CQL test driver, cql_repl
test.py: remove custom colors and define a color palette
test.py: split test output per test mode
test.py: remove tests_to_run
test.py: virtualize Test.run(), to introduce CqlTest.Run next
test.py: virtualize test search pattern per TestSuite
test.py: virtualize write_xunit_report()
test.py: ensure print_summary() is agnostic of test type
test.py: tidy up print_summary()
test.py: introduce base class Test for CQL and Unit tests
test.py: move the default arguments handling to UnitTestSuite
test.py: move custom unit test command line arguments to suite.yaml
test.py: move command line argument processing to UnitTestSuite
test.py: introduce add_test(), which is suite-specific
...
The Ubuntu-based Docker image uses Scylla 1.0 and has not been updated
since 2017. Let's remove it as unmaintained.
Message-Id: <20200115102405.23567-1-penberg@scylladb.com>
"
Currently commitlog supports two modes of operation. First is 'periodic'
mode where all commitlog writes are ready the moment they are stored in
a memory buffer and the memory buffer is flushed to a storage periodically.
Second is a 'batch' mode where each write is flushed as soon as possible
(after previous flush completed) and writes are only ready after they
are flushed.
The first option is not very durable, the second is not very efficient.
This series adds an option to mark some writes as "more durable" in
periodic mode meaning that they will be flushed immediately and reported
complete only after the flush is complete (flushing a durable write also
flushes all writes that came before it). It also changes paxos to use
those durable writes to store paxos state.
Note that strictly speaking the last patch is not needed since after
writing to an actual table the code updates paxos table and the later
uses durable writes that make sure all previous writes are flushed. Given
that both writes supposed to run on the same shard this should be enough.
But it feels right to make base table writes durable as well.
"
* 'gleb/commilog_sync_v4' of github.com:scylladb/seastar-dev:
paxos: immediately sync commitlog entries for writes made by paxos learn stage
paxos: mark paxos table schema as "always sync"
schema: allow schema to be marked as 'always sync to commitlog'
commitlog: add test for per entry sync mode
database: pass sync flag from db::apply function to the commitlog
commitlog: add sync method to entry_writer
Before this patch result_set_assertions was handling both null values
and missing values in the same way.
This patch changes the handling of missing values so that now checking
for a null value is not the same as checking for a value not being
present.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200114184116.75546-1-espindola@scylladb.com>
3ec889816 changed cell::make_collection() to take different code paths
depending whether its `data` argument is nothrow copyable/movable or
not. In case it is not, it is wrapped in a view to make it so (see the
above mentioned commit for a full explanation), relying on the methods
pre-existing requirement for callers to keep `data` alive while the
created writer is in use.
On closer look however it turns out that this requirement is neither
respected, nor enforced, at least not on the code level. The real
requirement is that the underlying data represented by `data` is kept
alive. If `data` is a view, it is not expected to be kept alive and
callers don't, it is instead copied into `make_collection()`.
Non-views however *are* expected to be kept alive. This makes the API
error prone.
To avoid any future errors due to this ambiguity, require all `data`
arguments to be nothrow copyable and movable. Callers are now required
to pass views of nonconforming objects.
This patch is a usability improvement and is not fixing a bug. The
current code works as-is because it happens to conform to the underlying
requirements.
Refs: #5575
Refs: #5341
Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200115084520.206947-1-bdenes@scylladb.com>
This patch adds tests for the describe method.
test_describe_simple_schema tests regular tables.
test_describe_view_schema tests view and index.
Each test, create a table, find the schema, call the describe method and
compare the results to the string that was used to create the table.
The view tests also verify that adding an index or view does not change
the base table.
When comparing results, leading and trailing white spaces are ignored
and all combination of whitespaces and new lines are treated equaly.
Additional tests may be added at a future phase if required.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
When creating a snapshot we need to add a schema.cql file in the
snapshot directory that describes the table in that snapshot.
This patch adds the file using the schema describe method.
get_snapshot_details and manifest_json_filter were modified to ignore
the schema.cql file.
Fixes#4192
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch adds a describe method to a table schema.
It acts similar to a DESCRIBE cql command that is implemented in a CQL
driver.
The method supports tables, secondary indexes local indexes and
materialize views.
relates to: #4192
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
index_name_from_table_name is a reverse of index_table_name,
it gets a table name that was generated for an index and return the name
of the index that generated that table.
Relates to #4192
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
The factory is purely a state-less thing, there is no difference what
instance of it to use, so we may omit referencing the storage_service
in passive_announce
This is 2nd simple migration_manager -> storage_service link to cut
(more to come later).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are several places where migration_manager needs storage_service
reference to get the database from, thus forming the mutual dependency
between them. This is the simplest case where the migration_manager
link to the storage_service can be cut -- the databse reference can be
obtained from storage_proxy instead.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is the last place where database code needs the migration_manager
instance to be alive, so now the mutual dependency between these two
is gone, only the migration_manager needs the database, but not the
vice-versa.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The storage_server needs migration_manager for notifications and
carefully handles the manager's stop process not to demolish the
listeners list from under itself. From now on this dependency is
no longer valid (however the storage_service seems still need the
migration_manager, but this is different story).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This patch removes an implicit cql_server -> migration_manager
dependency, as the former's event notifier uses the latter
for notifications.
This dependency also breaks a loop:
storage_service -> cql_server -> migration_manager -> storage_service
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This patch breaks one (probably harmless but still) dependency
loop. The query_processor -> migration_manager -> storage_proxy
-> tracing -> query_processor.
The first link is not not needed, as the query_processor needs the
migration_manager purely to (ub)subscribe on notifications.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The same as with view builder. The constructor still needs both,
but the life-time reference is now for notifier only.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The migration manager itself is still needed on start to wait
for schema agreement, but there's no longer the need for the
life-time reference on it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Do not call for local migration manager instance to send notifications,
call for the local migration notifier, it will always be alive.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The storage service will need this guy to initialize sub-services
with. Also it registers itself with notifiers.
That said, it's convenient to have the migration notifier on board.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The _listeners list on migration_manager class and the corresponding
notify_xxx helpers have nothing to do with the its instances, they
are just transport for notification delivery.
At the same time some services need the migration manager to be alive
at their stop time to unregister from it, while the manager itself
may need them for its needs.
The proposal is to move the migration notifier into a complete separate
sharded "service". This service doesn't need anything, so it's started
first and stopped last.
While it's not effectively a "migration" notifier, we inherited the name
from Cassandra and renaming it will "scramble neurons in the old-timers'
brains but will make it easier for newcomers" as Avi says.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The comparator is refreshed to ensure the following:
- null compares less to all other types;
- null, true and false are comparable against each other,
while other types are only comparable against themselves and null.
Comparing mixed types is not currently reachable from the alternator
API, because it's only used for sets, which can only use
strings, binary blobs and numbers - thus, no new pytest cases are added.
Fixes#5454
When counter mutation is about to be sent, a leader is elected, but
if the leader fails after election, we get `rpc::closed_error`. The
exception propagates high up, causing all connections to be dropped.
This patch intercepts `rpc::closed_error` in `storage_proxy::mutate_counters`
and translates it to `mutation_write_failure_exception`.
References #2859
Run the test and compare results. Manage temporary
and .reject files.
Now that there are CQL tests, improve logging.
run_test success no longer means test success.
cql_repl is a simple program which reads CQL from stdin,
executes it, and writes results to stdout.
It support --input, --output and --log options.
--log is directed to cql_test.log by default.
--input is stdin by default
--output is stdout by default.
The result set output is print with a basic
JSON visitor.
Store test temporary files and logs in ${testdir}/${mode}.
Remove --jenkins and --xunit, and always write XML
files at a predefined location: ${testdir}/${mode}/xml/.
Use .xunit.xml extension for tests which XML output is
in xunit format, and junit.xml for an accumulated output
of all non-boost tests in junit format.
Load the command line arguments, if any, from suite.yaml, rather
than keep them hard-coded in test.py.
This is allows operations team to have easier access to these.
Note I had to sacrifice dynamic smp count for mutation_reader_test
(the new smp count is fixed at 3) since this is part
of test configuration now.
This way we can avoid iterating over all tests
to handle --repeat.
Besides, going forward the tests will be stored
in two places: in the global list of all tests,
for the runner, and per suite, for suite-based
reporting, so it's easier if TestSuite
if fully responsible for finding and adding tests.
Scan entire test/ for folders that contain suite.yaml,
and load tests from these folders. Skip the rest.
Each folder with a suite.yaml is expected to have a valid
suite configuration in the yaml file.
A suite is a folder with test of the same type. E.g.
it can be a folder with unit tests, boost tests, or CQL
tests.
The harness will use suite.yaml to create an appropriate
suite test driver, to execute tests in different formats.
It reduces the number of configurations to re-test when test.py is
modified. and simplifies usage of test.py in build tools, since you no
longer need to bother with extra arguments.
Going forward I'd like to make terminal output brief&tabular,
but some test details are necessary to preserve so that a failure
is easy to debug. This information now goes to the log file.
- open and truncate the log file on each harness start
- log options of each invoked test in the log, so that
a failure is easy to reproduce
- log test result in the log
Since tests are run concurrently, having an exact
trace of concurrent execution also helps
debugging flaky tests.
The storage_service struct is a collection of diverse things,
most of them requiring only on start and on stop and/or runing
on shard 0 (but is nonetheless sharded).
As a part of clearing this structure and generated by it inter-
-componenes dependencies, here's the sanitation of load_broadcaster.
Fixes#5582
... but only populate log on shard 0.
Migration manager callbacks are slightly assymetric. Notifications
for pre-create/update mutations are sent only on initiating shard
(neccesary, because we consider the mutations mutable).
But "created" callbacks are sent on all shards (immutable).
We must subscribe on all shards, but still do population of cdc table
only once, otherwise we can either miss table creat or populate
more than once.
v2:
- Add test case
Message-Id: <20200113140524.14890-1-calle@scylladb.com>
* seastar 36cf5c5ff0...3f3e117de3 (16):
> memcached: don't use C++17-only std::optional
> reactor: Comment why _backend is assigned in constructor body
> log: restore --log-to-stdout for backward compatibility
> used_size.hh: Include missing headers
> core: Move some code from reactor.cc to future.cc
> future-util: move parallel_for_each to future-util.cc
> task: stop wrapping tasks with unique_ptr
> Merge "Setup timer signal handler in backend constructor" from Pavel
Fixes#5524
> future: avoid a branch in future's move constructor if type is trivial
> utils: Expose used_size
> stream: Call get_future early
> future-util: Move parallel_for_each_state code to a .cc
> memcached: log exceptions
> stream: Delete dead code
> core: Turn pollable_fd into a simple proxy over pollable_fd_state.
> Merge "log to std::cerr" from Benny
This is the part of de-bloating storage_service.
The field in question is used to temporary keep the _token_metadata
value during shard-wide replication. There's no need to have it as
class member, any "local" copy is enough.
Also, as the size of token_metadata is huge, and invoke_on_all()
copies the function for each shard, keep one local copy of metadata
using do_with() and pass it into the invoke_on_all() by reference.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Reviewed-by: Asias He <asias@scylladb.com>
Message-Id: <20200113171657.10246-1-xemul@scylladb.com>
The query option always_return_static_content was added for lightweight
transations in commits e0b31dd273 (infrastructure) and 65b86d155e
(actual use). However, the flag was added unconditionally to
update_parameters::options. This caused it to be set for list
read-modify-write operations, not just for lightweight transactions.
This is a little wasteful, and worse, it breaks compatibility as old
nodes do not understand the always_return_static_content flag and
complain when they see it.
To fix, remove the always_return_static_content from
update_parameters::options and only set it from compare-and-swap
operations that are used to implement lightweight transactions.
Fixes#5593.
Reviewed-by: Gleb Natapov <gleb@scylladb.com>
Message-Id: <20200114135133.2338238-1-avi@scylladb.com>
The drain_in_progress variable here is the future that's set by the
drain() operation itself. Its promise is set when the drain() finishes.
The check for this future in the beginning of drain() is pointless.
No two drain()-s can run in parallels because of run_with_api_lock()
protection. Doing the 2nd drain after successfull 1st one is also
impossible due to the _operation_mode check. The 2nd drain after
_exceptioned_ (and thus incomplete) 1st one will deadlock, after
this patch will try to drain for the 2nd time, but that should by ok.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200114094724.23876-1-xemul@scylladb.com>
This change introduces system.clients table, which provides
information about CQL clients connected.
PK is the client's IP address, CK consists of outgoing port number
and client_type (which will be extended in future to thrift/alternator/redis).
Table supplies also shard_id and username. Other columns,
like connection_stage, driver_name, driver_version...,
are currently empty but exist for C* compatibility and future use.
This is an ordinary table (i.e. non-virtual) and it's updated upon
accepting connections. This is also why C*'s column request_count
was not introduced. In case of abrupt DB stop, the table should not persist,
so it's being truncated on startup.
Resolves#4820
"
Most of the code in `cell` and the `imr` infrastructure it is built on
is `noexcept`. This means that extra care must be taken to avoid rouge
exceptions as they will bring down the node. The changes introduced by
0a453e5d3a did just that - introduced rouge `std::bad_alloc` into this
code path by violating an undocumented and unvalidated assumption --
that fragment ranges passed to `cell::make_collection()` are nothrow
copyable and movable.
This series refactors `cell::make_collection()` such that it does not
have this assumption anymore and is safe to use with any range.
Note that the unit test included in this series, that was used to find
all the possible exception sources will not be currently run in any of
our build modes, due to `SEASTAR_ENABLE_ALLOC_FAILURE_INJECTION` not
being set. I plan to address this in a followup because setting this
flags fails other tests using the failure injection mechanism. This is
because these tests are normally run with the failure injection disabled
so failures managed to lurk in without anyone noticing.
Fixes: #5575
Refs: #5341
Tests: unit(dev, debug)
"
* 'data-cell-make-collection-exception-safety/v2' of https://github.com/denesb/scylla:
test: mutation_test: add exception safety test for large collection serialization
data/cell.hh: avoid accidental copies of non-nothrow copiable ranges
utils/fragment_range.hh: introduce fragment_range_view
We do not yet support the ScanIndexForward=false option for reversing
the sort order of a Query operation, as reported in issue #5153.
But even before implementing this feature, it is important that we
produce an error if a user attempts to use it - instead of outright
ignoring this parameter and giving the user wrong results. This is
what this patch does.
Before this patch, the reverse-order query in the xfailing test
test_query.py::test_query_reverse seems to succeed - yet gives
results in the wrong order. With this patch, the query itself fails -
stating that the ScanIndexForward=false argument is not supported.
Refs #5153
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200105113719.26326-1-nyh@scylladb.com>
Here's another theoretical problem, that involves 3 sequential calls
to respectively removenode, force_removenode and some other operation.
Let's walk through them
First goes the removenode:
run_with_api_lock
_operation_in_progress = "removenode"
storage_service::remove_node
sleep in replicating_nodes.empty() loop
Now the force_removenode can run:
run_with_no_api_lock
storage_service::force_removenode
check _operation_in_progress (not empty)
_force_remove_completion = true
sleep in _operation_in_progress.empty loop
Now the 1st call wakes up and:
if _force_remove_completion == true
throw <some exception>
.finally() handler in run_with_api_lock
_operation_in_progress = <empty>
At this point some other operation may start. Say, drain:
run_with_api_lock
_operation_in_progress = "drain"
storage_service::drain
...
go to sleep somewhere
No let's go back to the 1st op that wakes up from its sleep.
The code it executes is
while (!ss._operation_in_progress.empty()) {
sleep_abortable()
}
and while the drain is running it will never exit.
However (! and this is the core of the race) should the drain
operation happen _before_ the force_removenode, another check
for _operation_in_progress would have made the latter exit with
the "Operation drain is in progress, try again" message.
Fix this inconsistency by making the check for current operation
every wake-up from the sleep_abortable.
Fixes#5591
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Here's a theoretical problem, that involves 3 sequential calls
to respectively removenode, force_removenode and removenode (again)
operations. Let's walk through them
First goes the removenode:
run_with_api_lock
_operation_in_progress = "removenode"
storage_service::remove_node
sleep in replicating_nodes.empty() loop
Now the force_removenode can run:
run_with_no_api_lock
storage_service::force_removenode
check _operation_in_progress (not empty)
_force_remove_completion = true
sleep in _operation_in_progress.empty loop
Now the 1st call wakes up and:
if _force_remove_completion == true
_force_remove_completion = false
throw <some exception>
.finally() handler in run_with_api_lock
_operation_in_progress = <empty>
! at this point we have _force_remove_completion = false and
_operation_in_progress = <empty>, which opens the following
opportunity for the 3d removenode:
run_with_api_lock
_operation_in_progress = "removenode"
storage_service::remove_node
sleep in replicating_nodes.empty() loop
Now here's what we have in 2nd and 3rd ops:
1. _operation_in_progress = "removenode" (set by 3rd) prevents the
force_removenode from exiting its loop
2. _force_remove_completion = false (set by 1st on exit) prevents
the removenode from waiting on replicating_nodes list
One can start the 4th call with force_removenode, it will proceed and
wake up the 3rd op, but after it we'll have two force_removenode-s
running in parallel and killing each other.
I propose not to set _force_remove_completion to false in removenode,
but just exit and let the owner of this flag unset it once it gets
the control back.
Fixes#5590
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Other types do not have a wider accumulator at the moment.
And static_cast<accumulator_type>(ret) != _sum evaluates as
false for NaN/Inf floating point values.
Fixes#5586
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200112183436.77951-1-bhalevy@scylladb.com>
Now that atomic_cell_view and collection_mutation_view have
type-aware printers, we can use them in the type-aware atomic_cell_or_collection
printer.
Message-Id: <20191231142832.594960-1-avi@scylladb.com>
Merged pull request https://github.com/scylladb/scylla/pull/5533
from Avi Kivity:
canonical_mutation objects are used for schema reconciliation, which is a
fragile area and thus deserves some debugging help.
This series makes canonical_mutation objects printable.
Merged patch series from Piotr Sarna:
"Previous assumption was that there can only be one regular base column
in the view key. The assumption is still correct for tables created
via CQL, but it's internally possible to create a view with multiple
such columns - the new assumption is that if there are multiple columns,
they share their liveness.
This series is vital for indexing to work properly on alternator,
so it would be best to solve the issue upstream. I strived to leave
the existing semantics intact as long as only up to one regular
column is part of the materialized view primary key, which is the case
for Scylla's materialized views. For alternator it may not be true,
but all regular columns in alternator share liveness info (since
alternator does not support per-column TTL), which is sufficient
to compute view updates in a consistent way.
Fixes#5006
Tests: unit(dev), alternator(test_gsi_update_second_regular_base_column, tic-tac-toe demo)"
Piotr Sarna (3):
db,view: fix checking if partition key is empty
view: handle multiple regular base columns in view pk
test: add a case for multiple base regular columns in view key
alternator-test/test_gsi.py | 1 -
view_info.hh | 5 +-
cql3/statements/alter_table_statement.cc | 2 +-
db/view/view.cc | 77 ++++++++++++++----------
mutation_partition.cc | 2 +-
test/boost/cql_query_test.cc | 58 ++++++++++++++++++
6 files changed, 109 insertions(+), 36 deletions(-)
Merged patch series from Gleb Natapov:
"LWT is much more efficient if a request is processed on a shard that owns
a token for the request. This is because otherwise the processing will
bounce to an owning shard multiple times. The patch proposes a way to
move request to correct shard before running lwt. It works by returning
an error from lwt code if a shard is incorrect one specifying the shard
the request should be moved to. The error is processed by the transport
code that jumps to a correct shard and re-process incoming message there.
The nicer way to achieve the same would be to jump to a right shard
inside of the storage_proxy::cas(), but unfortunately with current
implementation of the modification statements they are unusable by
a shard different from where it was created, so the jump should happen
before a modification statement for an cas() is created. When we fix our
cql code to be more cross-shard friendly this can be reworked to do the
jump in the storage_proxy."
Gleb Natapov (4):
transport: change make_result to takes a reference to cql result
instead of shared_ptr
storage_service: move start_native_transport into a thread
lwt: Process lwt request on a owning shard
lwt: drop invoke_on in paxos_state prepare and accept
auth/service.hh | 5 +-
message/messaging_service.hh | 2 +-
service/client_state.hh | 30 +++-
service/paxos/paxos_state.hh | 10 +-
service/query_state.hh | 6 +
service/storage_proxy.hh | 2 +
transport/messages/result_message.hh | 20 +++
transport/messages/result_message_base.hh | 4 +
transport/request.hh | 4 +
transport/server.hh | 25 ++-
cql3/statements/batch_statement.cc | 6 +
cql3/statements/modification_statement.cc | 6 +
cql3/statements/select_statement.cc | 8 +
message/messaging_service.cc | 2 +-
service/paxos/paxos_state.cc | 48 ++---
service/storage_proxy.cc | 47 ++++-
service/storage_service.cc | 120 +++++++------
test/boost/cql_query_test.cc | 1 +
thrift/handler.cc | 3 +
transport/messages/result_message.cc | 5 +
transport/server.cc | 203 ++++++++++++++++------
21 files changed, 377 insertions(+), 180 deletions(-)
Use `seastar::memory::local_failure_injector()` to inject al possible
`std::bad_alloc`:s into the collection serialization code path. The test
just checks that there are no `std::abort()`:s caused by any of the
exceptions.
The test will not be run if `SEASTAR_ENABLE_ALLOC_FAILURE_INJECTION` is
not defined.
`cell::make_collection()` assumes that all ranges passed to it are
nothrow copyable and movable views. This is not guaranteed, is not
expressed in the interface and is not mentioned in the comments either.
The changes introduced by 0a453e5d3a to collection serialization, making
it use fragmented buffers, fell into this trap, as it passes
`bytes_ostream` to `cell::make_collection()`. `bytes_ostream`'s copy
constructor allocates and hence can throw, triggering an
`std::terminate()` inside `cell::make_collection()` as the latter is
noexcept.
To solve this issue, non-nothrow copyable and movable ranges are now
wrapped in a `fragment_range_view` to make them so.
`cell::make_collection()` already requires callers to keep alive the
range for the duration of the call, so this does not introduce any new
requirements to the callers. Additionally, to avoid any future
accidents, do not accept temporaries for the `data` parameter. We don't
ever want to move this param anyway, we will either have a trivially
copyable view, or a potentially heavy-weight range that we will create a
trivially copyable view of.
A lightweight, trivially copyable and movable view for fragment ranges.
Allows for uniform treatment of all kinds of ranges, i.e. treating all
of them as a view. Currently `fragment_range.hh` provides lightweight,
view-like adaptors for empty and single-fragment ranges (`bytes_view`). To
allow code to treat owning multi-fragment ranges the shame way as the
former two, we need a view for the latter as well -- this is
`fragment_range_view`.
Resolves#4820. Execution path in main.cc now cleans up system.clients
table if it exists (this is done on startup). Also, server.cc now calls
functions that notify about cql clients connecting/disconnecting.
This simplifies the storage_service API and fixes the
complain about shared_ptr usage instead of unique_ptr.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a lonely get_load_map() call on storage_service that
needs only load broadcaster, always runs on shard 0 and that's it.
Next patch will move this whole stuff into its own helper no-shard
container and this is preparation for this.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Since lwt requests are now running on an owning shard there is no longer
a need to invoke cross shard call on paxos_state level. RPC calls may
still arrive to a wrong shard so we need to make cross shard call there.
LWT is much more efficient if a request is processed on a shard that owns
a token for the request. This is because otherwise the processing will
bounce to an owning shard multiple times. The patch proposes a way to
move request to correct shard before running lwt. It works by returning
an error from lwt code if a shard is incorrect one specifying the shard
the request should be moved to. The error is processed by transport code
that jumps to a correct shard and re-process incoming message there.
"
The original fix (10f6b125c8) didn't
take into account that if there was a failed memtable flush (Refs
flush) but is not a flushable memtable because it's not the latest in
the memtable list. If that happens, it means no other memtable is
flushable as well, cause otherwise it would be picked due to
evictable_occupancy(). Therefore the right action is to not flush
anything in this case.
Suspected to be observed in #4982. I didn't manage to reproduce after
triggering a failed memtable flush.
Fixes#3717
"
* tag 'avoid-ooming-with-flush-continuations-v2' of github.com:tgrabiec/scylla:
database: Avoid OOMing with flush continuations after failed memtable flush
lsa: Introduce operator bool() to occupancy_stats
lsa: Expose region_impl::evictable_occupancy in the region class
"
Fix overflow handling in sum() and avg().
sum:
- aggregated into __int128
- detect overflow when computing result and log a warning if found
avg:
- fix division function to divide the accumulator type _sum (__int128 for integers) by _count
Add unit tests for both cases
Test:
- manual test against Cassandra 3.11.3 to make sure the results in the scylla unit test agree with it.
- unit(dev), cql_query_test(debug)
Fixes#5536
"
* 'cql3-sum-overflow' of https://github.com/bhalevy/scylla:
test: cql_query_test: test avg overflow
cql3: functions: protect against int overflow in avg
test: cql_query_test: test sum overflow
cql3: functions: detect and handle int overflow in sum
exceptions: sort exception_code definitions
exceptions: define additional cassandra CQL exceptions codes
"
We were failing to start a thread when the UDF call was nested in an
aggregate function call like SUM.
"
* 'espindola/fix-sum-of-udf' of https://github.com/espindola/scylla:
cql3: Fix indentation
cql3: Add missing with_thread_if_needed call
cql3: Implement abstract_function_selector::requires_thread
remove make_ready_future call
This was initialized to api::missing_timestamp but
should be set to either a client provided-timestamp or
the server's.
Unlike write operations, this timestamp need not be unique
as the one generated by client_state::get_timestamp.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200108074021.282339-2-bhalevy@scylladb.com>
exec->_cmd->read_timestamp may be initialized by default to api::min_timestamp,
causing:
service/storage_proxy.cc:3328:116: runtime error: signed integer overflow: 1577983890961976 - -9223372036854775808 cannot be represented in type 'long int'
Aborting on shard 1.
Do not optimize cross-dc repair if read_timestamp is missing (or just negative)
We're interested in reads that happen within write_timeout of a write.
Fixes#5556
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200108074021.282339-1-bhalevy@scylladb.com>
The cdc service is assigned from outside, post construction, mainly
because of the chickens and eggs in main startup. Would be nice to
have it unconditionally, but this is workable.
It is (and shall) only be called from inside storage proxy,
and we would like this to be reflected in the interface
so our eventual moving of cdc logic into the mutate call
chains become easier to verify and comprehend.
To eventually replace the free function.
Main difference is this is build to both handle batches correctly
and to eventually allow hanging cdc object on storage proxy,
and caches on the cdc object.
The test case checks that having two base regular columns
in the materialized view key (not obtainable via CQL),
still works fine when values are inserted or deleted.
If TTL was involved and these columns would have different expiration
rules, the case would be more complicated, but it's not possible
for a user to reach that case - neither with CQL, nor with alternator.
Previous assumption was that there can only be one regular base column
in the view key. The assumption is still correct for tables created
via CQL, but it's internally possible to create a view with multiple
such columns - the new assumption is that if there are multiple columns,
they share their liveness.
This patch is vital for indexing to work properly on alternator,
so it would be best to solve the issue upstream. I strived to leave
the existing semantics intact as long as only up to one regular
column is part of the materialized view primary key, which is the case
for Scylla's materialized views. For alternator it may not be true,
but all regular columns in alternator share liveness info (since
alternator does not support per-column TTL), which is sufficient
to compute view updates in a consistent way.
Fixes#5006
Tests: unit(dev), alternator(test_gsi_update_second_regular_base_column, tic-tac-toe demo)
Message-Id: <c9dec243ce903d3a922ce077dc274f988bcf5d57.1567604945.git.sarna@scylladb.com>
Now that position_in_partition_view has type-aware printing, use it
to provide a human readable version of clustering keys.
Message-Id: <20191231151315.602559-2-avi@scylladb.com>
If the position_in_partition_view represents a clustering key,
we can now see it with the clustering key decoded according to
the schema.
Message-Id: <20191231151315.602559-1-avi@scylladb.com>
Previous implementation did not take into account that a column
in a partition key might exist in a mutation, but in a DEAD state
- if it's deleted. There are no regressions for CQL, while for
alternator and its capability of having two regular base columns
in a view key, this additional check must be performed.
This reduces code bloat and makes the code friendlier for IDEs, as the
IDE now understands the type of create_schema.
Message-Id: <20191231134803.591190-1-avi@scylladb.com>
sstables::write_simple() has quite a lot of boilerplate
which gets replicated into each template instance. Move
all of that into a non-template do_write_simple(), leaving
only things that truly depend on the component being written
in the template, and encapsulating them with a
noncopyable_function.
An explicit template instantiation was added, since this
is used in a header file. Before, it likely worked by
accident and stopped working when the template became
small enough to inline.
Tests: unit (dev)
Message-Id: <20200106135453.1634311-1-avi@scylladb.com>
mutation_partition_view now supports a compile-time resolved visitor.
This is performant but results in bloat when the performance is not
needed. Furthermore, the template function that applies the object
to the visitor is private and out-of-line, to reduce compile time.
To allow visitation on mutation_partition_view objects, add a virtual
visitor type and a non-template accept function.
Note: mutation_partition_visitor is very similar to the new type,
but different enough to break the template visitor which is used
to implement the new visitor.
The new visitor will be used to implement pretty printing for
canonical_mutation.
Consider this:
1) Write partition_start of p1
2) Write clustering_row of p1
3) Write partition_end of p1
4) Repair is stopped due to error before writing partition_start of p2
5) Repair calls repair_row_level_stop() to tear down which calls
wait_for_writer_done(). A duplicate partition_end is written.
To fix, track the partition_start and partition_end written, avoid
unpaired writes.
Backports: 3.1 and 3.2
Fixes: #5527
commit 21dec3881c introduced
a bug that will cause scylla debian build to fail. This is
because the commit relied on the environment PRODUCT variable
to be exported (and as a result, to propogate to the rename
command that is executed by find in a subshell)
This commit fixes it by explicitly passing the PRODUCT variable
into the rename command.
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <20200106102229.24769-1-eliransin@scylladb.com>
In scylla, the replacing node is set as HIBERNATE status. It is the only
place we use HIBERNATE status. The replacing node is supposed to be
alive and updating its heartbeat, so it is not supposed to be in dead
state.
This patch fixes the following problem in replacing.
1) start n1, n2
2) n2 is down
3) start n3 to replace n2, but kill n3 in the middle of the replace
4) start n4 to replace n2
After step 3 and step 4, the old n3 will stay in gossip forever until a
full cluster shutdown. Note n3 will only stay in gossip but in
system.peers table. User will see the annoying and infinite logs like on
all the nodes
rpc - client $ip_of_n3:7000: fail to connect: Connection refused
Fixes: #5449
Tests: replace_address_test.py + manual test
VERSION_ID of centos7 is "7", but VERSION_ID of oel7.7 is "7.7"
scylla_ntp_setup doesn't work on OEL7.7 for ValueError.
- ValueError: invalid literal for int() with base 10: '7.7'
This patch changed redhat_version() to return version string, and compare
with parse_version().
Fixes#5433
Signed-off-by: Amos Kong <amos@scylladb.com>
When the progress is queried, e.g., query from nodetool netstats
the progress info might not be updated yet.
Fix it by checking before access the map to avoid errors like:
std::out_of_range (_Map_base::at)
Fixes: #5437
Tests: nodetool_additional_test.py:TestNodetool.netstats_test
This depends on the just emailed fixes to undefined behavior in
tests. With this change we should quickly notice if a change
introduces undefined behavior.
Fixes#4054
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20191230222646.89628-1-espindola@scylladb.com>
Merged pull request https://github.com/scylladb/scylla/pull/5538 from
Avi Kivity and Piotr Jastrzębski.
This series prepares CDC for rolling upgrade. This consists of
reducing the footprint of cdc, when disabled, on the schema, adding
a cluster feature, and redacting the cdc column when transferring
it to other nodes. The latter is needed because we'll want to backport
this to 3.2, which doesn't have canonical_mutations yet.
If in memory buffer has not enough space for incoming mutation it is
written into a file, but the code missed updating timestamp of a last
sync, so we may sync to often.
Message-Id: <20200102155049.21291-9-gleb@scylladb.com>
The code that enters the gate never defers before leaving, so the gate
behaves like a flag. Lets use existing flag to prohibit adding data to a
closed segment.
Message-Id: <20200102155049.21291-8-gleb@scylladb.com>
Currently segment closing code is spread over several functions and
activated based on the _closed flag. Make segment closing explicit
by moving all the code into close() function and call it where _closed
flag is set.
Message-Id: <20200102155049.21291-6-gleb@scylladb.com>
Currently sync() does two completely different things based on the
shutdown parameter. Separate code into two different function.
Message-Id: <20200102155049.21291-3-gleb@scylladb.com>
The original "test_schema_digest_does_not_change" test case ensures
that schema digests will match for older nodes that do not support
all the features yet (including computed columns).
The additional case uses sstables generated after CDC was enabled
and a table with CDC enabled is created,
in order to make sure that the digest computed
including CDC column does not change spuriously as well.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Addition of cdc column in scylla_tables changes how schema
digests are calculated, and affect the ABI of schema update
messages (adding a column changes other columns' indexes
in frozen_mutation).
To fix this, extend the schema_tables mechanism with support
for the cdc column, and adjust schemas and mutations to remove
that column when sending schemas during upgrade.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
An empty cdc column in scylla_tables is hashed differently from
a missing column. This causes schema mismatch when a schema is
propagated to another node, because the other node will redact
the schema column completely if the cluster feature isn't enabled,
and an empty value is hashed differently from a missing value.
Store a tombstone instead. Tombstones are removed before
digesting, so they don't affect the outcome.
This change also undoes the changes in 386221da84 ("schema_tables:
handle 'cdc' options") to schema_change_test
test_merging_does_not_alter_tables_which_didnt_change. That change
enshrined the breakage into the test, instead of fixing the root cause,
which was that we added an an extra mutation to the schema (for
cdc options, which were disabled).
Different versions of boost have different rules for what conversions
from cpp_int to smaller intergers are allowed.
We already had a function that worked with all supported versions, but
it was not being use by lua.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200104041028.215153-1-espindola@scylladb.com>
I noticed this while looking at the crashes next is currently
experiencing.
While I have no idea if this fixes the issue, it does avoid broken
future warnings (for no_sharded_instance_exception) in a debug build.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200103201540.65324-1-espindola@scylladb.com>
* seastar 0525bbb08...36cf5c5ff (6):
> memcached: Fix use after free in shutdown
> Revert "task: stop wrapping tasks with unique_ptr"
> task: stop wrapping tasks with unique_ptr
> http: Change exception formating to the generic seastar one
> Merge "Avoid a few calls to ~exception_ptr" from Rafael
> tests: fix core generation with asan
This patch adds a very comprehensive test for the ConditionExpression
feature, i.e., the newer syntax of conditional writes replacing
the old-style "Expected" - for the UpdateItem, PutItem and DeleteItem
operations.
I wrote these tests while closely following the DynamoDB ConditionExpression
documentation, and attempted to cover all conceivable features, subfeatures
and subcases of the ConditionExpression syntax - to serve as a test for a
future support for this feature in Alternator (see issue #5053).
As usual, all these tests pass on AWS DynamoDB, but because we haven't yet
implemented this feature in Alternator, all but one xfail on Alternator.
Refs #5053.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20191229143556.24002-1-nyh@scylladb.com>
If Alternator is requested to be enabled on a specific port but the port is
already taken, the boot fails as expected - but the error log is confusing;
It currently looks something like this:
WARN 2019-12-24 11:22:57,303 [shard 0] alternator-server - Failed to set up Alternator HTTP server on 0.0.0.0 port 8000, TLS port 8043: std::system_error (error system:98, posix_listen failed for address 0.0.0.0:8000: Address already in use)
... (many more messages about the server shutting down)
INFO 2019-12-24 11:22:58,008 [shard 0] init - Startup failed: std::system_error (error system:98, posix_listen failed for address 0.0.0.0:8000: Address already in use)
There are two problems here. First, the "WARN" should really be an "ERROR",
because it causes the server to be shut down and the user must see this error.
Second, the final line in the log, something the user is likely to see first,
contains only the ultimate cause for the exception (an address already in use)
but not the information what this address was needed for.
This patch solves both issues, and the log now looks like:
ERROR 2019-12-24 14:00:54,496 [shard 0] alternator-server - Failed to set up Alterna
tor HTTP server on 0.0.0.0 port 8000, TLS port 8043: std::system_error (error system
:98, posix_listen failed for address 0.0.0.0:8000: Address already in use)
...
INFO 2019-12-24 14:00:55,056 [shard 0] init - Startup failed: std::_Nested_exception<std::runtime_error> (Failed to set up Alternator HTTP server on 0.0.0.0 port 8000, TLS port 8043): std::system_error (error system:98, posix_listen failed for address 0.0.0.0:8000: Address already in use)
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20191224124127.7093-1-nyh@scylladb.com>
We don't support yet the ReturnValues option on PutItem, UpdateItem or
DeleteItem operations (see issue #5053), but if a user tries to use such
an option anyway, we silently ignore this option. It's better to fail,
reporting the unsupported option.
In this patch we check the ReturnValues option and if it is anything but
the supported default ("NONE"), we report an error.
Also added a test to confirm this fix. The test verifies that "NONE" is
allowed, and something which is unsupported (e.g., "DOG") is not ignored
but rather causes an error.
Refs #5053.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20191216193310.20060-1-nyh@scylladb.com>
These are flags we always want to enable. In particular, we want them
to be used by the bots, but the bots run this script with
--configure-flags, so they were being discarded.
We put the user option later so that they can override the common
options.
Fixes#5505
Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Reviewed-by: Takuya ASADA <syuu@scylladb.com>
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
There is no requirement that all notes be placed in a single
PT_NOTE. It looks like recent lld's actually put each section in its
own PT_NOTE.
This change looks for build-id in all PT_NOTE headers.
Fixes#5525
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20191227000311.421843-1-espindola@scylladb.com>
rpm compression uses xz, which is painfully slow. Adjust the
compression settings to run on all threads.
The xz utility documentation suggests that 0 threads is
equivalent to all CPUs, but apparently the library interface
(which rpmbuild uses) doesn't think the same way.
Message-Id: <20200101141544.1054176-1-avi@scylladb.com>
In the current code, support for case-sensitive (quoted) user-defined type
names is broken. For example, a test doing:
CREATE TYPE "PHone" (country_code int, number text)
CREATE TABLE cf (pk blob, pn "PHone", PRIMARY KEY (pk))
Fails - the first line creates the type with the case-sensitive name PHone,
but the second line wrongly ends up looking for the lowercased name phone,
and fails with an exception "Unknown type ks.phone".
The problem is in cql3_type_name_impl. This class is used to convert a
type object into its proper CQL syntax - for example frozen<list<int>>.
The problem is that for a user-defined type, we forgot to quote its name
if not lowercase, and the result is wrong CQL; For example, a list of
PHone will be written as list<PHone> - but this is wrong because the CQL
parser, when it sees this expression, lowercases the unquoted type name
PHone and it becomes just phone. It should be list<"PHone">, not list<PHone>.
The solution is for cql3_type_name_impl to use for a user-defined type
its get_name_as_cql_string() method instead of get_name_as_string().
get_name_as_cql_string() is a new method which prints the name of the
user type as it should be in a CQL expression, i.e., quoted if necessary.
The bug in the above test was apparently caused when our code serialized
the type name to disk as the string PHone (without any quoting), and then
later deserialized it using the CQL type parser, which converted it into
a lowercase phone. With this patch, the type's name is serialized as
"PHone", with the quotes, and deserialized properly as the type PHone.
While the extra quotes may seem excessive, they are necessary for the
correct CQL type expression - remember that the type expression may be
significantly more complex, e.g., frozen<list<"PHone">> and all of this,
including the quotes, is necessary for our parser to be able to translate
this string back into a type object.
This patch may cause breakage to existing databases which used case-
sensitive user-defined types, but I argue that these use cases were
already broken (as demonstrated by this test) so we won't break anything
that actually worked before.
Fixes#5544
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200101160805.15847-1-nyh@scylladb.com>
The class in question wants to run its own instances on different
shards, for this sake it keeps reference on sharded self to call
invoke_on() on. There's a handy peering_sharded_service<> in seastar
for the same, using it makes the code nicer and shorter.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20191226112401.23960-1-xemul@scylladb.com>
We had a lot of code in a .hh file, that while using templeates, was
only used from creating functions during startup.
This moves it to a new .cc file.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200101002158.246736-1-espindola@scylladb.com>
I can't quite figure out how we were trying to write a sstable with
the large data handler already stopped, but the backtrace suggests a
good place to add extra checks.
This patch adds two check. One at the start and one at the end of
sstable::write_components. The first one should give us better
backtraces if the large_data_handler is already stopped. The second
one should help catch some race condition.
Refs: #5470
Message-Id: <20191231173237.19040-1-espindola@scylladb.com>
The standard printer for atomic_cell prints the value as hex,
because atomic_cell does not include the type. Add a type-aware
printer that allows the user to provide the type.
When the product name is other than "scylla", the debian
packaging scripts go over all files that starts with "scylla-"
and change the prefix to be the actual product name.
However, if there are no such files in the directory
the script will fail since the renaming command will
get the wildcard string instrad of an actual file name.
This patch replaces the command with a command with
an equivalent desired effect that only operates on files
if there are any.
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <20191230143250.18101-1-eliransin@scylladb.com>
Since we merged /usr/lib/scylla with /opt/scylladb, we removed
/usr/lib/scylla and replace it with the symlink point to /opt/scylladb.
However, RPM does not support replacing a directory with a symlink,
we are doing some dirty hack using RPM scriptlet, but it causes
multiple issues on upgrade/downgrade.
(See: https://docs.fedoraproject.org/en-US/packaging-guidelines/Directory_Replacement/)
To minimize Scylla upgrading/downgrade issues on user side, it's better
to keep /usr/lib/scylla directory.
Instead of creating single symlink /usr/lib/scylla -> /opt/scylladb,
we can create symlinks for each setup scripts like
/usr/lib/scylla/<script> -> /opt/scylladb/scripts/<script>.
Fixes#5522Fixes#4585Fixes#4611
The '--builddir' option value is assigned to the "builddir" variable,
which is wrong. The correct variable is "BUILDDIR" so use that instead
to fix the '--builddir' option.
Also, add logging to the script when executing the "dist/redhat_build.rpm.sh"
script to simplify debugging.
Hit the following ubsan error with bootstrap_test:TestBootstrap.manual_bootstrap_test in debug mode:
service/storage_service.cc:3519:37: runtime error: load of value 190, which is not a valid value for type 'bool'
The use site is:
service::storage_service::is_cleanup_allowed(seastar::basic_sstring<char, unsigned int, 15u, true>)::{lambda(service::storage_service&)#1}::operator()(service::storage_service&) const at /local/home/bhalevy/dev/scylla/service/storage_service.cc:3519
While at it, initialize `_initialized` to false as well, just in case.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Avoid following UBSAN error:
repair/row_level.cc:2141:7: runtime error: load of value 240, which is not a valid value for type 'bool'
Fixes#5531
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
A Linux machine typically has multiple clocksources with distinct
performances. Setting a high-performant clocksource might result in
better performance for ScyllaDB, so this should be considered whenever
starting it up.
This patch introduces the possibility of enforcing optimized Linux
clocksource to Scylla's setup/start-up processes. It does so by adding
an interactive question about enforcing clocksource setting to scylla_setup,
which modifies the parameter "CLOCKSOURCE" in scylla_server configuration
file. This parameter is read by perftune.py which, if set to "yes", proceeds
to (non persistently) setting the clocksource. On x86, TSC clocksource is used.
Fixes#4474Fixes#5474Fixes#5480
Instances of `variable_specifications` are passed around as
shared_ptr's, which are redundant in this case since the class
is marked as `final`. Use `lw_shared_ptr` instead since we know
for sure it's not a polymorphic pointer.
Tests: unit(debug)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20191225232853.45395-1-pa.solodovnikov@scylladb.com>
If any two directories of data/commitlog/hints/view_hints
are the same we still end up running verify_owner_and_mode
and disk_sanity(check_direct_io_support) in parallel
on the same directoriea and hit #5510.
This change uses std::set rather than std::vector to
collect a unique set of directories that need initialization.
Fixes#5510
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20191225160645.2051184-1-bhalevy@scylladb.com>
db::commitlog::segment::batch_cycle() assumes that after a write
for a certain position completes (as reported by
_pending_ops.wait_for_pending()) it will also be flushed, but this is
true only if writing and flushing are atomic wrt _pending_ops lock.
It usually is unless flush_after is set to false when cycle() is
called. In this case only writing is done under the lock. This
is exactly what happens when a segment is closed. Flush is skipped
because zero header is added after the last entry and then flushed, but
this optimization breaks batch_cycle() assumption. Fix it by flushing
after the write atomically even if a segment is being closed.
Fixes#5496
Message-Id: <20191224115814.GA6398@scylladb.com>
The hints and view_hints directory has per-shard sub-dirs,
and the directories code tries to create, check and lock
all of them, including the base one.
The manipulations in question are excessive -- it's enough
to check and lock either the base dir, or all the per-shard
ones, but not everything. Let's take the latter approach for
its simplicity.
Fixes#5510
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Looks-good-to: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20191223142429.28448-1-xemul@scylladb.com>
Pekka Enberg <penberg@scylladb.com> wrote:
> Image might not be present, but the subsequent "docker run" command will automatically pull it.
Just letting "docker run" fail produces kinda confusing error message,
referring to docker help, but the we want to provide the user
with our own help, so still fail early, just also try to pull the image
if "docker image inspect" failed, indicating it's not present locally.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20191223085219.1253342-4-bhalevy@scylladb.com>
Suggested-by: Pekka Enberg <penberg@scylladb.com>
> This will print all the available Docker images,
> many (most?) of them completely unrelated.
> Why not just print an error saying that no image was specified,
> and then perhaps print usage.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20191223085219.1253342-3-bhalevy@scylladb.com>
Add dbuild dependency on python3-colorama,
which will be used in test.py instead of a hand-made palette.
[avi: update tools/toolchain/image]
Message-Id: <20191223125251.92064-2-kostja@scylladb.com>
In this place we only need to know the number of endpoints,
while current code additionally shuffles them before counting.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two _identical_ methods in token_metadata class:
get_all_endpoints_count() and number_of_endpoints().
The former one is used (called) the latter one is not used, so
let's remove it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This greatly helps to narrow down the source of schema digest mismatch
between nodes. Intented use is to enable this logger on disagreeing
nodes and trigger schema digest recalculation and observe which
mutations differ in digest and then examine their content.
Message-Id: <1574872791-27634-1-git-send-email-tgrabiec@scylladb.com>
In commit b463d7039c (repair: Introduce
get_combined_row_hash_response), working_row_buf_nr is returned in
REPAIR_GET_COMBINED_ROW_HASH in addition to the combined hash. It is
scheduled to be part of 3.1 release. However it is not backported to 3.1
by accident.
In order to be compatible between 3.1 and 3.2 repair. We need to drop
the working_row_buf_nr in 3.2 release.
Fixes: #5490
Backports: 3.2
Tests: Run repair in a mixed 3.1 and 3.2 cluster
Changes summary:
* make `cql3::result_set` movable-only
* change signature of `cql3::result::result_set` to return by cref
* adjust available call sites to the aforementioned method to accept cref
Motivation behind this change is elimination of dangerous API,
which can easily set a trap for developers who don't expect that
result_set would be returned by value.
There is no point in copying the `result_set` around, so make
`cql3::result::result_set` to cache `result_set` internally in a
`unique_ptr` member variable and return a const reference so to
minimize unnecessary copies here and there.
Tests: unit(debug)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20191220115100.21528-1-pa.solodovnikov@scylladb.com>
So a higher level component using the validator to validate a stream can
catch only validation errors, and let any other incidental exception
through.
This allows building data correctors on top of the
`mutation_fragment_stream_validator`, by filtering a fragment stream
through a validator, catching invalid fragment stream exceptions and
dropping the respective fragments from the stream.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20191220073443.530750-1-bdenes@scylladb.com>
This reverts commit 237ba74743. While it
works for the scylla executable, it fails for iotune, which is built
by seastar. It should be reinstated after we pass the correct link
parameters to the seastar build system.
"
These series solves an issue with scylla_setup and prevent it from
waiting forever if housekeeping cannot look for the new Scylla version.
Fixes#5302
It should be backported to versions that support offline installations.
"
* 'scylla_setup_timeout' of git://github.com/amnonh/scylla:
scylla_setup: do not wait forever if no reply is return housekeeping
scylla_util.py: Add optional timeout to out function
Having a long path allows patchelf to change the interpreter without
changing the PT_LOAD headers and therefore without moving the
build-id out of the first page.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20191213224803.316783-1-espindola@scylladb.com>
Suppose we have a multi-dc setup (e.g. 9 nodes distributed across
3 datacenters: [dc1, dc2, dc3] -> [3, 3, 3]).
When a query that uses LWT is executed with LOCAL_SERIAL consistency
level, the `storage_proxy::get_paxos_participants` function
incorrectly calculates the number of required participants to serve
the query.
In the example above it's calculated to be 5 (i.e. the number of
nodes needed for a regular QUORUM) instead of 2 (for LOCAL_SERIAL,
which is equivalent to LOCAL_QUORUM cl in this case).
This behavior results in an exception being thrown when executing
the following query with LOCAL_SERIAL cl:
INSERT INTO users (userid, firstname, lastname, age) VALUES (0, 'first0', 'last0', 30) IF NOT EXISTS
Unavailable: Error from server: code=1000 [Unavailable exception] message="Cannot achieve consistency level for cl LOCAL_SERIAL. Requires 5, alive 3" info={'required_replicas': 5, 'alive_replicas': 3, 'consistency': 'LOCAL_SERIAL'}
Tests: unit(dev), dtest(consistency_test.py)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20191216151732.64230-1-pa.solodovnikov@scylladb.com>
The actual buffer is now in a member called 'data'. Leave the old
`dummy.dummy` and `dummy` as fall-back. This seems to change every
Fedora release.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20191218153544.511421-1-bdenes@scylladb.com>
Schema is node-global, update_schema_version_and_announce() updates
all shards. We don't need to recalculate it from every shard, so
install the listeners only on shard 0. Reduces noise in the logs.
Message-Id: <1574872860-27899-1-git-send-email-tgrabiec@scylladb.com>
The option in question apparently does not work, several sharded objects
are start()-ed (and thus instanciated) in join_roken_ring, while instances
themselves of these objects are used during init of other stuff.
This leads to broken seastar local_is_initialized assertion on sys_dist_ks,
but reading the code shows more examples, e.g. the auth_service is started
on join, but is used for thrift and cql servers initialization.
The suggestion is to remove the option instead of fixing. The is_joined
logic is kept since on-start joining still can take some time and it's safer
to report real status from the API.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20191203140717.14521-1-xemul@scylladb.com>
Merged pull request https://github.com/scylladb/scylla/pull/5366 from Calle Wilund:
Moves schema creation/alter/drop awareness to use new "before" callbacks from
migration manager, and adds/modifies log and streams table as part of the base
table modification.
Makes schema changes semi-atomic per node. While this does not deal with updates
coming in before a schema change has propagated cluster, it now falls into the
same pit as when this happens without CDC.
Added side effect is also that now schemas are transparent across all subsystems,
not just cql.
Patches:
cdc_test: Add small test for altering base schema (add column)
cdc: Handle schema changes via migration manager callbacks
migration_manager: Invoke "before" callbacks for table operations
migration_listener: Add empty base class and "before" callbacks for tables
cql_test_env: Include cdc service in cql tests
cdc: Add sharded service that does nothing.
cdc: Move "options" to separate header to avoid to much header inclusion
cdc: Remove some code from header
* seastar 00da4c8760...0525bbb08f (7):
> future: Simplify future_state_base::any move constructor
> future: don't create temporary tuple on future::get().
> future: don't instantiate new future on future::then_wrapped().
> future: clean-up the Result handling in then_wrapped().
> Merge "Fix core dumps when asan is enabled" from Rafael
> future: Move ignore to the base class
> future: Don't delete in ignore
Currently `SCYLLA-VERSION-GEN` is not a dependency of any target and
hence changes done to it will not be picked up by ninja. To trigger a
rebuild and hence version changes to appear in the `scylla` target
binary, one has to do `touch configure.py`. This is counter intuitive
and frustrating to people who don't know about it and wonder why their
changed version is not appearing as the output of `scylla --version`.
This patch makes `SCYLLA-VERSION-GEN` a dependency of `build.ninja,
making the `build.ninja` target out-of-date whenever
`SCYLLA-VERSION-GEN` is changed and hence will trigger a rerun of
`configure.py` when the next target is built, allowing a build of e.g.
`scylla` to pick up any changes done to the version automatically.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20191217123955.404172-1-bdenes@scylladb.com>
"
This patch set rearranges the test files so that
it is now possible to search for tests automatically,
and adds this functionality to test.py
"
* 'test.py.requeue' of ssh://github.com/scylladb/scylla-dev:
cmake: update CMakeLists.txt to scan test/ rather than tests/
test.py: automatically lookup all unit and boost tests
tests: move all test source files to their new locations
tests: move a few remaining headers
tests: move another set of headers to the new test layout
tests: move .hh files and resources to new locations
tests: remove executable property from data_listeners_test.cc
When scylla is installed without a network connectivity, the test if a
newer version is available can cause scylla_setup to wait forever.
This patch adds a limit to the time scylla_setup will wait for a reply.
When there is no reply, the relevent error will be shown that it was
unable to check for newer version, but this will not block the setup
script.
Fixes#5302
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Merged pull request https://github.com/scylladb/scylla/pull/5343 from
Benny Halevy.
Fixes#5340
Hold the sstable_deletion_sem table::move_sstables_from_subdirs to
serialize access to the staging directory. It now synchronizes snapshot,
compaction deletion of sstables, and view_update_generator moving of
sstables from staging.
Tests:
unit (dev) [expect test_user_function_timestamp_return that fails for me locally, but also on master]
snapshot_test.py (dev)
I used the following as a reference:
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/virtual/ClientsTable.java
At this moment there is only info about IP, clients outgoing port,
client 'type' (i.e. CQL/thrift/alternator), shard ID and username.
Column `request_count' is NOT present and CK consists of
(`port', `client_type'), contrary to what C*'s has: (`port').
Code that notifies `system.clients` about new connections goes
to top-level files `connection_notifier.*`. Currently only CQL
clients are observed, but enum `client_type` can be used in future
to notify about connections with other protocols.
Hold the _sstable_deletion_sem while moving sstables from the staging directory
so not to move them under the feet of table::snapshot.
Fixes#5340
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Consumer may throw, in this case, break from the loop and retry.
move_sstable_from_staging_in_thread may theoretically throw too,
ignore the error in this case since the sstable was already processed,
individual move failures are already ignored and moving from staging
will be retried upon restart.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To be used for "batch" move of several sstables from staging
to the base directory, allowing the caller to sync the directories
once when all are moved rather than for each one of them.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
distributed_loader::probe_file needlessly creates a seastar
thread for it and the next patch will use it as part of
a parallel_for_each loop to move a list of sstables
(and sync the directories once at the end).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We do not yet support the parallel Scan options (TotalSegments, Segment),
as reported in issue #5059. But even before implementing this feature, it
is important that we produce an error if a user attempts to use it - instead
of outright ignoring this parameter. This is what this patch does.
The patch also adds a full test, test_scan.py::test_scan_parallel, for the
parallel scan feature. The test passes on DynamoDB, and still xfails
on Alternator after this patch - but now the Scan request fails immediately
reporting the unsupported option - instead of what the pre-patch code did:
returning the wrong results and the test failing just when the results
do not match the expectations.
Refs #5059.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20191217084917.26191-1-nyh@scylladb.com>
"
Only the first patch is needed to fix the undefined behavior, but the
followup ones simplify the memory management around user types.
"
* 'espindola/fix-5193-v2' of ssh://github.com/espindola/scylla:
db: Don't use lw_shared_ptr for user_types_metadata
user_types_metadata: don't implement enable_lw_shared_from_this
cql3: pass a const user_types_metadata& to prepare_internal
db: drop special case for top level UDTs
db: simplify db::cql_type_parser::parse
db: Don't create a reference to nullptr
Add test for loading a schema with a non native type
1. Move tests to test (using singular seems to be a convention
in the rest of the code base)
2. Move boost tests to test/boost, other
(non-boost) unit tests to test/unit, tests which are
expected to be run manually to test/manual.
Update configure.py and test.py with new paths to tests.
Move sstable_test.hh, test_table.hh and cql_assertions.hh from tests/ to
test/lib or test/boost and update dependent .cc files.
Move tests/perf_sstable.hh to test/perf/perf_sstable.hh
Move another small subset of headers to test/
with the same goals:
- preserve bisectability
- make the revision history traceable after a move
Update dependent files.
The plan is to move the unstructured content of tests/ directory
into the following directories of test/:
test/lib - shared header and source files for unit tests
test/boost - boost unit tests
test/unit - non-boost unit tests
test/manual - tests intended to be run manually
test/resource - binary test resources and configuration files
In order to not break git bisect and preserve the file history,
first move most of the header files and resources.
Update paths to these files in .cc files, which are not moved.
We're seeing the following error from test from time to time:
fatal error: in "test_allocation_failure": std::runtime_error: Did not get expected exception from writing too large record
This is not reproducible and the error string does not contain
enough information to figure out what happened exactly, therefore
this patch adds an exception if the call succeeded unexpectedly
and also prints the unexpected exception if one was caught.
Refs #4714
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20191215052434.129641-1-bhalevy@scylladb.com>
The is_podman check was depending on `docker -v` printing "podman" in
the output, but that doesn't actually work, since podman prints $0.
Use `docker --help` instead, which will output "podman".
Also return podman's return status, which was previously being
dropped.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
"
On start there are two things that scylla does on data/commitlog/etc.
dirs: locks and verifies permissions. Right now these two actions are
managed by different approaches, it's convenient to merge them.
Also the introduced in this set directories class makes a ground for
better --workdir option handling. In particular, right now the db::config
entries are modified after options parse to update directories with
the workdir prefix. With the directories class at hands will be able
to stop doing this.
"
* 'br-directories-cleanup' of https://github.com/xemul/scylla:
directories: Make internals work on fs::path
directories: Cleanup adding dirs to the vector to work on
directories: Drop seastar::async usage
directories: Do touch_and_lock and verify sequentially
directories: Do touch_and_lock in parallel
directories: Move the whole stuff into own .cc file
directories: Move all the dirs code into .init method
file_lock: Work with fs::path, not sstring
This reverts commit 4333b37f9e. It breaks upgrades,
and the user question is not informative enough for the user to make a correct
decision.
Fixes#5478.
Fixes#5480.
The unordered_set is turned into vector since for fs::path
there's no hash() method that's needed for set.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now the only future-able operation remained is the call to
parallel_for_each(), all the rest is non-blocking preparation,
so we can drop the seastar::async and just return the future
from parallel_for_each.
The indendation is now good, as in previous patch is was prepared
just for that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The goal is to drop the seastar::async() usage.
Currently we have two places that return futures -- calls to
parallel_for_each-s. We can either chain them together or,
since both are working on the same set of directories, chain
actions inside them.
For code simplicity I propose to chain actions.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The list of paths that should be touch-and-locked is already
at hands, this shortens the code and makes it slightly faster
(in theory).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In order not to pollute the root dir place the code in
utils/ directory, "utils" namespace.
While doing this -- move the touch_and_lock from the
class declaration.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The seastar::async usage is tempoarary, added for bisect-safety,
soon it will go away. For this reason the indentation in the
.init method is not "canonical", but is prepared for one-patch
drop of the seastar::async.
The hinted_handoff_enabled arg is there, as it's not just a
parameter on config, it had been parsed in main.cc.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The main.cc code that converts sstring to fs::path
will be patched soon, the file_desc::open belongs
to seastar and works on sstrings.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
"
LWT is much more efficient if a request is processed on a shard that owns
a token for the request. This is because otherwise the processing will
bounce to an owning shard multiple times. The patch proposes a way to
move request to correct shard before running lwt. It works by returning
an error from lwt code if a shard is incorrect one specifying the shard
the request should be moved to. The error is processed by the transport
code that jumps to a correct shard and re-process incoming message there.
"
* 'gleb/bounce_lwt_request' of github.com:scylladb/seastar-dev:
lwt: take raw lock for entire cas duration
lwt: drop invoke_on in paxos_state prepare and accept
lwt: Process lwt request on a owning shard
storage_service: move start_native_transport into a thread
transport: change make_result to takes a reference to cql result instead of shared_ptr
The implementation of Expected's BEGINS_WITH operator on blobs was
incorrect, naively comparing the base64-encoded strings, which doesn't
work. This patches fixes the code to compare the decoded strings.
The reason why the BEGINS_WITH test missed this bug was that we forgot
to check the blob case and only tested the string case; So this patch
also adds the missing test - which reproduces this bug, and verifies
its fix.
Fixes#5457
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20191211115526.29862-1-nyh@scylladb.com>
The LIKE operator requires filtering, so needs_filtering() must check
is_LIKE(). This already happens for partition columns, but it was
overlooked for clustering columns in the initial implementation of
LIKE.
Fixes#5400.
Tests: unit(dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
To be used by dtest as an indicator that endpoint's hints
were drained and hints directory is removed.
Refs #5354
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The user_types_metadata can simply be owned by the keyspace. This
simplifies the code since we never have to worry about nulls and the
ownership is now explicit.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
It looks like this was done just to avoid including
user_types_metadata.hh, which seems a bit much considering that it
requires adding specialization to the seastar namespace.
A followup patch will also stop using lw_shared_ptr for
user_types_metadata.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
We never modify the user_types_metadata via prepare_internal, so we
can pass it a const reference.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This was originally done in 7f64a6ec4b,
but that commit was reverted in reverted in
8517eecc28.
The revert was done because the original change would call parse_raw
for non UDT types. Unlike the old patch, this one doesn't change the
behavior of non UDT types.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The variant of db::cql_type_parser::parse that has a
user_types_metadata argument was only used from the variant that
didn't. This inlines one in the other.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The user_types variable can be null during db startup since we have to
create types before reading the system table defining user types.
This avoids undefined behavior, but is unlikely that it was causing
more serious problems since the variable is only used when creating
user types and we don't create any until after all system tables are
read, in which case the user_types variable is not null.
Fixes#5193
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
LWT is much more efficient if a request is processed on a shard that owns
a token for the request. This is because otherwise the processing will
bounce to an owning shard multiple times. The patch proposes a way to
move request to correct shard before running lwt. It works by returning
an error from lwt code if a shard is incorrect one specifying the shard
the request should be moved to. The error is processed by transport code
that jumps to a correct shard and re-process incoming message there.
This patch adds comprehensive tests for the ReturnValue parameter of
the write operations (PutItem, UpdateItem, DeleteItem), which can return
pre-write or post-write values of the modified item. The tests are in
a new test file, alternator-test/test_returnvalues.py.
This feature is not yet implemented in Alternator, so all the new
tests xfail on Alternator (and all pass on AWS).
Refs #5053
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20191127163735.19499-1-nyh@scylladb.com>
This patch adds tests for Query's "ScanIndexForward" parameter, which
can be used to return items in reversed sort order.
We test that a Limit works and returns the given number of *last* items
in the sort order, and also that such reverse queries can be resumed,
i.e., paging works in the reverse order.
These tests pass against AWS DynamoDB, but fail against Alternator (which
doesn't support ScanIndexForward yet), so it is marked xfail.
Refs #5153.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20191127114657.14953-1-nyh@scylladb.com>
The JMX interface is implemented by the scylla-jmx project, not scylla.
Therefore, let's remove this historical reference to MBeans from
storage_proxy.
Message-Id: <20191211121652.22461-1-penberg@scylladb.com>
"
Add --experimental-features -- a vector of features to unlock. Make corresponding changes in the YAML parser.
Fixes#5338
"
* 'vecexper' of https://github.com/dekimir/scylla:
config: Add `experimental_features` option
utils: Add enum_option
* seastar e440e831c8...00da4c8760 (7):
> Merge "reactor: fix iocb pool underflow due to unaccounted aio fsync" from Avi
Fixes#5443.
> install-dependencies.sh: fix arch dependencies
> Merge " rpc: fix use-after-free during rpc teardown vs. rpc server message handling" from Benny
> Merge "testing: improve the observability of abandoned failed futures" from Botond
> rework the fair_queue tester
> directory_test: Update to use run instead of run_deprecated
> log: support fmt 6.0 branch with chrono.h for log
In the calculate_delay() code for view-backlog flow control, we calculate
a delay and cap it at a "budget" - the remaining timeout. This timeout is
measured in milliseconds, but the capping calculation converted it into
microseconds, which overflowed if the timeout is very large. This causes
some tests which enable the UB sanitizer to fail.
We fix this problem by comparing the delay to the budget in millisecond
resolution, not in microsecond resolution. Then, if the calculated delay
is short enough, we return it using its full microsecond resolution.
Fixes#5412
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20191205131130.16793-1-nyh@scylladb.com>
Merged pull request https://github.com/scylladb/scylla/pull/5453
from Piotr Sarna:
Checking the EQ relation for alternator attributes is usually performed
simply by comparing underlying JSON objects, but sets (SS, BS, NS types)
need a special routine, as we need to make sure that sets stored in
a different order underneath are still equal, e.g:
[1, 3, 2] == [1, 2, 3]
Fixes#5021
Checking the EQ relation for alternator attributes is usually performed
simply by comparing underlying JSON objects, but sets (SS, BS, NS types)
need a special routine, as we need to make sure that sets stored in
a different order underneath are still equal, e.g:
[1, 3, 2] == [1, 2, 3]
Fixes#5021
If a set of mutations contains both an entry that deletes a table
and an entry that adds a table with the same name, it's expected
to be a replacement operation (delete old + create new),
rather than a useless "try to create a table even though it exists
already and then immediately delete the original one" operation.
As such, notifications about the deletions should be performed
before notifications about the creations. The place that originally
suffered from this wrong order is view building - which in this case
created an incorrect duplicated entry in the view building bookkeeping,
and then immediately deleted it, resulting in having old, deprecated
entries with stale UUIDS lying in the build queue and never proceeding,
because the underlying table is long gone.
The issue is fixed by ensuring the order of notifications:
- drops are announced first, view drops are announced before table drops;
- creations follow, table creations are announced before views;
- finally, changes to tables and views are announced;
Fixes#4382
Tests: unit(dev), mv_populating_from_existing_data_during_node_stop_test
Iterate over an array holding all rpm names to see if any
of them is missing from `dist/ami/files`. If they are missing,
look them up in build/redhat/RPMS/x86_64 so that if reloc/build_rpm.sh
was run manually before dist/ami/build_ami.sh we can just collect
the built rpms from its output dir.
If we're still missing any rpms, then run reloc/build_rpm.sh
and copy the required rpms from build/redhat/RPMS/x86_64.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Reviewed-by: Glauber Costa <glauber@scylladb.com>
In swagger 1.2 int is defined as int32.
We originally used int following the jmx definition, in practice
internally we use uint and int64 in many places.
While the API format the type correctly, an external system that uses
swagger-based code generator can face a type issue problem.
This patch replace all use of int in a return type with long that is defined as int64.
Changing the return type, have no impact on the system, but it does help
external systems that use code generator from swagger.
Fixes#5347
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Provide some explanation on prio strings + direction to gnutls manual.
Document client auth option.
Remove confusing/misleading statement on "custom options"
Message-Id: <20191210123714.12278-1-calle@scylladb.com>
Enable existing NOT_CONTAINS test, add NOT_CONTAINS to the list of
recognized operators, implement check_NOT_CONTAINS, and hook it up to
verify_expected_one().
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
When the user wants to turn on only some experimental features, they
can use this new option. The existing `experimental` option is
preserved for backwards compatibility.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Almost all commands provided by `scylla-gdb.py` are safe to use. The
worst that could happen if they fail is that you won't get the desired
information. There is one notable exception: `scylla thread`. If
anything goes wrong while this command is executed - gdb crashes, a bug
in the command, etc. - there is a good change the process under
examination will crash. Sometimes this is fine, but other times e.g.
when live debugging a production node, this is unacceptable.
To avoid any accidents add documentation to all commands working with
`seastar::thread`. And since most people don't read documentation,
especially when debugging under pressure, add a safety net to the
`scylla thread` command. When run, this command will now warn of the
dangers and will ask for explicit acknowledgment of the risk of crash,
by means of passing an `--iamsure` flag. When this flag is missing, it
will refuse to run. I am sure this will be very annoying but I am also
sure that the avoided crashes are worth it.
As part of making `scylla thread` safe, its argument parsing code is
migrated to `argparse`. This changes the usage but this should be fine
because it is well documented.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20191129092838.390878-1-bdenes@scylladb.com>
The ami description attribute is only allowed to be 255
characters long. When build_ami.sh generates an ami, it
generates an ami description which is a concatenation
of all of the componnents version strings. It can
happen that the description string is too long which
eventually causes the ami build to fail. This patch
trims the description string to 255 characters.
It is ok since the individual versions of the components
are also saved in tags attached to the image.
Tests:
1. Reproduced with a long description and
validated that it doesn't fail after the fix.
Fixes#5435
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <20191209141143.28893-1-eliransin@scylladb.com>
A Linux machine typically has multiple clocksources with distinct
performances. Setting a high-performant clocksource might result in
better performance for ScyllaDB, so this should be considered whenever
starting it up.
This patch introduces the possibility of enforcing optimized Linux
clocksource to Scylla's setup/start-up processes. It does so by adding
an interactive question about enforcing clocksource setting to scylla_setup,
which modifies the parameter "CLOCKSOURCE" in scylla_server configuration
file. This parameter is read by perftune.py which, if set to "yes", proceeds
to (non persistently) setting the clocksource. On x86, TSC clocksource is
used.
Fixes#4474
This allows us to create/alter/drop log and desc tables "atomically"
with the base, by including these mutations in the original mutation
set, i.e. batch create/alter tables.
Note that population does not happen until types are actually
already put into database (duh), thus there _is_ still a gap
between creating cdc and it being truly usable. This may or may
not need handling later.
A general build system knows about 3 machines:
* build: where the building is running
* host: where the built software will run
* target: the machine the software will produce code for
The target machine is only relevant for compilers, so we can ignore
it.
Until now we could ignore the build and host distinction too. This
patch adds the first difference: don't use host ld_flags when linking
build tools (gen_crc_combine_table).
The reason for this change is to make it possible to build with
-Wl,--dynamic-linker pointing to a path that will exist on the host
machine, but may not exist on the build machine.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20191207030408.987508-1-espindola@scylladb.com>
Empty base type makes for less boiler plate in implementations.
The "before" callbacks are for listeners who need to potentially
react/augment type creation/alteration _before_ actually
committing type to schema tables (and holding the semaphore for this).
I.e. it is for cdc to add/modify log/desc tables "atomically" with base.
The vm.swappiness sysctl controls the kernel's prefernce for swapping
anonymous memory vs page cache. Since Scylla uses very large amounts
of anonymous memory, and tiny amounts of page cache, the correct setting
is to prefer swapping page cache. If the kernel swaps anonymous memory
the reactor will stall until the page fault is satisfied. On the other
hand, page cache pages usually belong to other applications, usually
backup processes that read Scylla files.
This setting has been used in production in Scylla Cloud for a while
with good results.
Users can opt out by not installing the scylla-kernel-conf package
(same as with the other kernel tunables).
* seastar 166061da3...e440e831c (8):
> Fail tests on ubsan errors
> future: make a couple of asserts more strict
> future: Move make_ready out of line
> config: Do not allow zero rates
Fixes#5360
> future: add new state to avoid temporaries in get_available_state().
> future: avoid temporary future_state on get_available_state().
> future: inline future::abandoned
> noncopyable_function: Avoid uninitialized warning on empty types
Previously, scylla used min/max(blob)->blob overload for collections,
tuples and UDTs; effectively making the results being printed as blobs.
This PR adds "dynamically"-typed min()/max() functions for compound types.
These types can be complicated, like map<int,set<tuple<..., and created
in runtime, so functions for them are created on-demand,
similarly to tojson(). The comparison remains unchanged - underneath
this is still byte-by-byte weak lex ordering.
Fixes#5139
* jul-stas/5139-minmax-bad-printing-collections:
cql_query_tests: Added tests for min/max/count on collections
cql3: min()/max() for collections/tuples/UDTs do not cast to blobs
This tests new min/max function for collections and tuples. CFs
in test suite were named according to types being tested, e.g.
`cf_map<int,text>' what is not a valid CF name. Therefore, these
names required "escaping" of invalid characters, here: simply
replacing with '_'.
Before:
cqlsh> insert into ks.list_types (id, val) values (1, [3,4,5]);
cqlsh> select max(val) from ks.list_types;
system.max(val)
------------------------------------------------------------
0x00000003000000040000000300000004000000040000000400000005
After:
cqlsh> select max(val) from ks.list_types;
system.max(val)
--------------------
[3, 4, 5]
This is accomplished similarly to `tojson()`/`fromjson()`: functions
are generated on demand from within `cql3::functions::get()`.
Because collections can have a variety of types, including UDTs
and tuples, it would be impossible to statically define max(T t)->T
for every T. Until now, max(blob)->blob overload was used.
Because `impl_max/min_function_for` is templated with the
input/output type, which can be defined in runtime, we need type-erased
("dynamic") versions of these functors. They work identically, i.e.
they compare byte representations of lhs and rhs with
`bytes::operator<`.
Resolves#5139
If you merge a pull request that contains multiple patches via
the github interface, it will document itself as the committer.
Work around this brain damage by using the command line.
Introduce a new verb dedicated for receiving and sending hints: HINT_MUTATION. It is handled on the streaming connection, which is separate from the one used for handling mutations sent by coordinator during a write.
The intent of using a separate connection is to increase fairness while handling hints and user requests - this way, a situation can be avoided in which one type of requests saturate the connection, negatively impacting the other one.
Information about new RPC support is propagated through new gossip feature HINTED_HANDOFF_SEPARATE_CONNECTION.
Fixes#4974.
Tests: unit(release)
To make gms::inet_address::to_string() similar in output to origin.
The sole purpose being quick and easy fix of API/JMX ipv6
formatting of endpoints etc, where strings are used as lexical
comparisons instead of textual representation.
A better, but more work, solution is to fix the scylla-jmx
bridge to do explicit parse + re-format of addresses, but there
are many such callpoints.
An even better solution would be to fix nodetool to not make this
mistake of doing lexical comparisons, but then we risk breaking
merge compatibility. But could be an option for a separate
nodeprobe impl.
Message-Id: <20191204135319.1142-1-calle@scylladb.com>
Currently query_options objects is passed to a trace stopping function
which makes it mandatory to make them alive until the end of the
query. The reason for that is to add prepared statement parameters to
the trace. All other query options that we want to put in the trace are
copied into trace_state::params_values, so lets copy prepared statement
parameters there too. Trace enabled case will become a little bit more
expensive but on the other hand we can drop a continuation that holds
query_options object alive from a fast path. It is safe to drop the call
to stop_foreground_prepared() here since The tracing will be stopped
in process_request_one().
Message-Id: <20191205102026.GJ9084@scylladb.com>
Merged pull request https://github.com/scylladb/scylla/pull/5381 by
Peng Jian, fixing multiple small issues with Redis:
* Rename the options related to Redis API, and describe them clearly.
* Rename redis_transport_port to redis_port
* Rename redis_transport_port_ssl to redis_ssl_port
* Rename redis_default_database_count to redis_database_count
* Remove unnecessary option enable_redis_protocol
* Modify the default value of opition redis_read_consistency_level and redis_write_consistency_level to LOCAL_QUORUM
* Fix the DEL command: support to delete mutilple keys in one command.
* Fix the GET command: return the empty string when the required key is not exists.
* Fix the redis-test/test_del_non_existent_key: mark xfail.
On aarch64, asan detected a use-after-move. It doesn't happen on x86_64,
likely due to different argument evaluation order.
Fix by evaluating full_slice before moving the schema.
Note: I used "auto&&" and "std::move()" even though full_slice()
returns a reference. I think this is safer in case full_slice()
changes, and works just as well with a reference.
Fixes#5419.
This commit makes sure that single-partition readers for
read-before-write do not have fast-forwarding enabled,
as it may lead to huge read amplification. The observed case was:
1. Creating an index.
CREATE INDEX index1 ON myks2.standard1 ("C1");
2. Running cassandra-stress in order to generate view updates.
cassandra-stress write no-warmup n=1000000 cl=ONE -schema \
'replication(factor=2) compaction(strategy=LeveledCompactionStrategy)' \
keyspace=myks2 -pop seq=4000000..8000000 -rate threads=100 -errors
skip-read-validation -node 127.0.0.1;
Without disabling fast-forwarding, single-partition readers
were turned into scanning readers in cache, which resulted
in reading 36GB (sic!) on a workload which generates less
than 1GB of view updates. After applying the fix, the number
dropped down to less than 1GB, as expected.
Refs #5409Fixes#4615Fixes#5418
This test execution time dominates by a serious margin
test execution time in dev/release mode: reducing its
execution time improves the test.py turnaround by over 70%.
Message-Id: <20191204135315.86374-2-kostja@scylladb.com>
Currently, 'scylla thread' uses arch_prctl() to extract the value of
fsbase, used to reference thread local variables. gdb 8 added support
for directly accessing the value as $fs_base, so use that instead. This
works from core dumps as well as live processes, as you don't need to
execute inferior functions.
The patch is required for debugging threads in core dumps, but not
sufficient, as we still need to set $rip and $rsp, and gdb still[1]
doesn't allow this.
[1] https://sourceware.org/bugzilla/show_bug.cgi?id=9370
The feature introduced by this commit declares that hints can be sent
using the new dedicated RPC verb. Before using the new verb, nodes need
to know if other nodes in the cluster will be able to handle the new
RPC verb.
Introduce a new verb dedicated for receiving and sending hints:
HINT_MUTATION. It is handled on the streaming connection, which is
separate from the one used for handling mutations sent by coordinator
during a write.
The intent of using a separate connection is to increase fariness while
handling hints and user requests - this way, a situation can be avoided
in which one type of requests saturate the connection, negatively
impacting the other one.
"
In several cases in distributed testing (dtest) we trigger compaction using nodetool compact assuming that when it is done, it is indeed really done.
However, the way compaction is currently implemented in scylla, it may leave behind some background tasks to delete the old sstables that were compacted.
This commit changes major compaction (triggered via the ss::force_keyspace_compaction api) so it would wait on the background deletes and will return only when they finish.
Fixes#4909
Tests: unit(dev), nodetool_refresh_with_data_perms_test, test_nodetool_snapshot_during_major_compaction
"
We may able to use chrony setup script on future version of RHEL/CentOS,
it better to run chrony setup when RHEL version >= 8, not only 8.
Note that on Fedora it still provides ntp/ntpdate package, so we run
ntp setup on it for now. (same on debian variants)
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20191203192812.5861-1-syuu@scylladb.com>
`segment_manager' now uses a decorated version of `timed_out_error'
with hardcoded name. On the other hand `region_group' uses named
`on_request_expiry' within its `expiring_fifo'.
Fixes#5211
In 79935df959 replay apply-call was
changed from one with no continuation to one with. But the frozen
mutation arg was still just lambda local.
Change to use do_with for this case as well.
Message-Id: <20191203162606.1664-1-calle@scylladb.com>
Exception messages contain semaphore's name (provided in ctor).
This affects the queue overflow exception as well as timeout
exception. Also, custom throwing function in ctor was changed
to `prethrow_action', i.e. metrics can still be updated there but
now callers have no control over the type of the exception being
thrown. This affected `restricted_reader_max_queue_length' test.
`reader_concurrency_semaphore'-s docs are updated accordingly.
In a build configured with --debuginfo 0 the scylla binary still ends
up with some debug info from the libraries that are statically linked
in.
We should avoid compiling subprojects (including seastar) with debug
info when none is needed, but this at least avoids it showing up in
the binary.
The main motivation for this is that it is confusing to get a binary
with *some* debug info in it.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20191127215843.44992-1-espindola@scylladb.com>
This series refactors the collection de/serialization code to use
fragmented buffers, avoiding the large allocations and the associated
pains when working with large collections. Currently all operations that
involve collections require deserializing them, executing the operation,
then serializing them again to their internal storage format. The
de/serialization operations happen in linearized buffers, which means
that we have to allocate a buffer large enough to hold the *entire*
collection. This can cause immense pressure on the memory allocator,
which, in the face of memory fragmentation, might be unable to serve the
allocation at all. We've seen this causing all sorts of nasty problems,
including but not limited to: failing compactions, failing memtable
flush, OOM crash and etc.
Users are strongly discouraged from using large collections, yet they
are still a fact of life and have been haunting us since forever.
The proper solution for these problems would be to come up with an
in-memory format for collections, however that is a major effort, with a
lot of unknowns. This is something we plan on doing at some point but
until it happens we should make life less painful for those with large
collections.
The goal of this series is to avoid the need of allocating these large
buffers. Serialization now happens into a `bytes_ostream` which
automatically fragments the values internally. Deserialization happens
with `utils::linearizing_input_stream` (introduced by this series), which
linearizes only the individual collection cells, but not the entire
collection.
An important goal of this series was to introduce the least amount of
risk, and hence the least amount of code. This series does not try to
make a revolution and completely revamp and optimize the
de/serialization codepaths. These codepaths have their days numbered so
investing a lot of effort into them is in vain. We can apply incremental
optimizations where we deem it necessary.
Fixes: #5341
Support to delete multiple keys in one DEL command.
The feature of returning number of the really deleted keys is still not supported.
Return empty string to client for GET command when the required key is not exists.
Fixes: #5334
Signed-off-by: Peng Jian <pengjian.uestc@gmail.com>
Rename option redis_transport_port to redis_port, which the redis transport listens on for clients.
Rename option redis_transport_port_ssl to redis_ssl_port, which the redis TLS transport listens on for clients.
Rename option redis_database_count. Set the redis dabase count.
Rename option redis_keyspace_opitons to redis_keyspace_replication_strategy_options. Set the replication strategy for redis keyspace.
Remove option enable_redis_protocol, which is unnecessary.
Fixes: #5335
Signed-off-by: Peng Jian <pengjian.uestc@gmail.com>
Commit 96009881d8 added diffutils to the dependencies via
Seastar's install-dependencies.sh, after it was inadvertantly
dropped in 1164ff5329 (update to Fedora 31; diffutils is no
longer brought in as a side effect of something else).
Regenerate the image to include diffutils.
Ref #5401.
This patch added subtests for EOF process, it reads and writes the socket
directly by using protocol cmds.
We can add more tests in future, tests with Redis module will hide some
protocol error.
Signed-off-by: Amos Kong <amos@scylladb.com>
podman needs to relabel directories in exactly the same cases docker
does. The difference is that podman cannot relabel /tmp.
The reason it was working before is that in practice anyone using
dbuild has already relabeled any directories that need relabeling,
with the exception of /tmp, since it is recreated on every boot.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20191201235614.10511-2-espindola@scylladb.com>
Use `utils::linearizing_input_stream` for the deserizalization of the
collection. Allows for avoiding the linearization of the entire cell
value, instead only linearizing individual values as they are
deserialized from the buffer.
`linearizing_input_stream` allows transparently reading linearized
values from a fragmented buffer. This is done by linearizing on-the-fly
only those read values that happen to be split across multiple
fragments. This reduces the size of the largest allocation from the size
of the entire buffer (when the entire buffer is linearized) to the size
of the largest read value. This is a huge gain when the buffer contains
loads of small objects, and modest gains when the buffer contains few
large objects. But the even in the worst case the size of the largest
allocation will be less or equal compared to the case where the entire
buffer is linearized.
This stream is planned to be used as glue code between the fragmented
cell value and the collection deserialization code which expects to be
reading linearized values.
Currently the loop which writes the data from the fragmented origin to
the destination, moves to the next chunk eagerly after writing the value
of the current chunk, if the current chunk is exhausted.
This presents a problem when we are writing the last piece of data from
the last chunk, as the chunk will be exhausted and we eagerly attempt to
move to the next chunk, which doesn't exist and dereferencing it will
fail. The solution is to not be eager about moving to the next chunk and
only attempt it if we actually have more data to write and hence expect
more chunks.
The presence of `const_iterator` seems to be a requirement as well
although it is not part of the concept. But perhaps it is just an
assumption made by code using it.
Not just bytes::output_iterator. Allow writing into streams other than
just `bytes`. In fact we should be very careful with writing into
`bytes` as they require potentially large contiguous allocations.
The `write()` method is now templatized also on the type of its first
argument, which now accepts any CharOutputIterator. Due to our poor
usage of namespace this now collides with `write` defined inside
`db/commitlog/commitlog.cc`. Luckily, the latter doesn't really have to
be templatized on the data type it reads from, and de-templatizing it
resolves the clash.
Currently interactive RAID setup prompt does not list virtio-blk devices due to
following reasons:
- We fail matching '-p' option on 'lsblk --help' output since misusage of
regex functon, list_block_devices() always skipping to use lsblk output.
- We don't check existance of /dev/vd* when we skipping to use lsblk.
- We mistakenly excluded virtio-blk devices on 'lsblk -pnr' output using '-e'
option, but we actually needed them.
To fix the problem we need to use re.search() instead of re.match() to match
'-p' option on 'lsblk --help', need to add '/dev/vd*' on block device list,
then need to stop '-e 252' option on lsblk which excludes virtio-blk.
Additionally, it better to parse 'TYPE' field of lsblk output, we should skip
'loop' devices and 'rom' devices since these are not disk devices.
Fixes#4066
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20191201160143.219456-1-syuu@scylladb.com>
Merged pull request https://github.com/scylladb/scylla/pull/5392 from
Dejan Mircevski.
Refs #5034
The patches:
alternator: Implement LE operator in Expected
alternator: Implement GE operator in Expected
alternator: Make cmp diagnostic a value, not funct
utils: Add operator<< for big_decimal
alternator: Implement BETWEEN operator in Expected
All check_compare diagnostics are static strings, so there's no need
to call functions to get them. Instead of a function, make diagnostic
a simple value.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
"Fix two problem in scylla_io_setup:
- Problem 1: paths of default directories is invalid, introduced by
commit 5ec1915 ("scylla_io_setup: assume default directories under
/var/lib/scylla").
- Problem 2: wrong path join, introduced by commit 31ddb21
("dist/common/scripts: support nonroot mode on setup scripts").
Fix a problem in scylla_io_setup, scylla_fstrim and scylla_blocktune.py:
- Fixed default scylla directories when they aren't assigned in
scylla.yaml"
Fixes#5370
Reviewed-by: Pavel Emelyanov <xemul@scylladb.com>
* 'scylla_io_setup' of git://github.com/amoskong/scylla:
use parse_scylla_dirs_with_default to get scylla directories
scylla_io_setup: fix data_file_directories check
scylla_util: introduce helper to process the default scylla directories
scylla_util: get workdir by datadir() if it's not assigned in scylla.yaml
scylla_io_setup: fix path join of default scylla directories
Use asyncio as a more modern way to work with concurrency,
Process signals in an event loop, terminate all outstanding
tests before exiting.
Breaking change: this commit requires Python 3.7 or
newer to run this script. The patch adds a version
check and a message to enforce it.
Similar to trace_state keep shared_ptr<tracing> _local_tracing_ptr
in one_session_records when constructed so it can be used
during shutdown.
Fixes#5243
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
_user cannot outlive client_state class instance, so there is no point
in holding it in shared_ptr.
Tested: debug test.py and dtest auth_test.py
Message-Id: <20191128131217.26294-5-gleb@scylladb.com>
Only do_stop_rpc_server uses the shared_ptr to prolong server's
lifetime until stop() completes, but do_with() can be used to achieve the
same.
Message-Id: <20191128131217.26294-3-gleb@scylladb.com>
Only do_stop_native_transport() uses the shared_ptr to prolong server's
lifetime until stop() completes, but do_with() can be used to achieve the
same.
Message-Id: <20191128131217.26294-2-gleb@scylladb.com>
* seastar 6f0ef32514...5c25de907a (7):
> shared_future: Fix crash when all returned futures time out
Fixes#5322.
> future: don't create temporaries on get_value().
> reactor: lower the default stall threshold to 200ms
> reactor: Simplify network initialization
> reactor: Replace most std::function with noncopyable_function
> futures: Avoid extra moves in SEASTAR_TYPE_ERASE_MORE mode
> inet_address: Make inet_address == operator ignore scope (again)
Merged pull request https://github.com/scylladb/scylla/pull/5311 from
Juliusz Stasiewicz:
This is a partial solution to #5139 (only for two types) because of the
above and because collections are much harder to do. They are coming in
a separate PR.
References #5139. Aggregate functions, like max(), when invoked
on `inet_address' and `time_native_type' used to choose
max(blob)->blob overload, with casting of argument and result to
bytes. This is because appropriate calls to
`aggregate_fcts::make_XXX_function()' were missing. This commit
adds them. Functioning remains the same but now clients see
user-friendly representations of aggregate result, not binary.
Comparing inet addresses without inet::operator< is performed by
trick, where ADL is bypassed by wrapping the name of std::min/max
and providing an overload of wrapper on inet type.
Currently we support to assign workdir from scylla.yaml, and we use many
hardcode '/var/lib/scylla' in setup scripts.
Some setup scripts get scylla directories by parsing scylla.yaml, introduced
parse_scylla_dirs_with_default() that adds default values if scylla directories
aren't assigned in scylla.yaml
Signed-off-by: Amos Kong <amos@scylladb.com>
Currently we are checking an invalid path of some default scylla directories,
the directories don't exist, so the tune will always be skipped. It caused by
two problem.
Problem 1: paths of default directories is invalid
Introduced by commit 5ec191536e, we try to tune some scylla default directories
if they exist. But the directory paths we try are wrong.
For example:
- What we check: /var/lib/scylla/commitlog_directory
- Correct one: /var/lib/scylla/commitlog
Problem 2: wrong path join
Introduced by commit 31ddb2145a, default_path might be replaced from
'/var/lib/scylla/' to '/var/lib/scylla'.
Our code tries to check an invalid path that is wrongly join, eg:
'/var/lib/scyllacommitlog'
Signed-off-by: Amos Kong <amos@scylladb.com>
The default values of data_file_directories and commitlog_directory were
commented by commit e0f40ed16a. It causes scylla_util.py:get_scylla_dirs() to
fail in checking the values.
This patch changed get_scylla_dirs() to return default data/commitlog
directories if they aren't set.
Fixes#5358
Reviewed-by: Pavel Emelyanov <xemul@scylladb.com>
Signed-off-by: Amos Kong <amos@scylladb.com>
Add a test, test_query.py::test_query_limit, to verify that the Limit
parameter correctly limits the number of rows returned by the Query.
This was supposed to already work correctly - but we never had a test for
it. As we hoped, the test passes (on both Alternator and DynamoDB).
Another test, test_query.py::test_query_limit_paging, verifies that
paging can be done with any setting of Limit. We already had tests
for paging of the Scan operation, but not for the Query operation.
Refs #5153
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This is a comprehensive test for the "Select" parameter of Query and Scan
operations, but only for the base-table case, not index, so another future
patch should add similar tests in test_gsi.py and test_lsi.py as well.
The main use of the Select parameter is to allow returning just the count
of items, instead of their content, but it also has other esoteric options,
all of which we test here.
The test currently succeeds on AWS DynamoDB, demonstrating that the test
is correct, but fails on Alternator because the "Select" parameter is not
yet supported. So the test is marked xfail.
Refs #5058
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Currently the command tries to read all seastar smp queues in its
initialization code in the constructor. This constructor is run each
time `scylla-gdb.py` is sourced in `gdb` which leads to slowdowns and
sometimes also annoying errors because the sourcing happens in the wrong
context and seastar symbols are not available.
Avoid this by running this initializing code lazily, on the first
invocation.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20191127095408.112101-1-bdenes@scylladb.com>
This patchset adds missing "const" function qualifiers throughout
the Scylla code base, which would make code less error-prone.
The changeset incorporates Kostja's work regarding const qualifiers
in the cql code hierarchy along with a follow-up patch addressing the
review comment of the corresponding patch set (the patch subject is
"cql: propagate const property through prepared statement tree.").
The boost 1.67 release notes says
Changed maximum supported year from 10000 to 9999 to resolve various issues
So change the test to use a larger number so that we get an exception
with both boost 1.66 and boost 1.67.
Fixes#5344
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20191126180327.93545-1-espindola@scylladb.com>
Docker on Fedora 31 is flakey, and is not supported at all on RHEL 8.
Podman is a drop-in replacement for docker; this series adds support
for using podman in dbuild.
Apart from actually working on Fedora 31 hosts,
podman is nicer in being more secure and not requiring a daemon.
Fixes#5332
As suggested in issue #4586 here is the helper that prints
"shutting down foo" message, then shuts the foo down, then
prints the "[it] was successull" one. In between it catches
the exception (if any) and warns this in logs.
By "then" I mean literally then, not the seastar's then() :)
Fixes: #4586
By default, semaphore exceptions bring along very little context:
either that a semaphore was broken or that it timed out.
In order to make debugging easier without introducing significant
runtime costs, a notion of named semaphore is added.
A named semaphore is simply a semaphore with statically defined
name, which is present in its errors, bringing valuable context.
A semaphore defined as:
auto sem = semaphore(0);
will present the following message when it breaks:
"Semaphore broken"
However, a named semaphore:
auto named_sem = named_semaphore(0, named_semaphore_exception_factory{"io_concurrency_sem"});
will present a message with at least some debugging context:
"Semaphore broken: io_concurrency_sem"
It's not much, but it would really help in pinpointing bugs
without having to inspect core dumps.
At the same time, it does not incur any costs for normal
semaphore operations (except for its creation), but instead
only uses more CPU in case an error is actually thrown,
which is considered rare and not to be on the hot path.
Refs #4999
Tests: unit(dev), manual: hardcoding a failure in view building code
Currently scylla_io_setup will skip in scylla_setup, because we didn't support
those new instance types.
I manually executed scylla_io_setup, and the scylla-server started and worked
well.
Let's apply this patch first, then check if there is some new problem in
ami-test.
Signed-off-by: Amos Kong <amos@scylladb.com>
cql_statement is a class representing a prepared statement in Scylla.
It is used concurrently during execution, so it is important that its
change is not changed by execution.
Add const qualifier to the execution methods family, throghout the
cql hierarchy.
Mark a few places which do mutate prepared statement state during
execution as mutable. While these are not affecting production today,
as code ages, they may become a source of latent bugs and should be
moved out of the prepared state or evaluated at prepare eventually:
cf_property_defs::_compaction_strategy_class
list_permissions_statement::_resource
permission_altering_statement::_resource
property_definitions::_properties
select_statement::_opts
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
---
v2:
- Have stop easrlier so that exception in start/listen do
not prevent prometheu.stop from calling
As suggested in issue #4586 here is the helper that prints
"shutting down foo" message, then shuts the foo down, then
prints the "shutting down foo was successfull". In between
it catches the exception (if any) and warns this in logs.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Merged pull request https://github.com/scylladb/scylla/pull/5218
from Piotr Jastrzębski:
Users should be able to decide whether they need preimage or not. There is
already an option for that but it's not respected by the implementation.
This PR adds support for this functionality.
Tests: unit(dev).
Individual patches:
cdc: Don't take storage_proxy as transformer::pre_image_select param
cdc::append_log_mutations: use do_with instead of shared_ptr
cdc::append_log_mutations: fix undefined behavior
cdc: enable preimage in test_pre_image_logging test
cdc: Return preimage only when it's requested
cdc: test both enabled and disabled preimage in test_pre_image_logging
Before stopping the db itself, stop the migration service.
It must be stopped before RPC, but RPC is not stopped yet
itself, so we should be safe here.
Here's the tail of the resulting logs:
INFO 2019-11-20 11:22:35,193 [shard 0] init - shutdown migration manager
INFO 2019-11-20 11:22:35,193 [shard 0] migration_manager - stopping migration service
INFO 2019-11-20 11:22:35,193 [shard 1] migration_manager - stopping migration service
INFO 2019-11-20 11:22:35,193 [shard 0] init - Shutdown database started
INFO 2019-11-20 11:22:35,193 [shard 0] init - Shutdown database finished
INFO 2019-11-20 11:22:35,193 [shard 0] init - stopping prometheus API server
INFO 2019-11-20 11:22:35,193 [shard 0] init - Scylla version 666.development-0.20191120.25820980f shutdown complete.
Also -- stop the mm on drain before the commitlog it stopped.
[Tomasz: mm needs the cl because pulling schema changes from other nodes
involves applying them into the database. So cl/db needs to be
stopped after mm is stopped.]
The drain logs would look like
...
INFO 2019-11-25 11:00:40,562 [shard 0] migration_manager - stopping migration service
INFO 2019-11-25 11:00:40,562 [shard 1] migration_manager - stopping migration service
INFO 2019-11-25 11:00:40,563 [shard 0] storage_service - DRAINED:
and then on stop
...
INFO 2019-11-25 11:00:46,427 [shard 0] init - shutdown migration manager
INFO 2019-11-25 11:00:46,427 [shard 0] init - Shutdown database started
INFO 2019-11-25 11:00:46,427 [shard 0] init - Shutdown database finished
INFO 2019-11-25 11:00:46,427 [shard 0] init - stopping prometheus API server
INFO 2019-11-25 11:00:46,427 [shard 0] init - Scylla version 666.development-0.20191125.3eab6cd54 shutdown complete.
Fixes#5300
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20191125080605.7661-1-xemul@scylladb.com>
In get_full_row_hashes_with_rpc_stream and
repair_get_row_diff_with_rpc_stream_process_op which were introduced in
the "Repair switch to rpc stream" series, rx_hashes_nr metrics are not
updated correctly.
In the test we have 3 nodes and run repair on node3, we makes sure the
following metrics are correct.
assertEqual(node1_metrics['scylla_repair_tx_hashes_nr'] + node2_metrics['scylla_repair_tx_hashes_nr'],
node3_metrics['scylla_repair_rx_hashes_nr'])
assertEqual(node1_metrics['scylla_repair_rx_hashes_nr'] + node2_metrics['scylla_repair_rx_hashes_nr'],
node3_metrics['scylla_repair_tx_hashes_nr'])
assertEqual(node1_metrics['scylla_repair_tx_row_nr'] + node2_metrics['scylla_repair_tx_row_nr'],
node3_metrics['scylla_repair_rx_row_nr'])
assertEqual(node1_metrics['scylla_repair_rx_row_nr'] + node2_metrics['scylla_repair_rx_row_nr'],
node3_metrics['scylla_repair_tx_row_nr'])
assertEqual(node1_metrics['scylla_repair_tx_row_bytes'] + node2_metrics['scylla_repair_tx_row_bytes'],
node3_metrics['scylla_repair_rx_row_bytes'])
assertEqual(node1_metrics['scylla_repair_rx_row_bytes'] + node2_metrics['scylla_repair_rx_row_bytes'],
node3_metrics['scylla_repair_tx_row_bytes'])
Tests: repair_additional_test.py:RepairAdditionalTest.repair_almost_synced_3nodes_test
Fixes: #5339
Backports: 3.2
The code was iterating over a collection that was modified
at the same time. Iterators were used for that and collection
modification can invalidate all iterators.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Merged pull request https://github.com/scylladb/scylla/pull/5310 from
Avi Kivity:
This is a minor update as gcc and boost versions did not change. A noteable
update is patchelf 0.10, which adds support to large binaries.
A few minor issues exposed by the update are fixed in preparatory patches.
Patches:
dist: rpm: correct systemd post-uninstall scriptlet
build: force xz compression on rpm binary payload
tools: toolchain: update to Fedora 31
Since 90d6c0b, cache will abort when trying to detach partition
entries while they're updated. This should never happen. It can happen
though, when the update fails on bad_alloc, because the cleanup guard
invalidates the cache before it releases partition snapshots (held by
"update" coroutine).
Fix by destroying the coroutine first.
Fixes#5327.
Tests:
- row_cache_test (dev)
Message-Id: <1574360259-10132-1-git-send-email-tgrabiec@scylladb.com>
By default rpm uses dwz to merge the debug info from various
binaries. Unfortunately, it looks like addr2line has not been updated
to handle this:
// This works
$ addr2line -e build/release/scylla 0x1234567
$ dwz -m build/release/common.debug build/release/scylla.debug build/release/iotune.debug
// now this fails
$ addr2line -e build/release/scylla 0x1234567
I think the issue is
https://sourceware.org/bugzilla/show_bug.cgi?id=23652Fixes#5289
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20191123015734.89331-1-espindola@scylladb.com>
By default we were compressing debug info only in release
executables. The idea, if I understand it correctly, is that those are
the ones we ship, so we want a more compact binary.
I don't think that was doing anything useful. The compression is just
gzip, so when we ship a .tar.xz, having the debug info compressed
inside the scylla binary probably reduces the overall compression a
bit.
When building a rpm the situation in amusing. As part of the rpm
build process the debug info is decompressed and extracted to an
external file.
Given that most of the link time goes to compressing debug info, it is
probably a good idea to just skip that.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20191123022825.102837-1-espindola@scylladb.com>
Structure the code to be able to introduce futures.
Apply trivial cleanups.
Switch to asyncio and use it to work with processes and
handle signals. Cleanup all processes upon signal.
This patch implements a simple optimization for LWT: it makes PAXOS
prepare phase query locally and return the current value of the modified
key so that a separate query is not necessary. For more details see
patch 6. Patch 1 fixes a bug in next. Patches 2-5 contain trivial
preparatory refactoring.
Current LWT implementation uses at least three network round trips:
- first, execute PAXOS prepare phase
- second, query the current value of the updated key
- third, propose the change to participating replicas
(there's also learn phase, but we don't wait for it to complete).
The idea behind the optimization implemented by this patch is simple:
piggyback the current value of the updated key on the prepare response
to eliminate one round trip.
To generate less network traffic, only the closest to the coordinator
replica sends data while other participating replicas send digests which
are used to check data consistency.
Note, this patch changes the API of some RPC calls used by PAXOS, but
this should be okay as long as the feature in the early development
stage and marked experimental.
To assess the impact of this optimization on LWT performance, I ran a
simple benchmark that starts a number of concurrent clients each of
which updates its own key (uncontended case) stored in a cluster of
three AWS i3.2xlarge nodes located in the same region (us-west-1) and
measures the aggregate bandwidth and latency. The test uses shard-aware
gocql driver. Here are the results:
latency 99% (ms) bandwidth (rq/s) timeouts (rq/s)
clients before after before after before after
1 2 2 626 637 0 0
5 4 3 2616 2843 0 0
10 3 3 4493 4767 0 0
50 7 7 10567 10833 0 0
100 15 15 12265 12934 0 0
200 48 30 13593 14317 0 0
400 185 60 14796 15549 0 0
600 290 94 14416 15669 0 0
800 568 118 14077 15820 2 0
1000 710 118 13088 15830 9 0
2000 1388 232 13342 15658 85 0
3000 1110 363 13282 15422 233 0
4000 1735 454 13387 15385 329 0
That is, this optimization improves max LWT bandwidth by about 15%
and allows to run 3-4x more clients while maintaining the same level
of system responsiveness.
invoke_on() guarantees that captures object won't be destroyed until the
future returned by the invoked function is resolved so there's no need
to move key, token, proposal for calling paxos_state::*_impl helpers.
The test_health_only_works_for_root_path test checks that while Alternator's
HTTP server responds to a "GET /" request with success ("health check"), it
should respond to different URLs with failures (page not found).
One of the URLs it tested was "/..", but unfortunately some versions of
Python's HTTP client canonize this request to just a "/", causing the
request to unexpectedly succeed - and the test to fail.
So this patch just drops the "/.." check. A few other nonsense URLs are
attempted by the test - e.g., "/abc".
Fixes#5321
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
One of the fields still missing in DescribeTable's response (Refs #5026)
was the table's schema - KeySchema and AttributeDefinitions.
This patch adds this missing feature, and enables the previously-xfailing
test test_describe_table_schema.
A complication of this patch is that in a table with secondary indexes,
we need to return not just the base table's schema, but also the indexes'
schema. The existing tests did not cover that feature, so we add here
two more tests in test_gsi.py for that.
One of these secondary-index schema tests, test_gsi_2_describe_table_schema,
still fails, because it outputs a range-key which Scylla added to a view
because of its own implementation needs, but wasn't in the user's
definition of the GSI. I opened a separate issue #5320 for that.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Serialize reference_wrapper<T> as T and make sure is_equivalent<> treats
reference_wrapper<T> wrapped in std::optional<> or std::variant<>, or
std::tuple<> as T.
We need it to avoid copying query::result while serializing
paxos::promise.
Currently even if `-a` or `-s 0` is provided, `scylla task_histogram`
will scan a limited amount of pages due to a bug in the scan loop's stop
condition, which will be trigger a stop once the default sample limit is
reached. Fix the loop by skipping this check when the user wants to scan
all tasks.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20191121141706.29476-1-bdenes@scylladb.com>
At least some versions of 'podman logs --follow' hang when the
container eventually exits (also happens with docker on recent
versions). Fortunately, we don't need to use 'podman logs --follow'
and can use the more natural non-detached 'podman run', because
podman does not proxy SIGTERM and instead shuts down the container
when it receives it.
So, to work around the problem, use the same code path in interactive
and non-interactive runs, when podman is in use instead of docker.
With docker, we went to considerable lengths to ensure that
access to mounted volume was done using the calling user, including
supplementary groups. This avoids root-owned files being left around
after a build, and ensures that access to group-shared files (like
/var/cache/ccache) works as expected.
All of this is unnecessary and broken when using podman. Podman
uses a proxy to access files on behalf of the container, so naturally
all access is done using the calling user's identity. Since it remaps
user and group IDs, assigning the host uid/gid is meaningless. Using
--userns host also breaks, because sudo no longer works.
Fix this by making all the uid/gid/selinux games specific to docker and
ignore them when using podman. To preserve the functionality of tools
that depend on $HOME, set that according to the host setting.
The original fix (10f6b125c8) didn't
take into account that if there was a failed memtable flush (Refs
flush) but is not a flushable memtable because it's not the latest in
the memtable list. If that happens, it means no other memtable is
flushable as well, cause otherwise it would be picked due to
evictable_occupancy(). Therefore the right action is to not flush
anything in this case.
Suspected to be observed in #4982. I didn't manage to reproduce after
triggering a failed memtable flush.
Fixes#3717
podman refuses to start with duplicate volumes, which routinely
happen if the toplevel directory is the working directory. Detect
this and avoid the duplicate.
UnitTest class uses juggles with the name 'args' quite a bit to
construct the command line for a unit test, so let's spread
the harness command line arguments from the unit test command line
arguments a bit apart by consistently calling the harness command line
arguments 'options', and unit test command line arguments 'args'.
Rename usage() to parse_cmd_line().
Create unique UnitTest objects in find_tests() for each found match,
including repeat, to ensure each test has its own unique id.
This will also be used to store execution state in the test.
It somewhat stands in the way of using asyncio
This patch also implements a more comprehensive
fix for #5303, since we not only have --repeat, but
run some tests in different configurations, in which
case xml output is also overwritten.
When starting scylla daemon as non-root the initialization fails
because standard /var/lib/scylla is not accessible by regular users.
Making the default dir accessible for user is not very convenient
either, as it will cause conflicts if two or more instances of scylla
are in use.
This problem can be resolved by specifying --commitlog-directory,
--data-file-directories, etc on start, but it's too much typing. I
propose to revive Nadav's --home option that allows to move all the
directories under the same prefix in one go.
Unlike Nadav's approach the --workdir option doesn't do any tricky
manipulations with existing directories. Insead, as Pekka suggested,
the individual directories are placed under the workir if and only
if the respective option is NOT provided. Otherwise the directory
configuration is taken as is regardless of whether its absolute or
relative path.
The values substutution is done early on start. Avi suggested that
this is unsafe wrt HUP config re-read and proper paths must be
resolved on the fly, but this patch doesn't address that yet, here's
why.
First of all, the respective options are MustRestart now and the
substitution is done before HUP handler is installed.
Next, commitlog and data_file values are copied on start, so marking
the options as LiveUpdate won't make any effect.
Finally, the existing named_value::operator() returns a reference,
so returning a calculated (and thus temporary) value is not possible
(from my current understanding, correct me if I'm wrong). Thus if we
want the *_directory() to return calculated value all callers of them
must be patched to call something different (e.g. *_directory.get() ?)
which will lead to more confusion and errors.
Changes v3:
- the option is --workdir back again
- the existing *directory are only affected if unset
- default config doesn't have any of these set
- added the short -W alias
Changes v2:
- the option is --home now
- all other paths are changed to be relative
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20191119130059.18066-1-xemul@scylladb.com>
I found these mismatched types while converting some member functions
to standalone functions, since they have to use the public API that
has more type checks.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20191120181213.111758-4-espindola@scylladb.com>
Use pkg-config to search for Lua dependencies rather
than hard-code include and link paths.
Avoid using boost internals, not present in earlier
versions of boost.
Reviewed-by: Rafael Avila de Espindola <espindola@scylladb.com>
Message-Id: <20191120170005.49649-1-kostja@scylladb.com>
Use `-Wl,--threads` flag to enable multi-threaded linking when
using `ld.gold` linker.
Additional compilation test is required because it depends on whether
or not the `gold` linker has been compiled with `--enable-threads` option.
This patch introduces a substantial improvement to the link times of
`scylla` binary in release and debug modes (around 30 percent).
Local setup reports the following numbers with release build for
linking only build/release/scylla:
Single-threaded mode:
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:09.30
Multi-threaded mode:
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:51.57
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20191120163922.21462-1-pa.solodovnikov@scylladb.com>
Merged patch series from Peng Jian, adding optionally-enabled Redis API
support to Scylla. This feature is experimental, and partial - the extent
of this support is detailed in docs/redis/redis.md.
Patches:
Document: add docs/redis/redis.md
redis: Redis API in Scylla
Redis API: graft redis module to Scylla
redis-test: add test cases for Redis API
This is a minor update as gcc and boost versions do not change.
glibc-langpack-en no longer gets pulled in by default. As it is required
by some locale use somewhere, it is added to the explicit dependencies.
Fedora 31 switched the default compression to zstd, which isn't readable
by some older rpm distributions (CentOS 7 in particular). Tell it to use
the older xz compression instead, so packages produced on Fedora 31 can
be installed on older distributions.
The post-uninstall scriptlet requires a parameter, but older versions
of rpm survived without it. Fedora 31's rpm is more strict, so supply
this parameter.
In this document, the detailed design and implementation of Redis API in
Scylla is provided.
v2: build: work around ragel 7 generated code bug (suggested by Avi)
Ragel 7 incorrectly emits some unused variables that don't compile.
As a workaround, sed them away.
Signed-off-by: Peng Jian <pengjian.uestc@gmail.com>
Signed-off-by: Amos Kong <amos@scylladb.com>
Scylla has advantage and amazing features. If Redis build on the top of Scylla,
it has the above features automatically. It's achived great progress
in cluster master managment, data persistence, failover and replication.
The benefits to the users are easy to use and develop in their production
environment, and taking avantages of Scylla.
Using the Ragel to parse the Redis request, server abtains the command name
and the parameters from the request, invokes the Scylla's internal API to
read and write the data, then replies to client.
Signed-off-by: Peng Jian, <pengjian.uestc@gmail.com>
Merged patch set by Piotr Dulikowski:
This change corrects condition on which a row was considered expired by its
TTL.
The logic that decides when a row becomes expired was inconsistent with the
logic that decides if a single cell is expired. A single cell becomes expired
when expiry_timestamp <= now, while a row became expired when
expiry_timestamp < now (notice the strict inequality). For rows inserted
with TTL, this caused non-key cells to expire (change their values to null)
one second before the row disappeared. Now, row expiry logic uses non-strict
inequality.
Fixes#4263,
Fixes#5290.
Tests:
unit(dev)
python test described in issue #5290
It is useful to have an option to limit the execution time of a shell
script.
This patch adds an optional timeout parameter, if a parameter will be
provided a command will return and failure if the duration is passed.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Merged patch series from Juliusz Stasiewicz:
Welcome to my first PR to Scylla!
The task was intended as a warm-up ("noob") exercise; its description is
here: #4182 Sorry, I also couldn't help it and did some scouting: edited
descriptions of some metrics and shortened few annoyingly long LoC.
Those are typically symptoms of use-after-free or memory corruption in
the program. It's better to catch such error sooner than later.
That situation is also dangerous since if a valid descriptor would
land under the invalid access, not the one which was intended for the
operation, then the operation may be performed on the wrong file and
result in corruption.
Message-Id: <1565206788-31254-1-git-send-email-tgrabiec@scylladb.com>
This change corrects condition on which a row was considered expired by
its TTL.
The logic that decides when a row becomes expired was inconsistent with
the logic that decides if a single cell is expired. A single cell
becomes expired when `expiry_timestamp <= now`, while a row became
expired when `expiry_timestamp < now` (notice the strict inequality).
For rows inserted with TTL, this caused non-key cells to expire (change
their values to null) one second before the row disappeared. Now, row
expiry logic uses non-strict inequality.
Fixes: #4263, #5290.
Tests:
- unit(dev)
- python test described in issue #5290
Currently, we overwrite the same XML output file for each test repeat
cycle. This can cause invalid XML to be generated if the XML contents
don't match exactly for every iteration.
Fix the problem by appending the test repeat cycle in the XML filename
as follows:
$ ./test.py --repeat 3 --name vint_serialization_test --mode dev --jenkins jenkins_test
$ ls -1 *.xml
jenkins_test.release.vint_serialization_test.0.boost.xml
jenkins_test.release.vint_serialization_test.1.boost.xml
jenkins_test.release.vint_serialization_test.2.boost.xml
Fixes#5303.
Message-Id: <20191119092048.16419-1-penberg@scylladb.com>
In a cross-dc large cluster, the receiver node of the gossip SYN message
might be slow to send the gossip ACK message. The ack messages can be
large if the payload of the application state is big, e.g.,
CACHE_HITRATES with a lot of tables. As a result, the unlimited ACK
message can consume unlimited amount of memory which causes OOM
eventually.
To fix, this patch queues the SYN message and handles it later if the
previous ACK message is still being sent. However, we only store the
latest SYN message. Since the latest SYN message from peer has the
latest information, so it is safe to drop the previous SYN message and
keep the latest one only. After this patch, there can be at most 1
pending SYN message and 1 pending ACK message per peer node.
"
This patch series adds only UDF support, UDA will be in the next patch series.
With this all CQL types are mapped to Lua. Right now we setup a new
lua state and copy the values for each argument and return. This will
be optimized once profiled.
We require --experimental to enable UDF in case there is some change
to the table format.
"
* 'espindola/udf-only-v4' of https://github.com/espindola/scylla: (65 commits)
Lua: Document the conversions between Lua and CQL
Lua: Implement decimal subtraction
Lua: Implement decimal addition
Lua: Implement support for returning decimal
Lua: Implement decimal to string conversion
Lua: Implement decimal to floating point conversion
Lua: Implement support for decimal arguments
Lua: Implement support for returning varint
Lua: Implement support for returning duration
Lua: Implement support for duration arguments
Lua: Implement support for returning inet
Lua: Implement support for inet arguments
Lua: Implement support for returning time
Lua: Implement support for time arguments
Lua: Implement support for returning timeuuid
Lua: Implement support for returning uuid
Lua: Implement support for uuid and timeuuid arguments
Lua: Implement support for returning date
Lua: Implement support for date arguments
Lua: Implement support for returning timestamp
...
Add mode_list rule to ninja build and use it by default when searching
for tests in test.py.
Now it is no longer necessary to explicitly specify the test mode when
invoking test.py.
(cherry picked from commit a211ff30c7f2de12166d8f6f10d259207b462d4b)
The goal of this patch is to fix issue #5280, a rather serious Alternator
bug, where Scylla fails to restart when an Alternator table has secondary
indexes (LSI or GSI).
Traditionally, Cassandra allows table names to contain only alphanumeric
characters and underscores. However, most of our internal implementation
doesn't actually have this restriction. So Alternator uses the characters
':' and '!' in the table names to mark global and local secondary indexes,
respectively. And this actually works. Or almost...
This patch fixes a problem of listing, during boot, the sstables stored
for tables with such non-traditional names. The sstable listing code
needlessly assumes that the *directory* name, i.e., the CF names, matches
the "\w+" regular expression. When an sstable is found in a directory not
matching such regular expression, the boot fails. But there is no real
reason to require such a strict regular expression. So this patch relaxes
this requirement, and allows Scylla to boot with Alternator's GSI and LSI
tables and their names which include the ":" and "!" characters, and in
fact any other name allowed as a directory name.
Fixes#5280.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20191114153811.17386-1-nyh@scylladb.com>
This document adds information about how fixes are tracked to be
backported into releases and what is the procedure that is followed to
backport those fixes.
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
Allow filtering the resolved addresses by a startswith string.
The common use case if for resolving vtable ptrs, when resolving
the output of `find_vptrs` that may be too long for the host
(running gdb) memory size. In this case the number of vtable
ptrs is considerably smaller than the total number of objects
returned by find_ptrs (e.g. 462 vs. 69625 in a OOM core I
examined from scylla --smp=2 --memory=1024M)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
CQL tracing would only report file I/O involving one sstable, even if
multiple sstables were read from during the query.
Steps to reproduce:
create a table with NullCompactionStrategy
insert row, flush memtables
insert row, flush memtables
restart Scylla
tracing on
select * from table
The trace would only report DMA reads from one of the two sstables.
Kudos to @denesb for catching this.
Related issue: #4908
There are ... signs of massive start/stop code rework in the
main() function. While fixing the sub-modules interdependencies
during start/stop I've polished these signs too, so here's the
simplest ones.
This is just the minimum to pass a value to Lua. Right now you can't
actually do anything with it.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This adds support for all integer types. Followup commits will
implement the missing types.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This makes it substantially simpler to support both varint and
decimal, which will be implemented in a followup patch.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
With this we support all simple integer types. Followup patches will
implement the missing types.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This add a wrapper around the lua interpreter so that function
executions are interruptible and return futures.
With this patch it is possible to write and use simple UDFs that take
and return integer values.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This adds a requires_thread predicate to functions and propagates that
up until we get to code that already returns futures.
We can then use the predicate to decide if we need to use
seastar::async.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This refactors test_schema_digest_does_not_change to also test a
schema with user defined functions and user defined aggregates.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
With this it is possible to create user defined functions and
aggregates and they are saved to disk and the schema change is
propagated.
It is just not possible to call them yet.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The parser now rejects having both OR REPLACE and IF NOT EXISTS in the
same statement.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This updates UDF syntax to the current specification.
In particular, this removes DETERMINISTIC and adds "CALLED ON NULL
INPUT" and "RETURNS NULL ON NULL INPUT".
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
At some point we should make the function list non static, but this
allows us to write tests for now.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This avoids allocating a std::vector and is more flexible since the
iterator can be passed to erase.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This is a simple wrapper that allows code that is not in the types
hierarchy to visit a data_value.
Will be used by UDF.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Similar to "gossip: Limit number of pending gossip ACK messages", limit
the number of pending gossip ACK2 messages in gossiper::handle_ack_msg.
Fixes#5210
In a cross-dc large cluster, the receiver node of the gossip SYN message
might be slow to send the gossip ACK message. The ack messages can be
large if the payload of the application state is big, e.g.,
CACHE_HITRATES with a lot of tables. As a result, the unlimited ACK
message can consume unlimited amount of memory which causes OOM
eventually.
To fix, this patch queues the SYN message and handles it later if the
previous ACK message is still being sent. However, we only store the
latest SYN message. Since the latest SYN message from peer has the
latest information, so it is safe to drop the previous SYN message and
keep the latest one only. After this patch, there can be at most 1
pending SYN message and 1 pending ACK message per peer node.
Fixes#5210
Now that compaction returns only after the compacted sstables are
deleted we no longer need to stop the base to force waiting
for deletes (that were previously done asynchronously)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Use the [ScyllaDB user mailing list](https://groups.google.com/forum/#!forum/scylladb-users) or the [Slack workspace](http://slack.scylladb.com) for general questions and help.
# Reporting an issue
## Reporting an issue
Please use the [Issue Tracker](https://github.com/scylladb/scylla/issues/) to report issues. Fill in as much information as you can in the issue template, especially for performance problems.
# Contributing Code to Scylla
## Contributing Code to Scylla
To contribute code to Scylla, you need to sign the [Contributor License Agreement](http://www.scylladb.com/opensource/cla/) and send your changes as [patches](https://github.com/scylladb/scylla/wiki/Formatting-and-sending-patches) to the [mailing list](https://groups.google.com/forum/#!forum/scylladb-dev). We don't accept pull requests on GitHub.
To contribute code to Scylla, you need to sign the [Contributor License Agreement](https://www.scylladb.com/open-source/contributor-agreement/) and send your changes as [patches](https://github.com/scylladb/scylla/wiki/Formatting-and-sending-patches) to the [mailing list](https://groups.google.com/forum/#!forum/scylladb-dev). We don't accept pull requests on GitHub.
**Note**: Compiling Scylla requires, conservatively, 2 GB of memory per native
thread, and up to 3 GB per native thread while linking. GCC >= 8.1.1. is
thread, and up to 3 GB per native thread while linking. GCC >= 10 is
required.
Scylla is built with [Ninja](https://ninja-build.org/), a low-level rule-based system. A Python script, `configure.py`, generates a Ninja file (`build.ninja`) based on configuration options.
@@ -141,7 +153,7 @@ In v3:
"Tests: unit ({mode}), dtest ({smp})"
```
The usual is "Tests: unit (release)", although running debug tests is encouraged.
The usual is "Tests: unit (dev)", although running debug tests is encouraged.
5. When answering review comments, prefer inline quotes as they make it easier to track the conversation across multiple e-mails.
To get the build going quickly, Scylla offers a [frozen toolchain](tools/toolchain/README.md)
which would build and run Scylla using a pre-configured Docker image.
Using the frozen toolchain will also isolate all of the installed
dependencies in a Docker container.
Assuming you have met the toolchain prerequisites, which is running
Docker in user mode, building and running is as easy as:
## What is Scylla?
Scylla is the real-time big data database that is API-compatible with Apache Cassandra and Amazon DynamoDB.
Scylla embraces a shared-nothing approach that increases throughput and storage capacity to realize order-of-magnitude performance improvements and reduce hardware costs.
For more information, please see the [ScyllaDB web site].
[ScyllaDB web site]: https://www.scylladb.com
## Build Prerequisites
Scylla is fairly fussy about its build environment, requiring very recent
versions of the C++20 compiler and of many libraries to build. The document
[HACKING.md](HACKING.md) includes detailed information on building and
developing Scylla, but to get Scylla building quickly on (almost) any build
machine, Scylla offers a [frozen toolchain](tools/toolchain/README.md),
This is a pre-configured Docker image which includes recent versions of all
the required compilers, libraries and build tools. Using the frozen toolchain
allows you to avoid changing anything in your build machine to meet Scylla's
requirements - you just need to meet the frozen toolchain's prerequisites
(mostly, Docker or Podman being available).
## Building Scylla
Building Scylla with the frozen toolchain `dbuild` is as easy as:
* run Scylla with one CPU and ./tmp as data directory
This will start a Scylla node with one CPU core allocated to it and data files stored in the `tmp` directory.
The `--developer-mode` is needed to disable the various checks Scylla performs at startup to ensure the machine is configured for maximum performance (not relevant on development workstations).
Please note that you need to run Scylla with `dbuild` if you built it with the frozen toolchain.
seastar::metrics::description("number of operations via Alternator API"),{op(CamelCaseName)}),
#define OPERATION_LATENCY(name, CamelCaseName) \
seastar::metrics::make_histogram("op_latency", \
seastar::metrics::description("Latency histogram of an operation via Alternator API"),{op(CamelCaseName)},[this]{returnapi_operations.name.get_histogram(1,20);}),
seastar::metrics::description("Latency histogram of an operation via Alternator API"),{op(CamelCaseName)},[this]{returnto_metrics_histogram(api_operations.name);}),
"summary":"Checks that CDC streams reflect current cluster topology and regenerates them if not.",
"type":"void",
"nickname":"cdc_streams_check_and_repair",
"produces":[
"application/json"
],
"parameters":[]
}
]
},
{
"path":"/storage_service/snapshots",
"operations":[
@@ -582,7 +597,15 @@
},
{
"name":"kn",
"description":"Comma seperated keyspaces name to snapshot",
"description":"Comma seperated keyspaces name that their snapshot will be deleted",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"cf",
"description":"an optional table name that its snapshot will be deleted",
"required":false,
"allowMultiple":false,
"type":"string",
@@ -646,7 +669,7 @@
{
"method":"POST",
"summary":"Trigger a cleanup of keys on a single keyspace",
"type":"int",
"type":"long",
"nickname":"force_keyspace_cleanup",
"produces":[
"application/json"
@@ -678,7 +701,7 @@
{
"method":"GET",
"summary":"Scrub (deserialize + reserialize at the latest version, skipping bad rows if any) the given keyspace. If columnFamilies array is empty, all CFs are scrubbed. Scrubbed CFs will be snapshotted first, if disableSnapshot is false",
"type":"int",
"type":"long",
"nickname":"scrub",
"produces":[
"application/json"
@@ -726,7 +749,7 @@
{
"method":"GET",
"summary":"Rewrite all sstables to the latest version. Unlike scrub, it doesn't skip bad rows and do not snapshot sstables first.",
"type":"int",
"type":"long",
"nickname":"upgrade_sstables",
"produces":[
"application/json"
@@ -800,7 +823,7 @@
"summary":"Return an array with the ids of the currently active repairs",
"type":"array",
"items":{
"type":"int"
"type":"long"
},
"nickname":"get_active_repair_async",
"produces":[
@@ -810,13 +833,50 @@
}
]
},
{
"path":"/storage_service/repair_status/",
"operations":[
{
"method":"GET",
"summary":"Query the repair status and return when the repair is finished or timeout",
"type":"string",
"enum":[
"RUNNING",
"SUCCESSFUL",
"FAILED"
],
"nickname":"repair_await_completion",
"produces":[
"application/json"
],
"parameters":[
{
"name":"id",
"description":"The repair ID to check for status",
"required":true,
"allowMultiple":false,
"type":"long",
"paramType":"query"
},
{
"name":"timeout",
"description":"Seconds to wait before the query returns even if the repair is not finished. The value -1 or not providing this parameter means no timeout",
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.