Compare commits

..

181 Commits

Author SHA1 Message Date
Botond Dénes
c37f5938fd mutation_writer: feed_writer(): handle exceptions from consume_end_of_stream()
Currently the exception handling code of feed_writer() assumes
consume_end_of_stream() doesn't throw. This is false and an exception
from said method can currently lead to an unclean destroy of the writer
and reader. Fix by also handling exceptions from
consume_end_of_stream() too.

Closes #10147

(cherry picked from commit 1963d1cc25)
2022-03-03 10:45:40 +01:00
Yaron Kaikov
fa90112787 release: prepare for 4.4.9 2022-02-16 14:24:54 +02:00
Nadav Har'El
f5895e5c04 docker: don't repeat "--alternator-address" option twice
If the Docker startup script is passed both "--alternator-port" and
"--alternator-https-port", a combination which is supposed to be
allowed, it passes to Scylla the "--alternator-address" option twice.
This isn't necessary, and worse - not allowed.

So this patch fixes the scyllasetup.py script to only pass this
parameter once.

Fixes #10016.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220202165814.1700047-1-nyh@scylladb.com>
(cherry picked from commit cb6630040d)
2022-02-03 18:40:12 +02:00
Avi Kivity
ce944911f2 Update seastar submodule (gratuitous exceptions on allocation failure)
* seastar 59eeadc720...1fb2187322 (1):
  > core: memory: Avoid current_backtrace() on alloc failure when logging suppressed

Fixes #9982.
2022-01-30 20:08:43 +02:00
Avi Kivity
b220130e4a Revert "Merge 'scylla_raid_setup: use mdmonitor only when RAID level > 0' from Takuya ASADA"
This reverts commit de4f5b3b1f. This branch
doesn't support RAID 5, so it breaks at runtime.

Ref #9540.
2022-01-30 11:00:21 +02:00
Avi Kivity
de4f5b3b1f Merge 'scylla_raid_setup: use mdmonitor only when RAID level > 0' from Takuya ASADA
We found that monitor mode of mdadm does not work on RAID0, and it is
not a bug, expected behavior according to RHEL developer.
Therefore, we should stop enabling mdmonitor when RAID0 is specified.

Fixes #9540

----

This reverts 0d8f932 and introduce correct fix.

Closes #9970

* github.com:scylladb/scylla:
  scylla_raid_setup: use mdmonitor only when RAID level > 0
  Revert "scylla_raid_setup: workaround for mdmonitor.service issue on CentOS8"

(cherry picked from commit df22396a34)
2022-01-27 10:27:45 +02:00
Avi Kivity
84a42570ec Update tools/java submodule (maxPendingPerConnection default)
* tools/java 14e635e5de...e8accfbf45 (2):
  > Fix NullPointerException in SettingsMode
  > cassandra-stress: Remove maxPendingPerConnection default

Ref #7748.
2022-01-12 21:38:48 +02:00
Nadav Har'El
001f57ec0c alternator: allow Authorization header to be without spaces
The "Authorization" HTTP header is used in DynamoDB API to sign
requests. Our parser for this header, in server::verify_signature(),
required the different components of this header to be separated by
a comma followed by a whitespace - but it turns out that in DynamoDB
both spaces and commas are optional - one of them is enough.

At least one DynamoDB client library - the old "boto" (which predated
boto3) - builds this header without spaces.

In this patch we add a test that shows that an Authorization header
with spaces removed works fine in DynamoDB but didn't work in
Alternator, and after this patch modifies the parsing code for this
header, the test begins to pass (and the other tests show that the
previously-working cases didn't break).

Fixes #9568

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211101214114.35693-1-nyh@scylladb.com>
(cherry picked from commit 56eb994d8f)
2021-12-29 15:07:47 +02:00
Nadav Har'El
3279718d52 alternator: return the correct Content-Type header
Although the DynamoDB API responses are JSON, additional conventions apply
to these responses - such as how error codes are encoded in JSON. For this
reason, DynamoDB uses the content type `application/x-amz-json-1.0` instead
of the standard `application/json` in its responses.

Until this patch, Scylla used `application/json` in its responses. This
unexpected content-type didn't bother any of the AWS libraries which we
tested, but it does bother the aiodynamo library (see HENNGE/aiodynamo#27).

Moreover, we should return the x-amz-json-1.0 content type for future
proofing: It turns out that AWS already defined x-amz-json-1.1 - see:
https://awslabs.github.io/smithy/1.0/spec/aws/aws-json-1_1-protocol.html
The 1.1 content type differs (only) in how it encodes error replies.
If one day DynamoDB starts to use this new reply format (it doesn't yet)
and if DynamoDB libraries will need to differenciate between the two
reply formats, Alternator better return the right one.

This patch also includes a new test that the Content-Type header is
returned with the expected value. The test passes on DynamoDB, and
after this patch it starts to pass on Alternator as well.

Fixes #9554.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211031094621.1193387-1-nyh@scylladb.com>
(cherry picked from commit 6ae0ea0c48)
2021-12-29 14:14:24 +02:00
Takuya ASADA
c128994f90 scylla_raid_setup: workaround for mdmonitor.service issue on CentOS8
On CentOS8, mdmonitor.service does not works correctly when using
mdadm-4.1-15.el8.x86_64 and later versions.
Until we find a solution, let's pinning the package version to older one
which does not cause the issue (4.1-14.el8.x86_64).

Fixes #9540

Closes #9782

(cherry picked from commit 0d8f932f0b)
2021-12-28 11:38:33 +02:00
Nadav Har'El
9af2e5ead1 Update Seastar module with additional backports
Backported an additional Seastar patch:

  > Merge 'metrics: Fix dtest->ulong conversion error' from Benny Halevy

Fixes #9794.
2021-12-14 13:06:02 +02:00
Avi Kivity
be695a7353 Revert "cql3: Reject updates with NULL key values"
This reverts commit 146f7b5421. It
causes a regression, and needs an additional fix. The bug is not
important enough to merit this complication.

Ref #9311.
2021-12-08 15:17:45 +02:00
Botond Dénes
cc9285697d mutation_reader: shard_reader: ensure referenced objects are kept alive
The shard reader can outlive its parent reader (the multishard reader).
This creates a problem for lifecycle management: readers take the range
and slice parameters by reference and users keep these alive until the
reader is alive. The shard reader outliving the top-level reader means
that any background read-ahead that it has to wait on will potentially
have stale references to the range and the slice. This was seen in the
wild recently when the evictable reader wrapped by the shard reader hit
a use-after-free while wrapping up a background read-ahead.
This problem was solved by fa43d76 but any previous versions are
susceptible to it.

This patch solves this problem by having the shard reader copy and keep
the range and slice parameters in stable storage, before passing them
further down.

Fixes: #9719

Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20211202113910.484591-1-bdenes@scylladb.com>
(cherry picked from commit 417e853b9b)
2021-12-06 15:25:57 +02:00
Nadav Har'El
21d140febc alternator: add missing BatchGetItem metric
Unfortunately, defining metrics in Scylla requires some code
duplication, with the metrics declared in one place but exported in a
different place in the code. When we duplicated this code in Alternator,
we accidentally dropped the first metric - for BatchGetItem. The metric
was accounted in the code, but not exported to Prometheus.

In addition to fixing the missing metric, this patch also adds a test
that confirms that the BatchGetItem metric increases when the
BatchGetItem operation is used. This test failed before this patch, and
passes with it. The test only currently tests this for BatchGetItem
(and BatchWriteItem) but it can be later expanded to cover all the other
operations as well.

Fixes #9406

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210929121611.373074-1-nyh@scylladb.com>
(cherry picked from commit 5cbe9178fd)
2021-12-06 12:45:34 +02:00
Yaron Kaikov
77e05ca482 release: prepare for 4.4.8 2021-12-05 21:52:32 +02:00
Eliran Sinvani
5375b8f1a1 testlib: close index_reader to avoid racing condition
In order to avoid race condition introduced in 9dce1e4 the
index_reader should be closed prior to it's destruction.
This only exposes 4.4 and earlier releases to this specific race.
However, it is always a good idea to first close the index reader
and only then destroy it since it is most likely to be assumed by
all developers that will change the reader index in the future.

Ref #9704 (because on 4.4 and earlier releases are vulnerable).

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>

Fixes #9704

(cherry picked from commit ddd7248b3b)

Closes #9717
2021-12-05 12:02:13 +01:00
Juliusz Stasiewicz
7a82432e38 transport: Fix abort on certain configurations of native_transport_port(_ssl)
The reason was accessing the `configs` table out of index. Also,
native_transport_port-s can no longer be disabled by setting to 0,
as per the table below.

Rules for port/encryption (the same apply to shard_aware counterpart):

np  := native_transport_port.is_set()
nps := native_transport_port_ssl.is_set()
ceo := ceo.at("enabled") == "true"
eq  := native_transport_port_ssl() == native_transport_port()

+-----+-----+-----+-----+
|  np | nps | ceo |  eq |
+-----+-----+-----+-----+
|  0  |  0  |  0  |  *  |   =>   listen on native_transport_port, unencrypted
|  0  |  0  |  1  |  *  |   =>   listen on native_transport_port, encrypted
|  0  |  1  |  0  |  *  |   =>   nonsense, don't listen
|  0  |  1  |  1  |  *  |   =>   listen on native_transport_port_ssl, encrypted
|  1  |  0  |  0  |  *  |   =>   listen on native_transport_port, unencrypted
|  1  |  0  |  1  |  *  |   =>   listen on native_transport_port, encrypted
|  1  |  1  |  0  |  *  |   =>   listen on native_transport_port, unencrypted
|  1  |  1  |  1  |  0  |   =>   listen on native_transport_port, unencrypted + native_transport_port_ssl, encrypted
|  1  |  1  |  1  |  1  |   =>   native_transport_port(_ssl), encrypted
+-----+-----+-----+-----+

Fixes #7783
Fixes #7866

Closes #7992

(cherry picked from commit 29e4737a9b)
2021-11-29 17:37:31 +02:00
Dejan Mircevski
146f7b5421 cql3: Reject updates with NULL key values
We were silently ignoring INSERTs with NULL values for primary-key
columns, which Cassandra rejects.  Fix it by rejecting any
modification_statement that would operate on empty partition or
clustering range.

This is the most direct fix, because range and slice are calculated in
one place for all modification statements.  It covers not only NULL
cases, but also impossible restrictions like c>0 AND c<0.
Unfortunately, Cassandra doesn't treat all modification statements
consistently, so this fix cannot fully match its behavior.  We err on
the side of tolerance, accepting some DELETE statements that Cassandra
rejects.  We add a TODO for rejecting such DELETEs later.

Fixes #7852.

Tests: unit (dev), cql-pytest against Cassandra 4.0

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #9286

(cherry picked from commit 1fdaeca7d0)
2021-11-29 17:31:54 +02:00
Nadav Har'El
e1c7a906f0 cql: fix error return from execution of fromJson() and other functions
As reproduced in cql-pytest/test_json.py and reported in issue #7911,
failing fromJson() calls should return a FUNCTION_FAILURE error, but
currently produce a generic SERVER_ERROR, which can lead the client
to think the server experienced some unknown internal error and the
query can be retried on another server.

This patch adds a new cassandra_exception subclass that we were missing -
function_execution_exception - properly formats this error message (as
described in the CQL protocol documentation), and uses this exception
in two cases:

1. Parse errors in fromJson()'s parameters are converted into a
   function_execution_exception.

2. Any exceptions during the execute() of a native_scalar_function_for
   function is converted into a function_execution_exception.
   In particular, fromJson() uses a native_scalar_function_for.

   Note, however, that functions which already took care to produce
   a specific Cassandra error, this error is passed through and not
   converted to a function_execution_exception. An example is
   the blobAsText() which can return an invalid_request error, so
   it is left as such and not converted. This also happens in Cassandra.

All relevant tests in cql-pytest/test_json.py now pass, and are
no longer marked xfail. This patch also includes a few more improvements
to test_json.py.

Fixes #7911

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210118140114.4149997-1-nyh@scylladb.com>
(cherry picked from commit 702b1b97bf)
2021-11-29 16:59:56 +02:00
Piotr Jastrzebski
c5d6e75db8 sstables: Fix writing KA/LA sstables index
Before this patch when writing an index block, the sstables writer was
storing range tombstones that span the boundary of the block in order
of end bounds. This led to a range tombstone being ignored by a reader
if there was a row tombstone inside it.

This patch sorts the range tombstones based on start bound before
writing them to the index file.

The assumption is that writing an index block is rare so we can afford
sortting the tombstones at that point. Additionally this is a writer of
an old format and writing to it will be dropped in the next major
release so it should be rarely used already.

Kudos to Kamil Braun <kbraun@scylladb.com> for finding the reproducer.

Test: unit(dev)

Fixes #9690

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
(cherry picked from commit scylladb/scylla-enterprise@eb093afd6f)
(cherry picked from commit ab425a11a8)
2021-11-28 11:09:39 +02:00
Tomasz Grabiec
da630e80ea cql: Fix missing data in indexed queries with base table short reads
Indexed queries are using paging over the materialized view
table. Results of the view read are then used to issue reads of the
base table. If base table reads are short reads, the page is returned
to the user and paging state is adjusted accordingly so that when
paging is resumed it will query the view starting from the row
corresponding to the next row in the base which was not yet
returned. However, paging state's "remaining" count was not reset, so
if the view read was exhausted the reading will stop even though the
base table read was short.

Fix by restoring the "remaining" count when adjusting the paging state
on short read.

Tests:

  - index_with_paging_test
  - secondary_index_test

Fixes #9198
Message-Id: <20210818131840.1160267-1-tgrabiec@scylladb.com>

(cherry picked from commit 1e4da2dcce)
2021-11-23 11:22:30 +02:00
Takuya ASADA
8ea1cbe78d docker: add stopwaitsecs
We need stopwaitsecs just like we do TimeoutStpSec=900 on
scylla-server.service, to avoid timeout on scylla-server shutdown.

Fixes #9485

Closes #9545

(cherry picked from commit c9499230c3)
2021-11-15 13:36:48 +02:00
Asias He
03b04d40f2 gossip: Fix use-after-free in real_mark_alive and mark_dead
In commit 11a8912093 (gossiper:
get_gossip_status: return string_view and make noexcept)
get_gossip_status returns a pointer to an endpoint_state in
endpoint_state_map.

After commit 425e3b1182 (gossip: Introduce
direct failure detector), gossiper::mark_dead and gossiper::real_mark_alive
can yield in the middle of the function. It is possible that
endpoint_state can be removed, causing use-after-free to access it.

To fix, make a copy before we yield.

Fixes #8859

Closes #8862

(cherry picked from commit 7a32cab524)
2021-11-15 13:23:11 +02:00
Takuya ASADA
175d004513 scylla_util.py: On is_gce(), return False when it's on GKE
GKE metadata server does not provide same metadata as GCE, we should not
return True on is_gce().
So try to fetch machine-type from metadata server, return False if it
404 not found.

Fixes #9471

Signed-off-by: Takuya ASADA <syuu@scylladb.com>

Closes #9582

(cherry picked from commit 9b4cf8c532)
2021-11-15 13:18:13 +02:00
Asias He
091b794742 repair: Return HTTP 400 when repiar id is not found
There are two APIs for checking the repair status and they behave
differently in case the id is not found.

```
{"host": "192.168.100.11:10001", "method": "GET", "uri":
"/storage_service/repair_async/system_auth?id=999", "duration": "1ms",
"status": 400, "bytes": 49, "dump": "HTTP/1.1 400 Bad
Request\r\nContent-Length: 49\r\nContent-Type: application/json\r\nDate:
Wed, 03 Nov 2021 10:49:33 GMT\r\nServer: Seastar
httpd\r\n\r\n{\"message\": \"unknown repair id 999\", \"code\": 400}"}

{"host": "192.168.100.11:10001", "method": "GET", "uri":
"/storage_service/repair_status?id=999&timeout=1", "duration": "0ms",
"status": 500, "bytes": 49, "dump": "HTTP/1.1 500 Internal Server
Error\r\nContent-Length: 49\r\nContent-Type: application/json\r\nDate:
Wed, 03 Nov 2021 10:49:33 GMT\r\nServer: Seastar
httpd\r\n\r\n{\"message\": \"unknown repair id 999\", \"code\": 500}"}
```

The correct status code is 400 as this is a parameter error and should
not be retried.

Returning status code 500 makes smarter http clients retry the request
in hopes of server recovering.

After this patch:

curl -X PGET
'http://127.0.0.1:10000/storage_service/repair_async/system_auth?id=9999'
{"message": "unknown repair id 9999", "code": 400}

curl -X GET
'http://127.0.0.1:10000/storage_service/repair_status?id=9999'
{"message": "unknown repair id 9999", "code": 400}

Fixes #9576

Closes #9578

(cherry picked from commit f5f5714aa6)
2021-11-15 13:16:08 +02:00
Calle Wilund
8be87bb0b1 cdc: fix broken function signature in maybe_back_insert_iterator
Fixes #9103

compare overload was declared as "bool" even though it is a tri-cmp.
causes us to never use the speed-up shortcut (lessen search set),
in turn meaning more overhead for collections.

Closes #9104

(cherry picked from commit 59555fa363)
2021-11-15 13:13:51 +02:00
Takuya ASADA
a84142705a scylla_io_setup: handle nr_disks on GCP correctly
nr_disks is int, should not be string.

Fixes #9429

Closes #9430

(cherry picked from commit 3b798afc1e)
2021-11-15 13:06:40 +02:00
Michał Chojnowski
fc32534aee utils: fragment_range: fix FragmentedView utils for views with empty fragments
The copying and comparing utilities for FragmentedView are not prepared to
deal with empty fragments in non-empty views, and will fall into an infinite
loop in such case.
But data coming in result_row_view can contain such fragments, so we need to
fix that.

Fixes #8398.

Closes #8397

(cherry picked from commit f23a47e365)
2021-11-15 12:57:21 +02:00
Hagit Segev
4e526ad88a release: prepare for 4.4.7 2021-11-14 19:54:05 +02:00
Avi Kivity
176f253aa3 build: clobber user/group info from node_exporter tarball
node_exporter is packaged with some random uid/gid in the tarball.
When extracting it as an ordinary user this isn't a problem, since
the uid/gid are reset to the current user, but that doesn't happen
under dbuild since `tar` thinks the current user is root. This causes
a problem if one wants to delete the build directory later, since it
becomes owned by some random user (see /etc/subuid)

Reset the uid/gid infomation so this doesn't happen.

Closes #9579

Fixes #9610.

(cherry picked from commit e1817b536f)
2021-11-10 14:19:28 +02:00
Nadav Har'El
c49cd5d9b6 alternator: fix bug in ReturnValues=ALL_NEW
This patch fixes a bug in UpdateItem's ReturnValues=ALL_NEW, which in
some cases returned the OLD (pre-modification) value of some of the
attributes, instead of its NEW value.

The bug was caused by a confusion in our JSON utility function,
rjson::set(), which sounds like it can set any member of a map, but in
fact may only be used to add a *new* member - if a member with the same
name (key) already existed, the result is undefined (two values for the
same key). In ReturnValues=ALL_NEW we did exactly this: we started with
a copy of the original item, and then used set() to override some of the
members. This is not allowed.

So in this patch, we introduce a new function, rjson::replace(), which
does what we previously thought that rjson::set() does - i.e., replace a
member if it exists, or if not, add it. We call this function in
the ReturnValues=ALL_NEW code.

This patch also adds a test case that reproduces the incorrect ALL_NEW
results - and gets fixed by this patch.

In an upcoming patch, we should rename the confusingly-named set()
functions and audit all their uses. But we don't do this in this patch
yet. We just add some comments to clarify what set() does - but don't
change it, and just add one new function for replace().

Fixes #9542

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211104134937.40797-1-nyh@scylladb.com>
(cherry picked from commit b95e431228)
2021-11-08 14:10:36 +02:00
Dejan Mircevski
5d4abb521b types: Unreverse tuple subtype for serialization
When a tuple value is serialized, we go through every element type and
use it to serialize element values.  But an element type can be
reversed, which is artificially different from the type of the value
being read.  This results in a server error due to the type mismatch.
Fix it by unreversing the element type prior to comparing it to the
value type.

Fixes #7902

Tests: unit (dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #8316

(cherry picked from commit 318f773d81)
2021-11-07 19:25:43 +02:00
Asias He
cfc2562dec storage_service: Abort restore_replica_count when node is removed from the cluster
Consider the following procedure:

- n1, n2, n3
- n3 is down
- n1 runs nodetool removenode uuid_of_n3 to removenode from n3 the
  cluster
- n1 is down in the middle of removenode operation

Node n1 will set n3 to removing gossip status during removenode
operation. Whenever existing nodes learn a node is in removing gossip
status, they will call restore_replica_count to stream data from other
nodes for the ranges n3 loses if n3 was removed from the cluster. If
the streaming fails, the streaming will sleep and retry. The current
max number of retry attempts is 5. The sleep interval starts at 60
seconds and increases 1.5 times per sleep.

This can leave the cluster in a bad state. For example, nodes can go
out of disk space if the streaming continues.  We need a way to abort
such streaming attempts.

To abort the removenode operation and forcely remove the node, users
can run `nodetool removenode force` on any existing nodes to move the
node from removing gossip status to removed gossip status. However,
the restore_replica_count will not be aborted.

In this patch, a status checker is added in restore_replica_count, so
that once a node is in removed gossip status, restore_replica_count
will be aborted.

This patch is for older releases without the new NODE_OPS_CMD
infrastructure where such abort will happen automatically in case of
error.

Fixes #8651

Closes #8655

(cherry picked from commit 0858619cba)
2021-11-02 17:26:35 +02:00
Benny Halevy
4a1171e2fa large_data_handle: add sstable name to log messages
Although the sstable name is part of the system.large_* records,
it is not printed in the log.
In particular, this is essential for the "too many rows" warning
that currently does not record a row in any large_* table
so we can't correlate it with a sstable.

Fixes #9524

Test: unit(dev)
DTest: wide_rows_test.py

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211027074104.1753093-1-bhalevy@scylladb.com>
(cherry picked from commit a21b1fbb2f)
2021-10-29 10:48:35 +03:00
Asias He
542a508c50 repair: Handle everywhere_topology in bootstrap_with_repair
The everywhere_topology returns the number of nodes in the cluster as
RF. This makes only streaming from the node losing the range impossible
since no node is losing the range after bootstrap.

Shortcut to stream from all nodes in local dc in case the keyspace is
everywhere_topology.

Fixes #8503

(cherry picked from commit 3c36517598)
2021-10-28 18:56:23 +03:00
Hagit Segev
dd018d4de4 release: prepare for 4.4.6 2021-10-28 18:00:13 +03:00
Benny Halevy
70098a1991 date_tiered_manifest: get_now: fix use after free of sstable_list
The sstable_list is destroyed right after the temporary
lw_shared_ptr<sstable_list> returned from `cf.get_sstables()`
is dereferenced.

Fixes #9138

Test: unit(dev)
DTest: resharding_test.py:ReshardingTombstones_with_DateTieredCompactionStrategy.disable_tombstone_removal_during_reshard_test (debug)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210804075813.42526-1-bhalevy@scylladb.com>
(cherry picked from commit 3ad0067272)
2021-10-28 11:24:03 +03:00
Jan Ciolek
008f2ff370 cql3: Fix need_filtering on indexed table
There were cases where a query on an indexed table
needed filtering but need_filtering returned false.

This is fixed by using new conditions in cases where
we are using an index.

Fixes #8991.
Fixes #7708.

For now this is an overly conservative implementation
that returns true in some cases where filtering
is not needed.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
(cherry picked from commit 54149242b4)
2021-10-28 11:24:03 +03:00
Benny Halevy
f71cdede5e bytes_ostream: max_chunk_size: account for chunk header
Currently, if the data_size is greater than
max_chunk_size - sizeof(chunk), we end up
allocating up to max_chunk_size + sizeof(chunk) bytes,
exceeding buf.max_chunk_size().

This may lead to allocation failures, as seen in
https://github.com/scylladb/scylla/issues/7950,
where we couldn't allocate 131088 (= 128K + 16) bytes.

This change adjusted the expose max_chunk_size()
to be max_alloc_size (128KB) - sizeof(chunk)
so that the allocated chunks would normally be allocated
in 128KB chunks in the write() path.

Added a unit test - test_large_placeholder that
stresses the chunk allocation path from the
write_place_holder(size) entry point to make
sure it handles large chunk allocations correctly.

Refs #7950
Refs #8081

Test: unit(release), bytes_ostream_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210303143413.902968-1-bhalevy@scylladb.com>
(cherry picked from commit ff5b42a0fa)
2021-10-28 11:24:03 +03:00
Botond Dénes
0fd17af2ee evictable_reader: reset _range_override after fast-forwarding
`_range_override` is used to store the modified range the reader reads
after it has to be recreated (when recreating a reader it's read range
is reduced to account for partitions it already read). When engaged,
this field overrides the `_pr` field as the definitive range the reader
is supposed to be currently reading. Fast forwarding conceptually
overrides the range the reader is currently reading, however currently
it doesn't reset the `_range_override` field. This resulted in
`_range_override` (containing the modified pre-fast-forward range)
incorrectly overriding the fast-forwarded-to range in `_pr` when
validating the first partition produced by the just recreated reader,
resulting in a false-positive validation failure.

Fixes: #8059

Tests: unit(release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210217164744.420100-1-bdenes@scylladb.com>
[avi: add #include]
(cherry picked from commit c3b4c3f451)
2021-10-28 11:12:01 +03:00
Benny Halevy
77cb6596c4 utils: phased_barrier: advance_and_await: make noexcept
As a function returning a future, simplify
its interface by handling any exceptions and
returning an exceptional future instead of
propagating the exception.

In this specific case, throwing from advance_and_await()
will propagate through table::await_pending_* calls
short-circuiting a .finally clause in table::stop().

Also, mark as noexcept methods of class table calling
advance_and_await and table::await_pending_ops that depends on them.

Fixes #8636

A followup patch will convert advance_and_await to a coroutine.
This is done separately to facilitate backporting of this patch.

Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210511161407.218402-1-bhalevy@scylladb.com>
(cherry picked from commit c0dafa75d9)
2021-10-13 12:26:12 +03:00
Avi Kivity
c81c7d2d89 Merge 'rjson: Add throwing allocator' from Piotr Sarna
This series adds a wrapper for the default rjson allocator which throws on allocation/reallocation failures. It's done to work around several rapidjson (the underlying JSON parsing library) bugs - in a few cases, malloc/realloc return value is not checked, which results in dereferencing a null pointer (or an arbitrary pointer computed as 0 + `size`, with the `size` parameter being provided by the user). The new allocator will throw an `rjson:error` if it fails to allocate or reallocate memory.
This series comes with unit tests which checks the new allocator behavior and also validates that an internal rapidjson structure which we indirectly rely upon (Stack) is not left in invalid state after throwing. The last part is verified by the fact that its destructor ran without errors.

Fixes #8521
Refs #8515

Tests:
 * unit(release)
 * YCSB: inserting data similar to the one mentioned in #8515 - 1.5MB objects clustered in partitions 30k objects in size - nothing crashed during various YCSB workloads, but nothing also crashed for me locally before this patch, so it's not 100% robust
 relevant YCSB workload config for using 1.5MB objects:
```yaml
fieldcount=150
fieldlength=10000
```

Closes #8529

* github.com:scylladb/scylla:
  test: add a test for rjson allocation
  test: rename alternator_base64_test to alternator_unit_test
  rjson: add a throwing allocator

(cherry picked from commit c36549b22e)
2021-10-12 13:57:15 +03:00
Benny Halevy
b3a762f179 streaming: stream_session: do not escape curly braces in format strings
Those turn into '{}' in the formatted strings and trigger
a logger error in the following sstlog.warn(err.c_str())
call.

Fixes #8436

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210408173048.124417-1-bhalevy@scylladb.com>
(cherry picked from commit 76cd315c42)
2021-10-12 13:49:24 +03:00
Calle Wilund
2bba07bdf4 table: ensure memtable is actually in memtable list before erasing
Fixes #8749

if a table::clear() was issued while we were flushing a memtable,
the memtable is already gone from list. We need to check this before
erase. Otherwise we get random memory corruption via
std::vector::erase

v2:
* Make interface more set-like (tolerate non-existance in erase).

Closes #8904

(cherry picked from commit 373fa3fa07)
2021-10-12 13:47:33 +03:00
Benny Halevy
87bfb57ccf utils: merge_to_gently: prevent stall in std::copy_if
std::copy_if runs without yielding.

See https://github.com/scylladb/scylla/issues/8897#issuecomment-867522480

Note that the standard states that no iterators or references are invalidated
on insert so we can keep inserting before last1 when merging the
remainder of list2 at the tail of list1.

Fixes #8897

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 453e7c8795)
2021-10-12 13:05:58 +03:00
Michael Livshin
6ca8590540 avoid race between compaction and table stop
Also add a debug-only compaction-manager-side assertion that tests
that no new compaction tasks were submitted for a table that is being
removed (debug-only because not constant-time).

Fixes #9448.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Message-Id: <20211007110416.159110-1-michael.livshin@scylladb.com>
(cherry picked from commit e88891a8af)
2021-10-12 12:51:44 +03:00
Takuya ASADA
da57d6c7cd scylla_cpuscaling_setup: add --force option
To building Ubuntu AMI with CPU scaling configuration, we need force
running mode for scylla_cpuscaling_setup, which run setup without
checking scaling_governor support.

See scylladb/scylla-machine-image#204

Closes #9326

(cherry picked from commit f928dced0c)
2021-10-05 16:20:22 +03:00
Takuya ASADA
61469d62b8 scylla_ntp_setup: support 'pool' directive on ntp.conf
Currently, scylla_ntp_setup only supports 'server' directive, we should
support 'pool' too.

Fixes #9393

Closes #9397
2021-10-03 14:11:54 +03:00
Takuya ASADA
c63092038e scylla_cpuscaling_setup: disable ondemand.service on Ubuntu
On Ubuntu, scaling_governor becomes powersave after rebooted, even we configured cpufrequtils.
This is because ondemand.service, it unconditionally change scaling_governor to ondemand or powersave.
cpufrequtils will start before ondemand.service, scaling_governor overwrite by ondemand.service.
To configure scaling_governor correctly, we have to disable this service.

Fixes #9324

Closes #9325

(cherry picked from commit cd7fe9a998)
2021-10-03 14:08:37 +03:00
Raphael S. Carvalho
cb7fbb859b compaction_manager: prevent unbounded growth of pending tasks
There will be unbounded growth of pending tasks if they are submitted
faster than retiring them. That can potentially happen if memtables
are frequently flushed too early. It was observed that this unbounded
growth caused task queue violations as the queue will be filled
with tons of tasks being reevaluated. By avoiding duplication in
pending task list for a given table T, growth is no longer unbounded
and consequently reevaluation is no longer aggressive.

Refs #9331.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210930125718.41243-1-raphaelsc@scylladb.com>
(cherry picked from commit 52302c3238)
2021-10-03 13:11:14 +03:00
Yaron Kaikov
01920c1293 release: prepare for 4.4.5 2021-09-23 14:37:52 +03:00
Eliran Sinvani
fd64cae856 dist: rpm: Add specific versioning and python3 dependency
The Red Hat packages were missing two things, first the metapackage
wasn't dependant at all in the python3 package and second, the
scylla-server package dependencies didn't contain a version as part
of the dependency which can cause to some problems during upgrade.
Doing both of the things listed here is a bit of an overkill as either
one of them separately would solve the problem described in #XXXX
but both should be applied in order to express the correct concept.

Fixes #8829

Closes #8832

(cherry picked from commit 9bfb2754eb)
2021-09-12 16:01:15 +03:00
Calle Wilund
b1032a2699 snapshot: Add filter to check for existing snapshot
Fixes #8212

Some snapshotting operations call in on a single table at a time.
When checking for existing snapshots in this case, we should not
bother with snapshots in other tables. Add an optional "filter"
to check routine, which if non-empty includes tables to check.

Use case is "scrub" which calls with a limited set of tables
to snapshot.

Closes #8240

(cherry picked from commit f44420f2c9)
2021-09-12 11:16:12 +03:00
Avi Kivity
90941622df Merge "evictable_readers: don't drop static rows, drop assumption about snapshot isolation" from Botond
"
This mini-series fixes two loosely related bugs around reader recreation
in the evictable reader (related by both being around reader
recreation). A unit test is also added which reproduces both of them and
checks that the fixes indeed work. More details in the patches
themselves.
This series replaces the two independent patches sent before:
* [PATCH v1] evictable_reader: always reset static row drop flag
* [PATCH v1] evictable_reader: relax partition key check on reader
  recreation

As they depend on each other, it is easier to add a test if they are in
a series.

Fixes: #8923
Fixes: #8893

Tests: unit(dev, mutation_reader_test:debug)
"

* 'evictable-reader-recreation-more-bugs/v1' of https://github.com/denesb/scylla:
  test: mutation_reader_test: add more test for reader recreation
  evictable_reader: relax partition key check on reader recreation
  evictable_reader: always reset static row drop flag

(cherry picked from commit 4209dfd753)
2021-09-06 17:28:49 +03:00
Avi Kivity
4250ab27d8 Update seastar submodule (perftune failure on bond NIC)
* seastar 4b7d434965...4a58d76fea (1):
  > perftune.py: instrument bonding tuning flow with 'nic' parameter

Fixes #9225.
2021-08-19 16:59:41 +03:00
Takuya ASADA
475e0d0893 scylla_cpuscaling_setup: change scaling_governor path
On some environment /sys/devices/system/cpu/cpufreq/policy0/scaling_governor
does not exist even it supported CPU scaling.
Instead, /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor is
avaliable on both environment, so we should switch to it.

Fixes #9191

Closes #9193

(cherry picked from commit e5bb88b69a)
2021-08-12 12:10:15 +03:00
Raphael S. Carvalho
27333587a8 compaction: Prevent tons of compaction of fully expired sstable from happening in parallel
Compaction manager can start tons of compaction of fully expired sstable in
parallel, which may consume a significant amount of resources.
This problem is caused by weight being released too early in compaction, after
data is all compacted but before table is called to update its state, like
replacing sstables and so on.
Fully expired sstables aren't actually compacted, so the following can happen:
- compaction 1 starts for expired sst A with weight W, but there's nothing to
be compacted, so weight W is released, then calls table to update state.
- compaction 2 starts for expired sst B with weight W, but there's nothing to
be compacted, so weight W is released, then calls table to update state.
- compaction 3 starts for expired sst C with weight W, but there's nothing to
be compacted, so weight W is released, then calls table to update state.
- compaction 1 is done updating table state, so it finally completes and
releases all the resources.
- compaction 2 is done updating table state, so it finally completes and
releases all the resources.
- compaction 3 is done updating table state, so it finally completes and
releases all the resources.

This happens because, with expired sstable, compaction will release weight
faster than it will update table state, as there's nothing to be compacted.

With my reproducer, it's very easy to reach 50 parallel compactions on a single
shard, but that number can be easily worse depending on the amount of sstables
with fully expired data, across all tables. This high parallelism can happen
only with a couple of tables, if there are many time windows with expired data,
as they can be compacted in parallel.

Prior to 55a8b6e3c9, weight was released earlier in compaction, before
last sstable was sealed, but right now, there's no need to release weight
earlier. Weight can be released in a much simpler way, after the compaction is
actually done. So such compactions will be serialized from now on.

Fixes #8710.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210527165443.165198-1-raphaelsc@scylladb.com>

[avi: drop now unneeded storage_service_for_tests]

(cherry picked from commit a7cdd846da)
2021-08-10 18:16:47 +03:00
Nadav Har'El
0cfe0e8c8e secondary index: fix regression in CREATE INDEX IF NOT EXISTS
The recent commit 0ef0a4c78d added helpful
error messages in case an index cannot be created because the intended
name of its materialized view is already taken - but accidentally broke
the "CREATE INDEX IF NOT EXISTS" feature.

The checking code was correct, but in the wrong place: we need to first
check maybe the index already exists and "IF NOT EXISTS" was chosen -
and only do this new error checking if this is not the case.

This patch also includes a cql-pytest test for reproducing this bug.
The bug is also reproduced by the translated Cassandra unit tests
    cassandra_tests/validation/entities/secondary_index_test.py::
    testCreateAndDropIndex
and this is how I found this bug. After these patch, all these tests
pass.

Fixes #8717.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210526143635.624398-1-nyh@scylladb.com>
(cherry picked from commit 97e827e3e1)
2021-08-10 17:41:51 +03:00
Nadav Har'El
cb3225f2de Merge 'Fix index name conflicts with regular tables' from Piotr Sarna
When an index is created without an explicit name, a default name
is chosen. However, there was no check if a table with conflicting
name already exists. The check is now in place and if any conflicts
are found, a new index name is chosen instead.
When an index is created *with* an explicit name and a conflicting
regular table is found, index creation should simply fail.

This series comes with a test.

Fixes #8620
Tests: unit(release)

Closes #8632

* github.com:scylladb/scylla:
  cql-pytest: add regression tests for index creation
  cql3: fail to create an index if there is a name conflict
  database: check for conflicting table names for indexes

(cherry picked from commit cee4c075d2)
2021-08-10 12:56:52 +03:00
Hagit Segev
69daa9fd00 release: prepare for 4.4.4 2021-08-01 14:22:01 +03:00
Piotr Jastrzebski
f91cea66a6 api: use proper type to reduce partition count
Partition count is of a type size_t but we use std::plus<int>
to reduce values of partition count in various column families.
This patch changes the argument of std::plus to the right type.
Using std::plus<int> for size_t compiles but does not work as expected.
For example plus<int>(2147483648LL, 1LL) = -2147483647 while the code
would probably want 2147483649.

Fixes #9090

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>

Closes #9074

(cherry picked from commit 90a607e844)
2021-07-27 12:38:22 +03:00
Raphael S. Carvalho
9dce1e4b2b sstables: Close promoted index readers when advancing to next summary index
Problem fixed on master since 5ed559c. So branch-4.5 and up aren't affected.

Index reader fails to close input streams of promoted index readers when advancing
to next summary entry, so Scylla can abort as a result of a stream being destroyed
while there were reads in progress. This problem was seen when row cache issued
a fast forward, so index reader was asked to advance to next summary entry while
the previous one still had reads in progress.
By closing the list of index readers when there's only one owner holding it,
the problem is safely fixed, because it cannot happen that an index_bound like
_lower_bound or _upper_bound will be left with a list that's already closed.

Fixes #9049.

test: mode(dev, debug).

No observable perf regression:

BEFORE:

   read    skip      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
-> 1       0         8.168640            4    100000      12242        108      12262      11982    50032.2  50049    6403116   20707       0        0        8        8        0        0        0  83.3%
-> 1       1        22.257916            4     50000       2246          3       2249       2238   150025.0 150025    6454272  100001       0    49999   100000   149999        0        0        0  54.7%
-> 1       8         9.384961            4     11112       1184          5       1184       1178    77781.2  77781    1439328   66618   11111        1    33334    44444        0        0        0  44.0%
-> 1       16        4.976144            4      5883       1182          6       1184       1173    41180.0  41180     762053   35264    5882        0    17648    23530        0        0        0  44.1%
-> 1       32        2.582744            4      3031       1174          4       1175       1167    21216.0  21216     392619   18176    3031        0     9092    12122        0        0        0  43.8%
-> 1       64        1.308410            4      1539       1176          2       1178       1173    10772.0  10772     199353    9233    1539        0     4616     6154        0        0        0  44.0%
-> 1       256       0.331037            4       390       1178         12       1190       1165     2729.0   2729      50519    2338     390        0     1169     1558        0        0        0  44.0%
-> 1       1024      0.085108            4        98       1151          7       1155       1141      685.0    685      12694     587      98        0      293      390        0        0        0  42.9%
-> 1       4096      0.024393            6        25       1025          5       1029       1020      174.0    174       3238     149      25        0       74       98        0        0        0  37.4%
-> 64      1         8.765446            4     98462      11233         16      11236      11182    54642.0  54648    6405470   23632       1     1538     4615     4615        0        0        0  79.3%
-> 64      8         8.456430            4     88896      10512         48      10582      10464    55578.0  55578    6405971   24031    4166        0     5553     5553        0        0        0  77.3%
-> 64      16        7.798197            4     80000      10259        108      10299      10077    51248.0  51248    5922500   22160    4996        0     4998     4998        0        0        0  74.8%
-> 64      32        6.605148            4     66688      10096         64      10168      10033    42715.0  42715    4936359   18796    4164        0     4165     4165        0        0        0  75.5%
-> 64      64        4.933287            4     50016      10138         28      10189      10111    32039.0  32039    3702428   14106    3124        0     3125     3125        0        0        0  75.3%
-> 64      256       1.971701            4     20032      10160         57      10347      10103    12831.0  12831    1482993    5731    1252        0     1250     1250        0        0        0  74.1%
-> 64      1024      0.587026            4      5888      10030         84      10277       9946     3770.0   3770     435895    1635     368        0      366      366        0        0        0  74.6%
-> 64      4096      0.157401            4      1600      10165         69      10202       9698     1023.0   1023     118449     455     100        0       98       98        0        0        0  73.9%

AFTER:

   read    skip      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
-> 1       0         8.191639            4    100000      12208         46      12279      12161    50031.2  50025    6403108   20243       0        0        0        0        0        0        0  87.0%
-> 1       1        22.933121            4     50000       2180         36       2198       2115   150025.0 150025    6454272  100001       0    49999   100000   149999        0        0        0  54.9%
-> 1       8         9.471735            4     11112       1173          5       1178       1168    77781.2  77781    1439328   66663   11111        0    33334    44445        0        0        0  44.6%
-> 1       16        5.001569            4      5883       1176          2       1176       1170    41180.0  41180     762053   35296    5882        1    17648    23529        0        0        0  44.6%
-> 1       32        2.587069            4      3031       1172          1       1173       1164    21216.0  21216     392619   18185    3031        1     9092    12121        0        0        0  44.8%
-> 1       64        1.310747            4      1539       1174          3       1177       1171    10772.0  10772     199353    9233    1539        0     4616     6154        0        0        0  44.9%
-> 1       256       0.335490            4       390       1162          2       1167       1161     2729.0   2729      50519    2338     390        0     1169     1558        0        0        0  45.7%
-> 1       1024      0.081944            4        98       1196         21       1210       1162      685.0    685      12694     585      98        0      293      390        0        0        0  46.2%
-> 1       4096      0.022266            6        25       1123          3       1125       1105      174.0    174       3238     149      24        0       74       98        0        0        0  41.9%
-> 64      1         8.731741            4     98462      11276         45      11417      11231    54642.0  54640    6405470   23686       0     1538     4615     4615        0        0        0  80.2%
-> 64      8         8.396247            4     88896      10588         19      10596      10560    55578.0  55578    6405971   24275    4166        0     5553     5553        0        0        0  77.6%
-> 64      16        7.700995            4     80000      10388         88      10405      10221    51248.0  51248    5922500   22100    5000        0     4998     4998        0        0        0  76.4%
-> 64      32        6.517276            4     66688      10232         31      10342      10201    42715.0  42715    4936359   19013    4164        0     4165     4165        0        0        0  75.3%
-> 64      64        4.898669            4     50016      10210         60      10291      10150    32039.0  32039    3702428   14110    3124        0     3125     3125        0        0        0  74.4%
-> 64      256       1.969972            4     20032      10169         22      10173      10091    12831.0  12831    1482993    5660    1252        0     1250     1250        0        0        0  74.3%
-> 64      1024      0.575180            4      5888      10237         84      10316      10028     3770.0   3770     435895    1656     368        0      366      366        0        0        0  74.6%
-> 64      4096      0.158503            4      1600      10094         81      10195      10014     1023.0   1023     118449     460     100        0       98       98        0        0        0  73.5%

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210722180302.64675-1-raphaelsc@scylladb.com>
2021-07-25 14:03:04 +03:00
Asias He
99b8c04a40 repair: Consider memory bloat when calculate repair parallelism
The repair parallelism is calculated by the number of memory allocated to
repair and memory usage per repair instance. Currently, it does not
consider memory bloat issues (e.g., issue #8640) which cause repair to
use more memory and cause std::bad_alloc.

Be more conservative when calculating the parallelism to avoid repair
using too much memory.

Fixes #8641

Closes #8652

(cherry picked from commit b8749f51cb)
2021-07-15 13:02:01 +03:00
Takuya ASADA
74cd6928c0 scylla-fstrim.timer: drop BindsTo=scylla-server.service
To avoid restart scylla-server.service unexpectedly, drop BindsTo=
from scylla-fstrim.timer.

Fixes #8921

Closes #8973

(cherry picked from commit def81807aa)
2021-07-08 10:06:42 +03:00
Avi Kivity
a178098277 Update tools/java submodule for rack/dc properties
* tools/java aab793d9f5...14e635e5de (1):
  > cassandra.in.sh: Add path to rack/dc properties file to classpath
Fixes #7930.
2021-07-08 09:53:15 +03:00
Takuya ASADA
da1a9c7bc7 dist/redhat: fix systemd unit name of scylla-node-exporter
systemd unit name of scylla-node-exporter is
scylla-node-exporter.service, not node-exporter.service.

Fixes #8966

Closes #8967

(cherry picked from commit f19ebe5709)
2021-07-07 18:37:55 +03:00
Takuya ASADA
3666bb84a7 dist: stop removing /etc/systemd/system/*.mount on package uninstall
Listing /etc/systemd/system/*.mount as ghost file seems incorrect,
since user may want to keep using RAID volume / coredump directory after
uninstalling Scylla, or user may want to upgrade enterprise version.

Also, we mixed two types of files as ghost file, it should handle differently:
 1. automatically generated by postinst scriptlet
 2. generated by user invoked scylla_setup

The package should remove only 1, since 2 is generated by user decision.

However, just dropping .mount from %files section causes another
problem, rpm will remove these files during upgrade, instead of
uninstall (#8924).

To fix both problem, specify .mount files as "%ghost %config".
It will keep files both package upgrade and package remove.

See scylladb/scylla-enterprise#1780

Closes #8810
Closes #8924

Closes #8959

(cherry picked from commit f71f9786c7)
2021-07-07 18:37:55 +03:00
Pavel Emelyanov
df6d471e08 hasher: More picky noexcept marking of feed_hash()
Commit 5adb8e555c marked the ::feed_hash() and a visitor lambda of
digester::feed_hash() as noexcept. This was quite recklesl as the
appending_hash<>::operator()s called by ::feed_hash() are not all
marked noexcept. In particular, the appending_hash<row>() is not
such and seem to throw.

The original intent of the mentioned commit was to facilitate the
partition_hasher in repair/ code. The hasher itself had been removed
by the 0af7a22c21, so it no longer needs the feed_hash-s to be
noexcepts.

The fix is to inherit noexcept from the called hashers, but for the
digester::feed_hash part the noexcept is just removed until clang
compilation bug #50994 is fixed.

fixes: #8983
tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210706153608.4299-1-xemul@scylladb.com>
(cherry picked from commit 63a2fed585)
2021-07-07 18:36:18 +03:00
Raphael S. Carvalho
92b85da380 LCS: reshape: Fix overlapping check when determining if a sstable set is disjoint
Wrong comparison operator is used when checking for overlapping. It
would miss overlapping when last key of a sstable is equal to the first
key of another sstable that comes next in the set, which is sorted by
first key.

Fixes #8531.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 39ecddbd34)
2021-07-07 14:04:22 +03:00
Juliusz Stasiewicz
d214d91a09 tests: Adjusted tests for DC checking in NTS
CQL test relied on quietly acceptiong non-existing DCs, so it had
to be removed. Also, one boost-test referred to nonexisting
`datacenter2` and had to be removed.

(cherry picked from commit 97bb15b2f2)
2021-06-21 17:53:47 +03:00
Nadav Har'El
a7a1e59594 cql-pytest: remove "xfail" tag from two passing tests
Issue #7595 was already fixed last week, in commit
b6fb5ee912, so the two tests which failed
because of this issue no longer fail and their "xfail" tag can be removed.

Refs #7595.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210216160606.1172855-1-nyh@scylladb.com>
(cherry picked from commit 946e63ee6e)
2021-06-20 19:37:28 +03:00
Juliusz Stasiewicz
856aeb5ddb locator: Check DC names in NTS
The same trick is used as in C*:
79e693e16e/src/java/org/apache/cassandra/locator/NetworkTopologyStrategy.java (L241)

Fixes #7595

(cherry picked from commit b6fb5ee912)
2021-06-20 19:24:29 +03:00
Piotr Sarna
61659fdbdb Merge 'view: fix use-after-move when handling view update failures'
Backport of 6726fe79b6.

The code was susceptible to use-after-move if both local
and remote updates were going to be sent.
The whole routine for sending view updates is now rewritten
to avoid use-after-move.

Fixes #8830
Tests: unit(release),
       dtest(secondary_indexes_test.py:TestSecondaryIndexes.test_remove_node_during_index_build)

Closes #8834

* backport-6726fe7-4.4:
  view: fix use-after-move when handling view update failures
  db,view: explicitly move the mutation to its helper function
  db,view: pass base token by value to mutate_MV
2021-06-16 14:15:12 +02:00
Piotr Sarna
b06e9447b1 view: fix use-after-move when handling view update failures
The code was susceptible to use-after-move if both local
and remote updates were going to be sent.
The whole routine for sending view updates is now rewritten
to avoid use-after-move.

Refs #8830
Tests: unit(release),
       dtest(secondary_indexes_test.py:TestSecondaryIndexes.test_remove_node_during_index_build)

(cherry picked from commit 8a049c9116)
2021-06-16 13:40:57 +02:00
Piotr Sarna
6a407984d8 db,view: explicitly move the mutation to its helper function
The `apply_to_remote_endpoints` helper function used to take
its `mut` parameter by reference, but then moved the value from it,
which is confusing and prone to errors. Since the value is moved-from,
let's pass it to the helper function as rvalue ref explicitly.

(cherry picked from commit 7cdbb7951a)
2021-06-16 13:38:39 +02:00
Piotr Sarna
74df68c67f db,view: pass base token by value to mutate_MV
The base token is passed cross-continuations, so the current way
of passing it by const reference probably only works because the token
copying is cheap enough to optimize the reference out.
Fix by explicitly taking the token by value.

(cherry picked from commit 88d4a66e90)
2021-06-16 13:38:01 +02:00
Raphael S. Carvalho
a6b3a2b945 LCS: Fix terrible write amplification when reshaping level 0
LCS reshape is basically 'major compacting' level 0 until it contains less than
N sstables.

That produces terrible write amplification, because any given byte will be
compacted (initial # of sstables / max_threshold (32)) times. So if L0 initially
contained 256 ssts, there would be a WA of about 8.

This terrible write amplification can be reduced by performing STCS instead on
L0, which will leave L0 in a good shape without hurting WA as it happens
now.

Fixes #8345.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210322150655.27011-1-raphaelsc@scylladb.com>
(cherry picked from commit bcbb39999b)
2021-06-14 20:27:41 +03:00
Michał Chojnowski
2cf998e418 cdc: log: fix use-after-free in process_bytes_visitor
Due to small value optimization used in `bytes`, views to `bytes` stored
in `vector` can be invalidated when the vector resizes, resulting in
use-after-free and data corruption. Fix that.

Fixes #8117

(cherry picked from commit 8cc4f39472)
2021-06-13 19:06:25 +03:00
Botond Dénes
6a23208ce4 mutation_test: test_mutation_diff_with_random_generator: compact input mutations
This test checks that `mutation_partition::difference()` works correctly.
One of the checks it does is: m1 + m2 == m1 + (m2 - m1).
If the two mutations are identical but have compactable data, e.g. a
shadowable tombstone shadowed by a row marker, the apply will collapse
these, causing the above equality check to fail (as m2 - m1 is null).
To prevent this, compact the two input mutations.

Fixes: #8221
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210310141118.212538-1-bdenes@scylladb.com>
(cherry picked from commit cf28552357)
2021-06-13 18:30:43 +03:00
Takuya ASADA
94d73d2d26 scylla_coredump_setup: avoid coredump failure when hard limit of coredump is set to zero
On the environment hard limit of coredump is set to zero, coredump test
script will fail since the system does not generate coredump.
To avoid such issue, set ulimit -c 0 before generating SEGV on the script.

Note that scylla-server.service can generate coredump even ulimit -c 0
because we set LimitCORE=infinity on its systemd unit file.

Fixes #8238

Closes #8245

(cherry picked from commit af8eae317b)
2021-06-13 18:27:03 +03:00
Avi Kivity
033d56234b Update seastar submodule (nested exception logging)
* seastar 61939b5b8a...4b7d434965 (2):
  > utils/log.cc: fix nested_exception logging (again)
  > log: skip on unknown nested mixing instead of stopping the logging

Fixes #8327.
2021-06-13 18:22:59 +03:00
Benny Halevy
35d89298da test: commitlog_test: test_allocation_failure: fill memory using smaller allocations
commitlog was changed to use fragmented_temporary_buffer::ostream (db::commitlog::output).
So if there are discontiguous small memory blocks, they can be used to satisfy
an allocation even if no contiguous memory blocks are available.

To prevent that, as Avi suggested, this change allocates in 128K blocks
and frees the last one to succeed (so that we won't fail on allocating continuations).

Fixes #8028

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210203100333.862036-1-bhalevy@scylladb.com>
(cherry picked from commit ca6f5cb0bc)
2021-06-10 19:35:40 +03:00
Dejan Mircevski
8011d181b5 cql3: Skip indexed column for CK restrictions
When querying an index table, we assemble clustering-column
restrictions for that query by going over the base table token,
partition columns, and clustering columns.  But if one of those
columns is the indexed column, there is a problem; the indexed column
is the index table's partition key, not clustering key.  We end up
with invalid clustering slice, which can cause problems downstream.

Fix this by skipping the indexed column when assembling the clustering
restrictions.

Tests: unit (dev)

Fixes #7888

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #8320

(cherry picked from commit 0bd201d3ca)
2021-06-10 10:43:14 +03:00
Hagit Segev
bfafb84567 release: prepare for 4.4.3 2021-06-09 19:51:57 +03:00
Yaron Kaikov
0d9c09ed04 install.sh: Setup aio-max-nr upon installation
This is a follow up change to #8512.

Let's add aio conf file during scylla installation process and make sure
we also remove this file when uninstall Scylla

As per Avi Kivity's suggestion, let's set aio value as static
configuration, and make it large enough to work with 500 cpus.

Closes #8650

Refs: #8713

(cherry picked from commit dd453ffe6a)
2021-06-07 16:30:00 +03:00
Yaron Kaikov
36a4eba22e scylla_io_setup: configure "aio-max-nr" before iotune
On severl instance types in AWS and Azure, we get the following failure
during scylla_io_setup process:
```
ERROR 2021-04-14 07:50:35,666 [shard 5] seastar - Could not setup Async
I/O: Resource temporarily unavailable. The most common cause is not
enough request capacity in /proc/sys/fs/aio-max-nr. Try increasing that
number or reducing the amount of logical CPUs available for your
application
```

We have scylla_prepare:configure_io_slots() running before the
scylla-server.service start, but the scylla_io_setup is taking place
before

1) Let's move configure_io_slots() to scylla_util.py since both
   scylla_io_setup and scylla_prepare are import functions from it
2) cleanup scylla_prepare since we don't need the same function twice
3) Let's use configure_io_slots() during scylla_io_setup to avoid such
failure

Fixes: #8587

Closes #8512

Refs: #8713

(cherry picked from commit 588a065304)
2021-06-07 16:29:38 +03:00
Nadav Har'El
e9b1f10654 Update tools/java submodule with backported patches
* tools/java 6ca351c221...aab793d9f5 (2):
  > nodetool: alternate way to specify table name which includes a dot
  > nodetool: do no treat table name with dot as a secondary index

Fixes #6521

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-06-07 09:38:53 +03:00
Nadav Har'El
6057be3f42 alternator: fix equality check of nested document containing a set
In issue #5021 we noticed that the equality check in Alternator's condition
expressions needs to handle sets differently - we need to compare the set's
elements ignoring their order. But the implementation we added to fix that
issue was only correct when the entire attribute was a set... In the
general case, an attribute can be a nested document, with only some
inner set. The equality-checking function needs to tranverse this nested
document, and compare the sets inside it as appropriate. This is what
we do in this patch.

This patch also adds a new test comparing equality of a nested document with
some inner sets. This test passes on DynamoDB, failed on Alternator before
this patch, and passes with this patch.

Refs #5021
Fixes #8514

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210419184840.471858-1-nyh@scylladb.com>
(cherry picked from commit dae7528fe5)
2021-06-07 09:10:08 +03:00
Nadav Har'El
673f823d8b alternator: fix inequality check of two sets
In issue #5021 we noted that Alternator's equality operator needs to be
fixed for the case of comparing two sets, because the equality check needs
to take into account the possibility of different element order.

Unfortunately, we fixed only the equality check operator, but forgot there
is also an inequality operator!

So in this patch we fix the inequality operator, and also add a test for
it that was previously missing.

The implementation of the inequality operator is trivial - it's just the
negation of the equality test. Our pre-existing tests verify that this is
the correct implementation (e.g., if attribute x doesn't exist, then "x = 3"
is false but "x <> 3" is true).

Refs #5021
Fixes #8513

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210419141450.464968-1-nyh@scylladb.com>
(cherry picked from commit 50f3201ee2)
2021-06-07 08:45:54 +03:00
Nadav Har'El
0082968bd8 alternator: fix equality check of two unset attributes
When a condition expression (ConditionExpression, FilterExpression, etc.)
checks for equality of two item attributes, i.e., "x = y", and when one of
these attributes was missing we correctly returned false.
However, we also need to return false when *both* attributes are missing in
the item, because this is what DynamoDB does in this case. In other words
an unset attribute is never equal to anything - not even to another unset
attribute. This was not happening before this patch:

When x and y were both missing attributes, Alternator incorrectly returned
true for "x = y", and this patch fixes this case. It also fixes "x <> y"
which should to be true when both x and y are unset (but was false
before this patch).

The other comparison operators - <, <=, >, >=, BETWEEN, were all
implemented correctly even before this patch.

This patch also includes tests for all the two-unset-attribute cases of
all the operators listed above. As usual, we check that these tests pass
on both DynamoDB and Alternator to confirm our new behavior is the correct
one - before this patch, two of the new tests failed on Alternator and
passed on DynamoDB.

Fixes #8511

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210419123911.462579-1-nyh@scylladb.com>
(cherry picked from commit 46448b0983)
2021-06-06 16:28:27 +03:00
Takuya ASADA
542cd7aff1 scylla_raid_setup: use /dev/disk/by-uuid to specify filesystem
Currently, var-lib-scylla.mount may fails because it can start before
MDRAID volume initialized.
We may able to add "After=dev-disk-by\x2duuid-<uuid>.device" to wait for
device become available, but systemd manual says it automatically
configure dependency for mount unit when we specify filesystem path by
"absolute path of a device node".

So we need to replace What=UUID=<uuid> to What=/dev/disk/by-uuid/<uuid>.

Fixes #8279

Closes #8681

(cherry picked from commit 3d307919c3)
2021-05-24 17:24:07 +03:00
Raphael S. Carvalho
2b29568bf4 sstables/mp_row_consumer: Fix unbounded memory usage when consuming a large run of partition tombstones
mp_row_consumer will not stop consuming large run of partition
tombstones, until a live row is found which will allow the consumer
to stop proceeding. So partition tombstones, from a large run, are
all accumulated in memory, leading to OOM and stalls.
The fix is about stopping the consumer if buffer is full, to allow
the produced fragments to be consumed by sstable writer.

Fixes #8071.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210514202640.346594-1-raphaelsc@scylladb.com>


Upstream fix: db4b9215dd
2021-05-20 21:26:07 +03:00
Hagit Segev
93457807b8 release: prepare for 4.4.2 2021-05-20 00:02:31 +03:00
Takuya ASADA
cee62ab41b install.sh: apply correct file security context when copying files
Currently, unified installer does not apply correct file security context
while copying files, it causes permission error on scylla-server.service.
We should apply default file security context while copying files, using
'-Z' option on /usr/bin/install.

Also, because install -Z requires normalized path to apply correct security
context, use 'realpath -m <PATH>' on path variables on the script.

Fixes #8589

Closes #8602

(cherry picked from commit 60c0b37a4c)
2021-05-19 12:41:20 +03:00
Takuya ASADA
728a5e433f install.sh: fix not such file or directory on nonroot
Since we have added scylla-node-exporter, we needed to do 'install -d'
for systemd directory and sysconfig directory before copying files.

Fixes #8663

Closes #8664

(cherry picked from commit 6faa8b97ec)
2021-05-19 12:41:20 +03:00
Avi Kivity
9a2d4a7cc7 Merge 'Fix type checking in index paging' from Piotr Sarna
When recreating the paging state from an indexed query,
a bunch of panic checks were introduced to make sure that
the code is correct. However, one of the checks is too eager -
namely, it throws an error if the base column type is not equal
to the view column type. It usually works correctly, unless the
base column type is a clustering key with DESC clustering order,
in which case the type is actually "reversed". From the point of view
of the paging state generation it's not important, because both
types deserialize in the same way, so the check should be less
strict and allow the base type to be reversed.

Tests: unit(release), along with the additional test case
       introduced in this series; the test also passes
       on Cassandra

Fixes #8666

Closes #8667

* github.com:scylladb/scylla:
  test: add a test case for paging with desc clustering order
  cql3: relax a type check for index paging

(cherry picked from commit 593ad4de1e)
2021-05-19 12:41:05 +03:00
Takuya ASADA
cc050fd499 dist/redhat: stop using systemd macros, call systemctl directly
Fedora version of systemd macros does not work correctly on CentOS7,
since CentOS7 does not support "file trigger" feature.
To fix the issue we need to stop using systemd macros, call systemctl
directly.

See scylladb/scylla-jmx#94

Closes #8005

(cherry picked from commit 7b310c591e)
2021-05-18 13:50:07 +03:00
Raphael S. Carvalho
61145af5d9 compaction_manager: Don't swallow exception in procedure used by reshape and resharding
run_custom_job() was swallowing all exceptions, which is definitely
wrong because failure in a resharding or reshape would be incorrectly
interpreted as success, which means upper layer will continue as if
everything is ok. For example, ignoring a failure in resharding could
result in a shared sstable being left unresharded, so when that sstable
reaches a table, scylla would abort as shared ssts are no longer
accepted in the main sstable set.
Let's allow the exception to be propagated, so failure will be
communicated, and resharding and reshape will be all or nothing, as
originally intended.

Fixes #8657.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210515015721.384667-1-raphaelsc@scylladb.com>
(cherry picked from commit 10ae77966c)
2021-05-18 13:00:20 +03:00
Avi Kivity
11bd83e319 Update tools/jmx (rpm systemd macros)
* tools/jmx c510a56...7a101a0 (1):
  > dist/redhat: stop using systemd macros, call systemctl directly

Ref scylladb/jmx#94.
2021-05-13 18:24:52 +03:00
Raphael S. Carvalho
b58305d919 compaction_manager: Redefine weight for better control of parallel compactions
Compaction manager allows compaction of different weights to proceed in
parallel. For example, a small-sized compaction job can happen in parallel to a
large-sized one, but similar-sized jobs are serialized.

The problem is the current definition of weight, which is the log (base 4) of
total size (size of all sstables) of a job.

This is what we get with the current weight definition:
    weight=5	for sizes=[1K, 3K]
    weight=6	for sizes=[4K, 15K]
    weight=7	for sizes=[16K, 63K]
    weight=8	for sizes=[64K, 255K]
    weight=9	for sizes=[258K, 1019K]
    weight=10	for sizes=[1M, 3M]
    weight=11	for sizes=[4M, 15M]
    weight=12	for sizes=[16M, 63M]
    weight=13	for sizes=[64M, 254M]
    weight=14	for sizes=[256M, 1022M]
    weight=15	for sizes=[1033M, 4078M]
    weight=16	for sizes=[4119M, 10188M]
    total weights: 12

Note that for jobs smaller than 1MB, we have 5 different weights, meaning 5
jobs smaller than 1MB could proceed in parallel. High number of parallel
compactions can be observed after repair, which potentially produces tons of
small sstables of varying sizes. That causes compaction to use a significant
amount of resources.

To fix this problem, let's add a fixed tax to the size before taking the log,
so that jobs smaller than 1M will all have the same weight.

Look at what we get with the new weight definition:
    weight=10	for sizes=[1K, 2M]
    weight=11	for sizes=[3M, 14M]
    weight=12	for sizes=[15M, 62M]
    weight=13	for sizes=[63M, 254M]
    weight=14	for sizes=[256M, 1022M]
    weight=15	for sizes=[1033M, 4078M]
    weight=16	for sizes=[4119M, 10188M]
    total weights: 7

Fixes #8124.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210217123022.241724-1-raphaelsc@scylladb.com>
(cherry picked from commit 81d773e5d8)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210512224405.68925-1-raphaelsc@scylladb.com>
2021-05-13 08:38:40 +03:00
Lauro Ramos Venancio
065111b42b TWCS: initialize _highest_window_seen
The timestamp_type is an int64_t. So, it has to be explicitly
initialized before using it.

This missing inicialization prevented the major compactation
from happening when a time window finishes, as described in #8569.

Fixes #8569

Signed-off-by: Lauro Ramos Venancio <lauro.venancio@incognia.com>

Closes #8590

(cherry picked from commit 15f72f7c9e)
2021-05-06 08:52:15 +03:00
Nadav Har'El
ebd2c9bab0 Update tools/java submodule
Backport sstableloader fix in tools/java submodule.
Fixes #8230.

* tools/java a3e010ee4f...6ca351c221 (1):
  > sstableloader: Handle non-prepared batches with ":" in identifier names

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-05-03 10:08:54 +03:00
Avi Kivity
bf9e1f6d2e Merge '[branch 4.4] Backport reader_permit: always forward resources to the semaphore ' from Botond Dénes
This is a backport of 8aaa3a7 to branch-4.4. The main conflicts were around Benny's reader close series (fa43d76), but it also turned out that an additional patch (2f1d65c) also has to backported to make sure admission on signaling resources doesn't deadlock.

Refs: #8493

Closes #8571

* github.com:scylladb/scylla:
  test: mutation_reader_test: add test_reader_concurrency_semaphore_forward_progress
  test: mutation_reader_test: add test_reader_concurrency_semaphore_readmission_preserves_units
  reader_concurrency_semaphore: add dump_diagnostics()
  reader_permit: always forward resources
  test: multishard_mutation_query_test: fuzzy-test: don't consume resource up-front
  reader_concurrency_semaphore: make admission conditions consistent
2021-04-30 22:02:46 +03:00
Botond Dénes
a710866235 test: mutation_reader_test: add test_reader_concurrency_semaphore_forward_progress
This unit test checks that the semaphore doesn't get into a deadlock
when contended, in the presence of many memory-only reads (that don't
wait for admission). This is tested by simulating the 3 kind of reads we
currently have in the system:
* memory-only: reads that don't pass admission and only own memory.
* admitted: reads that pass admission.
* evictable: admitted reads that are furthermore evictable.

The test creates and runs a large number of these reads in parallel,
read kinds being selected randomly, then creates a watchdog which
kills the test if no progress is being made.

(cherry picked from commit 45d580f056)
2021-04-30 11:03:09 +03:00
Botond Dénes
3c3fc18777 test: mutation_reader_test: add test_reader_concurrency_semaphore_readmission_preserves_units
This unit test passes a read through admission again-and-again, just
like an evictable reader would be during its lifetime. When readmitted
the read sometimes has to wait and sometimes not. This is to check that
the readmitting a previously admitted reader doesn't leak any units.

(cherry picked from commit cadc26de38)
2021-04-30 11:03:09 +03:00
Botond Dénes
960f93383b reader_concurrency_semaphore: add dump_diagnostics()
Allow semaphore related tests to include a diagnostics printout in error
messages to help determine why the test failed.

(cherry picked from commit d246e2df0a)
2021-04-30 09:08:18 +03:00
Botond Dénes
1c0557c638 reader_permit: always forward resources
This commit conceptually reverts 4c8ab10. Said commit was meant to
prevent the scenario where memory-only permits -- those that don't pass
admission but still consume memory -- completely prevent the admission
of reads, possibly even causing a deadlock because a permit might even
blocks its own admission. The protection introduced by said commit
however proved to be very problematic. It made the status of resources
on the permit very hard to reason about and created loopholes via which
permits could accumulate without tracking or they could even leak
resources. Instead of continuing to patch this broken system, this
commit does away with this "protection" based on the observation that
deadlocks are now prevented anyway by the admission criteria introduced
by 0fe75571d9, which admits a read anyway when all the initial count
resources are available (meaning no admitted reader is alive),
regardless of availability of memory.
The benefits of this revert is that the semaphore now knows about all
the resources and is able to do its job better as it is not "lied to"
about resource by the permits. Furthermore the status of a permit's
resources is much simpler to reason about, there are no more loopholes
in unexpected state transitions to swallow/leak resources.
To prove that this revert is indeed safe, in the next commit we add
robust tests that stress test admission on a highly contested semaphore.
This patch also does away with the registered/admitted differentiation
of permits, as this doesn't make much sense anymore, instead these two
are unified into a single "active" state. One can always tell whether a
permit was admitted or not from whether it owns count resources anyway.

(cherry picked from commit caaa8ef59a)
2021-04-30 09:08:17 +03:00
Botond Dénes
f23052ae64 test: multishard_mutation_query_test: fuzzy-test: don't consume resource up-front
The fuzzy test consumes a large chunk of resource from the semaphore
up-front to simulate a contested semaphore. This isn't an accurate
simulation, because no permit will have more than 1 units in reality.
Furthermore this can even cause a deadlock since 8aaa3a7 as now we rely
on all count units being available to make forward progress when memory
is scarce.
This patch just cuts out this part of the test, we now have a dedicated
unit test for checking a heavily contested semaphore, that does it
properly, so no need to try to fix this clumsy attempt that is just
making trouble at this point.

Refs: #8493

Tests: release(multishard_mutation_query_test:fuzzy_test)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210429084458.40406-1-bdenes@scylladb.com>
(cherry picked from commit 26ae9555d1)
2021-04-30 08:57:12 +03:00
Botond Dénes
15a157611a reader_concurrency_semaphore: make admission conditions consistent
Currently there are two places where we check admission conditions:
`do_wait_admission()` and `signal()`. Both use `has_available_units()`
to check resource availability, but the former has some additional
resource related conditions on top (in `may_proceed()`), which lead to
the two paths working with slightly different conditions. To fix, push
down all resource availability related checks to `has_available_units()`
to ensure admission conditions are consistent across all paths.

(cherry picked from commit d90cd6402c)
2021-04-30 08:57:12 +03:00
Eliran Sinvani
d0b82e1e68 Materialized views: fix possibly old views comming from other nodes
Migration manager has a function to get a schema (for read or write),
this function queries a peer node and retrieves the schema from it. One
scenario where it can happen is if an old node, queries an old not fixed
index.
This makes a hole through which views that are only adjusted for reading
can slip through.

Here we plug the hole by fixing such views before they are registered.

Closes #8509

(cherry picked from commit 480a12d7b3)

Fixes #8554.
2021-04-29 14:03:03 +03:00
Botond Dénes
840ca41393 database: clear inactive reads in stop()
If any inactive read is left in the semaphore, it can block
`database::stop()` from shutting down, as sstables pinned by these reads
will prevent `sstables::sstables_manager::close()` from finishing. This
causes a deadlock.
It is not clear how inactive reads can be left in the semaphore, as all
users are supposed to clean up after themselves. Post 4.4 releases don't
have this problem anymore as the inactive read handle was made a RAII
object, removing the associated inactive read when destroyed. In 4.4 and
earlier release this wasn't so, so errors could be made. Normally this
is not a big issue, as these orphaned inactive reads are just evicted
when the resources they own are needed, but it does become a serious
issue during shutdown. To prevent a deadlock, clear the inactive reads
earlier, in `database::stop()` (currently they are cleared in the
destructor). This is a simple and foolproof way of ensuring any
leftover inactive reads don't cause problems.

Fixes: #8561

Tests: unit(dev)

Closes #8562
2021-04-28 19:32:46 +03:00
Takuya ASADA
07051f25f2 dist: increase fs.aio-max-nr value for other apps
Current fs.aio-max-nr value cpu_count() * 11026 is exact size of scylla
uses, if other apps on the environment also try to use aio, aio slot
will be run out.
So increase value +65536 for other apps.

Related #8133

Closes #8228

(cherry picked from commit 53c7600da8)
2021-04-25 16:15:25 +03:00
Takuya ASADA
8437f71b1b dist: tune fs.aio-max-nr based on the number of cpus
Current aio-max-nr is set up statically to 1048576 in
/etc/sysctl.d/99-scylla-aio.conf.
This is sufficient for most use cases, but falls short on larger machines
such as i3en.24xlarge on AWS that has 96 vCPUs.

We need to tune the parameter based on the number of cpus, instead of
static setting.

Fixes #8133

Signed-off-by: Takuya ASADA <syuu@scylladb.com>

Closes #8188

(cherry picked from commit d0297c599a)
2021-04-25 16:15:12 +03:00
Avi Kivity
9f32f5a60c Update seastar submodule (io_queue request size)
* seastar 37eb6022fc...61939b5b8a (1):
  > io_queue: Double max request size

Fixes #8496
2021-04-25 12:35:34 +03:00
Avi Kivity
910bc2417a Update seastar submodule (low bandwidth disks)
* seastar a75171fc89...37eb6022fc (2):
  > io_queue: Honor disks with tiny request rate
  > io_queue: Shuffle fair_group creation

Fixes #8378.
2021-04-21 14:02:15 +03:00
Piotr Jastrzebski
7790beb655 row_cache: remove redundant check in make_reader
This check is always true because a dummy entry is added at the end of
each cache entry. If that wasn't true, the check in else-if would be
an UB.

Refs #8435.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
(cherry picked from commit cb3dbb1a4b)
2021-04-20 13:53:23 +02:00
Piotr Jastrzebski
1379f141c2 cache_flat_mutation_reader: fix do_fill_buffer
Make sure that when a partition does not exist in underlying,
do_fill_buffer does not try to fast forward withing this nonexistent
partition.

Test: unit(dev)

Fixes #8435
Fixes #8411

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
(cherry picked from commit 1f644df09d)
2021-04-20 13:53:17 +02:00
Piotr Jastrzebski
d14ec86e7d read_context: add _partition_exists
This new state stores the information whether current partition
represented by _key is present in underlying.

Refs #8435.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
(cherry picked from commit ceab5f026d)
2021-04-20 13:53:10 +02:00
Piotr Jastrzebski
bbada5b9e4 read_context: remove skip_first_fragment arg from create_underlying
All callers pass false for its value so no need to keep it around.

Refs #8435.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
(cherry picked from commit b3b68dc662)
2021-04-20 13:53:01 +02:00
Piotr Jastrzebski
d73ec88916 read_context: skip first fragment in ensure_underlying
This was previously done in create_underlying but ensure_underlying is
a better place because we will add more related logic to this
consumption in the following patches.

Refs #8435.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
(cherry picked from commit 088a02aafd)
2021-04-20 13:52:46 +02:00
Kamil Braun
2efb458c7a time_series_sstable_set: return partition start if some sstables were ck-filtered out
When a particular partition exists in at least one sstable, the cache
expects any single-partition query to this partition to return a `partition_start`
fragment, even if the result is empty.

In `time_series_sstable_set::create_single_key_sstable_reader` it could
happen that all sstables containing data for the given query get
filtered out and only sstables without the relevant partition are left,
resulting in a reader which immediately returns end-of-stream (while it
should return a `partition_start` and if not in forwarding mode, a
`partition_end`). This commit fixes that.

We do it by extending the reader queue (used by the clustering reader
merger) with a `dummy_reader` which will be returned by the queue as
the very first reader. This reader only emits a `partition_start` and,
if not in forwarding mode, a `partition_end` fragment.

Fixes #8447.
Closes #8448.
(cherry picked from commit 5c7ed7a83f)
2021-04-20 13:52:34 +02:00
Kamil Braun
c05d8fcef1 clustering_order_reader_merger: handle empty readers
The merger could return end-of-stream if some (but not all) of the
underlying readers were empty (i.e. not even returning a
`partition_start`). This could happen in places where it was used
(`time_series_sstable_set::create_single_key_sstable_reader`) if we
opened an sstable which did not have the queried partition but passed
all the filters (specifically, the bloom filter returned a false
positive for this sstable).

The commit also extends the random tests for the merger to include empty
readers and adds an explicit test case that catches this bug (in a
limited scope: when we merge a single empty reader).

It also modifies `test_twcs_single_key_reader_filtering` (regression
test for #8432) because the time where the clustering key filter is
invoked changes (some invocations move from the constructor of the
merger to operator()). I checked manually that it still catches the bug
when I reintroduce it.

Fixes #8445.
Closes #8446.
(cherry picked from commit 7ffb0d826b)
2021-04-20 13:52:13 +02:00
Kamil Braun
d29960da47 sstables: fix TWCS single key reader sstable filter
The filter passed to `min_position_reader_queue`, which was used by
`clustering_order_reader_merger`, would incorrectly include sstables as
soon as they passed through the PK (bloom) filter, and would include
sstables which didn't pass the PK filter (if they passed the CK
filter). Fortunately this wouldn't cause incorrect data to be returned,
but it would cause sstables to be opened unnecessarily (these sstables
would immediately return eof), resulting in a performance drop. This commit
fixes the filter and adds a regression test which uses statistics to
check how many times the CK filter was invoked.

Fixes #8432.
Closes #8433.
(cherry picked from commit 3687757115)
2021-04-20 13:51:52 +02:00
Avi Kivity
e0d67ad6e4 Update seastar submodule (fair_queue fixes)
* seastar 2c884a7449...a75171fc89 (2):
  > fair_queue: Preempted requests got re-queued too far
  > fair_queue: Improve requests preemption while in pending state

Fixes #8296.
2021-04-14 15:40:45 +03:00
Hagit Segev
00da6b5e9e release: prepare for 4.4.1 2021-04-07 00:28:45 +03:00
Gleb Natapov
4200e52444 storage_proxy: do not crash on LOCAL_QUORUM access to a DC with zero replication
If a table that is not replicated to a certain DC (rf=0) is accessed
with LOCAL_QUORUM on that DC the current code will crash since the
'targets' array will be empty and read executor does not handle it.
Fix it by replying with empty result.

Fixes #8354

Message-Id: <YGro+l2En3fF80CO@scylladb.com>
(cherry picked from commit cd24dfc7e5)

[avi: re-added virtual keyword when backporting, since
      4.4 and below don't have 020da49c89]
2021-04-06 19:34:49 +03:00
Nadav Har'El
f5e402ea7a update tools/java submodule
Backport for refs #8390.

* tools/java 56470fda09...a3e010ee4f (1):
  > sstableloader: fix handling of rewritten partition

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-04-05 18:39:05 +03:00
Botond Dénes
05c6a40f05 result_memory_accounter: abort unpaged queries hitting the global limit
The `result_memory_accounter` terminates a query if it reaches either
the global or shard-local limit. This used to be so only for paged
queries, unpaged ones could grow indefinitely (until the node OOM'd).
This was changed in fea5067 which enforces the local limit on unpaged
queries as well, by aborting them. However a loophole remained in the
code: `result_memory_accounter::check_and_update()` has another stop
condition, besides `check_local_limit()`, it also checks the global
limit. This stop condition was not updated to enforce itself on unpaged
queries by aborting them, instead it silently terminated them, causing
them to return less data then requested. This was masked by most queries
reaching the local limit first.
This patch fixes this by aborting unpaged mutation queries when they hit
the global limit.

Fixes: #8162

Tests: unit(release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210226102202.51275-1-bdenes@scylladb.com>
(cherry picked from commit dd5a601aaa)
2021-03-24 13:00:33 +02:00
Nadav Har'El
2b3bc9f174 Merge 'Fix reading whole requests during shedding' from Piotr Sarna
When shedding requests (e.g. due to their size or number exceeding the
limits), errors were returned right after parsing their headers, which
resulted in their bodies lingering in the socket. The server always
expects a correct request header when reading from the socket after the
processing of a single request is finished, so shedding the requests
should also take care of draining their bodies from the socket.

Fixes #8193

Closes #8194

* github.com:scylladb/scylla:
  cql-pytest: add a shedding test
  transport: return error on correct stream during size shedding
  transport: return error on correct stream during shedding
  transport: skip the whole request if it is too large
  transport: skip the whole request during shedding

(cherry picked from commit 0fea089b37)
2021-03-24 12:49:57 +02:00
Piotr Sarna
4bfa605c38 Merge 'Fix inconsistencies in MV and SI (reworked)' from Eliran Sinvani
This is a reworked submission of #7686 which has been reverted.  This series
fixes some race conditions in MV/SI schema creation and load, we spotted some
places where a schema without a base table reference can sneak into the
registry. This can cause to an unrecoverable error since write commands with
those schemas can't be issued from other nodes. Most of those cases can occur on
2 main and uncommon cases, in a mixed cluster (during an upgrade) and in a small
window after a view or base table altering.

Fixes #7709

Closes #8091

* github.com:scylladb/scylla:
  database: Fix view schemas in place when loading
  global_schema_ptr: add support for view's base table
  materialized views: create view schemas with proper base table reference.
  materialized views: Extract fix legacy schema into its own logic

(cherry picked from commit d473bc9b06)
2021-03-24 12:25:26 +02:00
Tomasz Grabiec
dbb550e1a7 sstable: writer: ka/la: Write row marker cell after row tombstone
Row marker has a cell name which sorts after the row tombstone's start
bound. The old code was writing the marker first, then the row
tombstone, which is incorrect.

This was harmeless to our sstable reader, which recognized both as
belonging to the current clustering row fragment, and collects both
fine.

However, if both atoms trigger creation of promoted index blocks, the
writer will create a promoted index with entries wich violate the cell
name ordering. It's very unlikely to run into in practice, since to
trigger promoted index entries for both atoms, the clustering key
would be so large so that the size of the marker cell exceeds the
desired promoted index block size, which is 64KB by default (but
user-controlled via column_index_size_in_kb option). 64KB is also the
limit on clustering key size accepted by the system.

This was caught by one of our unit tests:

  sstable_conforms_to_mutation_source_test

...which runs a battery of mutation reader tests with various
desired promoted index block sizes, including the target size of 1
byte, which triggers an entry for every atom.

The test started to fail for some random seeds after commit ecb6abe
inside the
test_streamed_mutation_forwarding_is_consistent_with_slicing test
case, reporting a mutation mismatch in the following line:

    assert_that(*sliced_m).is_equal_to(*fwd_m, slice_with_ranges.row_ranges(*m.schema(), m.key()));

It compares mutations read from the same sstable using different
methods, slicing using clustering key restricitons, and fast
forwarding. The reported mismatch was that fwd_m contained the row
marker, but sliced_m did not. The sstable does contain the marker, so
both reads should return it.

After reverting the commit which introduced dynamic adjustments, the
test passes, but both mutations are missing the marker, both are
wrong!

They are wrong because the promoted index contians entries whose
starting positions violate the ordering, so binary search gets confused
and selects the row tombstone's position, which is emitted after the
marker, thus skipping over the row marker.

The explanation for why the test started to fail after dynamic
adjustements is the following. The promoted index cursor works by
incrementally parsing buffers fed by the file input stream. It first
parses the whole block and then does a binary search within the parsed
array. The entries which cursor touches during binary search depend on
the size of the block read from the file. The commit which enabled
dynamic adjustements causes the block size to be different for
subsequent reads, which allows one of the reads to walk over the
corrupted entries and read the correct data by selecting the entry
corresponding to the row marker.

Fixes #8324
Message-Id: <20210322235812.1042137-1-tgrabiec@scylladb.com>

(cherry picked from commit 9272e74e8c)
2021-03-24 10:38:54 +02:00
Avi Kivity
dffbcabbb1 Merge "mutation_writer: explicitly close writers" from Benny
"
_consumer_fut is expected to return an exception
on the abort path.  Wait for it and drop any exception
so it won't be abandoned as seen in #7904.

A future<> close() method was added to return
_consumer_fut.  It is called both after abort()
in the error path, and after consume_end_of_stream,
on the success path.

With that, consume_end_of_stream was made void
as it doesn't return a future<> anymore.

Fixes #7904
Test: unit(release)
"

* tag 'close-bucket-writer-v5' of github.com:bhalevy/scylla:
  mutation_writer: bucket_writer: add close
  mutation_writer/feed_writers: refactor bucket/shard writers
  mutation_writer: update bucket/shard writers consume_end_of_stream

(cherry picked from commit f11a0700a8)
2021-03-21 18:09:45 +02:00
Avi Kivity
a715c27a7f Merge 'cdc: Limit size of topology description' from Piotr Jastrzębski
Currently, whole topology description for CDC is stored in a single row.
This means that for a large cluster of strong machines (say 100 nodes 64
cpus each), the size of the topology description can reach 32MB.

This causes multiple problems. First of all, there's a hard limit on
mutation size that can be written to Scylla. It's related to commit log
block size which is 16MB by default. Mutations bigger than that can't be
saved. Moreover, such big partitions/rows cause reactor stalls and
negatively influence latency of other requests.

This patch limits the size of topology description to about 4MB. This is
done by reducing the number of CDC streams per vnode and can lead to CDC
data not being fully colocated with Base Table data on shards. It can
impact performance and consistency of data.

This is just a quick fix to make it easily backportable. A full solution
to the problem is under development.

For more details see #7961, #7993 and #7985.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>

Closes #8048

* github.com:scylladb/scylla:
  cdc: Limit size of topology description
  cdc: Extract create_stream_ids from topology_description_generator

(cherry picked from commit c63e26e26f)
2021-03-21 14:05:36 +02:00
Benny Halevy
6804332291 dist: scylla_util: prevent IndexError when no ephemeral_disks were found
Currently we call firstNvmeSize before checking that we have enough
(at least 1) ephemeral disks.  When none are found, we hit the following
error (see #7971):
```
File "/opt/scylladb/scripts/libexec/scylla_io_setup", line 239, in
if idata.is_recommended_instance():
File "/opt/scylladb/scripts/scylla_util.py", line 311, in is_recommended_instance
diskSize = self.firstNvmeSize
File "/opt/scylladb/scripts/scylla_util.py", line 291, in firstNvmeSize
firstDisk = ephemeral_disks[0]
IndexError: list index out of range
```

This change reverses the order and first checks that we found
enough disks before getting the fist disk size.

Fixes #7971

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #8027

(cherry picked from commit 55e3df8a72)
2021-03-21 12:19:24 +02:00
Nadav Har'El
a20991ad62 storage_service: correct missing exception in logging rebuild failure
When failing to rebuild a node, we would print the error with the useless
explanation "<no exception>". The problem was a typo in the logging command
which used std::current_exception() - which wasn't relevant in that point -
instead of "ep".

Refs #8089

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210314113118.1690132-1-nyh@scylladb.com>
(cherry picked from commit d73934372d)
2021-03-21 10:51:04 +02:00
Hagit Segev
ff585e0834 release: prepare for 4.4.0 2021-03-19 16:42:30 +02:00
Nadav Har'El
9a58deaaa2 alternator-test: increase read timeout and avoid retries
By default the boto3 library waits up to 60 second for a response,
and if got no response, it sends the same request again, multiple
times. We already noticed in the past that it retries too many times
thus slowing down failures, so in our test configuration lowered the
number of retries to 3, but the setting of 60-second-timeout plus
3 retries still causes two problems:

  1. When the test machine and the build are extremely slow, and the
     operation is long (usually, CreateTable or DeleteTable involving
     multiple views), the 60 second timeout might not be enough.

  2. If the timeout is reached, boto3 silently retries the same operation.
     This retry may fail because the previous one really succeeded at
     least partially! The symptom is tests which report an error when
     creating a table which already exists, or deleting a table which
     dooesn't exist.

The solution in this patch is first of all to never do retries - if
a query fails on internal server error, or times out, just report this
failure immediately. We don't expect to see transient errors during
local tests, so this is exactly the right behavior.
The second thing we do is to increase the default timeout. If 1 minute
was not enough, let's raise it to 5 minutes. 5 minutes should be enough
for every operation (famous last words...).

Even if 5 minutes is not enough for something, at least we'll now see
the timeout errors instead of some wierd errors caused by retrying an
operation which was already almost done.

Fixes #8135

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210222125630.1325011-1-nyh@scylladb.com>
(cherry picked from commit 0b2cf21932)
2021-03-19 00:08:27 +02:00
Raphael S. Carvalho
ee48ed2864 LCS: reshape: tolerate more sstables in level 0 with relaxed mode
Relaxed mode, used during initialization, of reshape only tolerates min_threshold
(default: 4) L0 sstables. However, relaxed mode should tolerate more sstables in
level 0, otherwise boot will have to reshape level 0 every time it crosses the
min threshold. So let's make LCS reshape tolerate a max of max_threshold and 32.
This change is beneficial because once table is populated, LCS regular compaction
can decide to merge those sstables in level 0 into level 1 instead, therefore
reducing WA.

Refs #8297.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210318131442.17935-1-raphaelsc@scylladb.com>
(cherry picked from commit e53cedabb1)
2021-03-18 19:19:46 +02:00
Raphael S. Carvalho
03f2eb529f compaction_manager: Fix performance of cleanup compaction due to unlimited parallelism
Prior to 463d0ab, only one table could be cleaned up at a time on a given shard.
Since then, all tables belonging to a given keyspace are cleaned up in parallel.
Cleanup serialization on each shard was enforced with a semaphore, which was
incorrectly removed by the patch aforementioned.

So space requirement for cleanup to succeed can be up to the size of keyspace,
increasing the chances of node running out of space.

Node could also run out of memory if there are tons of tables in the keyspace.
Memory requirement is at least #_of_tables * 128k (not taking into account write
behind, etc). With 5k tables, it's ~0.64G per shard.

Also all tables being cleaned up in parallel will compete for the same
disk and cpu bandwidth, so making them all much slower, and consequently
the operation time is significantly higher.

This problem was detected with cleanup, but scrub and upgrade go through the
same rewrite procedure, so they're affected by exact the same problem.

Fixes #8247.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210312162223.149993-1-raphaelsc@scylladb.com>
(cherry picked from commit 7171244844)
2021-03-18 14:28:57 +02:00
Dejan Mircevski
c270014121 cql3/expr: Handle IN ? bound to null
Previously, we crashed when the IN marker is bound to null.  Throw
invalid_request_exception instead.

This is a 4.4 backport of the #8265 fix.

Tests: unit (dev)

(cherry picked from commit 8db24fc03b)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #8307

Fixes #8265.
2021-03-18 10:35:55 +02:00
Pavel Emelyanov
35804855f9 test: Fix exit condition of row_cache_test::test_eviction_from_invalidated
The test populates the cache, then invalidates it, then tries to push
huge (10x times the segment size) chunks into seastar memory hoping that
the invalid entries will be evicted. The exit condition on the last
stage is -- total memory of the region (sum of both -- used and free)
becomes less than the size of one chunk.

However, the condition is wrong, because cache usually contains a dummy
entry that's not necessarily on lru and on some test iteration it may
happen that

  evictable size < chunk size < evictable size + dummy size

In this case test fails with bad_alloc being unable to evict the memory
from under the dummy.

fixes: #7959
tests: unit(row_cache_test), unit(the failing case with the triggering
       seed from the issue + 200 times more with random seeds)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210309134138.28099-1-xemul@scylladb.com>
(cherry picked from commit 096e452db9)
2021-03-16 23:42:11 +01:00
Piotr Sarna
a810e57684 Merge 'Alternator: support nested attribute paths...
in all expressions' from Nadav Har'El.

This series fixes #5024 - which is about adding support for nested attribute
paths (e.g., a.b.c[2]) to Alternator.  The series adds complete support for this
feature in ProjectionExpression, ConditionExpression, FilterExpression and
UpdateExpression - and also its combination with ReturnValues. Many relevant
tests - and also some new tests added in this series - now pass.

The first patch in the series fixes #8043 a bug in some error cases in
conditions, which was discovered while working in this series, and is
conceptually separate from the rest of the series.

Closes #8066

* github.com:scylladb/scylla:
  alternator: correct implemention of UpdateItem with nested attributes and ReturnValues
  alternator: fix bug in ReturnValues=UPDATED_NEW
  alternator: implemented nested attribute paths in UpdateExpression
  alternator: limit the depth of nested paths
  alternator: prepare for UpdateItem nested attribute paths
  alternator: overhaul ProjectionExpression hierarchy implementation
  alternator: make parsed::path object printable
  alternator-test: a few more ProjectionExpression conflict test cases
  alternator-test: improve tests for nested attributes in UpdateExpression
  alternator: support attribute paths in ConditionExpression, FilterExpression
  alternator-test: improve tests for nested attributes in ConditionExpression
  alternator: support attribute paths in ProjectionExpression
  alternator: overhaul attrs_to_get handling
  alternator-test: additional tests for attribute paths in ProjectionExpression
  alternator-test: harden attribute-path tests for ProjectionExpression
  alternator: fix ValidationException in FilterExpression - and more

(cherry picked from commit cbbb7f08a0)
2021-03-15 18:40:12 +02:00
Nadav Har'El
7b19cc17d6 updated tools/java submodule
* tools/java 8080009794...56470fda09 (1):
  > sstableloader: Only escape column names once

Backporting fix to Refs #8229.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-03-15 16:47:47 +02:00
Benny Halevy
101e0e611b storage_service: use atomic_vector for lifecycle_subscribers
So it can be modified while walked to dispatch
subscribed event notifications.

In #8143, there is a race between scylla shutdown and
notify_down(), causing use-after-free of cql_server.

Using an atomic vector itstead and futurizing
unregister_subscriber allows deleting from _lifecycle_subscribers
while walked using atomic_vector::for_each.

Fixes #8143

Test: unit(release)
DTest:
  update_cluster_layout_tests:TestUpdateClusterLayout.add_node_with_large_partition4_test(release)
  materialized_views_test.py:TestMaterializedViews.double_node_failure_during_mv_insert_4_nodes_test(release)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210224164647.561493-2-bhalevy@scylladb.com>
(cherry picked from commit baf5d05631)
2021-03-15 15:25:18 +02:00
Benny Halevy
ba23eb733d cql_server: event_notifier: unregister_subscriber in stop
Move unregister_subscriber from the destructor to stop
as preparation for moving storage_service lifescyle_subscribers
to atomic_vector and futurizing unregister_subscriber.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210224164647.561493-1-bhalevy@scylladb.com>
(cherry picked from commit 1ed04affab)

Ref #8143.
2021-03-15 15:25:10 +02:00
Hagit Segev
ec20ff0988 release: prepare for 4.4.rc4 2021-03-11 23:57:55 +02:00
Raphael S. Carvalho
3613b082bc compaction: Prevent cleanup and regular from compacting the same sstable
Due to regression introduced by 463d0ab, regular can compact in parallel a sstable
being compacted by cleanup, scrub or upgrade.

This redundancy causes resources to be wasted, write amplification is increased
and so does the operation time, etc.

That's a potential source of data resurrection because the now-owned data from
a sstable being compacted by both cleanup and regular will still exist in the
node afterwards, so resurrection can happen if node regains ownership.

Fixes #8155.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210225172641.787022-1-raphaelsc@scylladb.com>
(cherry picked from commit 2cf0c4bbf1)

Includes fixup patch:

compaction_manager: Fix use-after-free in rewrite_sstables()

Use-after-free introduced by 2cf0c4bbf1.
That's because compacting is moved into then_wrapped() lambda, so it's
potentially freed on the next iteration of repeat().

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210309232940.433490-1-raphaelsc@scylladb.com>
(cherry picked from commit f7cc431477)
2021-03-11 08:24:01 +02:00
Asias He
b94208009f gossip: Handle timeout error in gossiper::do_shadow_round
Currently, the rpc timeout error for the GOSSIP_GET_ENDPOINT_STATES verb
is not handled in gossiper::do_shadow_round. If the
GOSSIP_GET_ENDPOINT_STATES rpc call to any of the remote nodes goes
timeout, gossiper::do_shadow_round will throw an exception and fail the
whole boot up process.

It is fine that some of the remote nodes timeout in shadow round. It is
not a must to talk to all nodes.

This patch fixes an issue we saw recently in our sct tests:

```
INFO    | scylla[1579]: [shard 0] init - Shutting down gossiping
INFO    | scylla[1579]: [shard 0] gossip - gossip is already stopped
INFO    | scylla[1579]: [shard 0] init - Shutting down gossiping was successful
...

ERR     | scylla[1579]: [shard 0] init - Startup failed: seastar::rpc::timeout_error (rpc call timed out)
```

Fixes #8187

Closes #8213

(cherry picked from commit dc40184faa)
2021-03-10 16:27:47 +02:00
Nadav Har'El
05c266c02a Merge 'Fix alternator streams management regression' from Calle Wilund
Refs: #8012
Fixes: #8210

With the update to CDC generation management, the way we retrieve and process these changed.
One very bad bug slipped through though; the code for getting versioned streams did not take into
account the late-in-pr change to make clustering of CDC gen timestamps reversed. So our alternator
shard info became quite rump-stumped, leading to more or less no data depending on when generations
changed w.r. data.

Also, the way we track the above timestamps changed, so we should utilize this for our end-of-iterator check.

Closes #8209

* github.com:scylladb/scylla:
  alternator::streams: Use better method for generation timestamp
  system_distributed_keyspace: Add better routine to get latest cdc gen. timestamp
  system_distributed_keyspace: Fix cdc_get_versioned_streams timestamp range

(cherry picked from commit e12e57c915)
2021-03-10 16:27:43 +02:00
Avi Kivity
4b7319a870 Merge 'Split CDC streams table partitions into clustered rows ' from Kamil Braun
Until now, the lists of streams in the `cdc_streams_descriptions` table
for a given generation were stored in a single collection. This solution
has multiple problems when dealing with large clusters (which produce
large lists of streams):
1. large allocations
2. reactor stalls
3. mutations too large to even fit in commitlog segments

This commit changes the schema of the table as described in issue #7993.
The streams are grouped according to token ranges, each token range
being represented by a separate clustering row. Rows are inserted in
reasonably large batches for efficiency.

The table is renamed to enable easy upgrade. On upgrade, the latest CDC
generation's list of streams will be (re-)inserted into the new table.

Yet another table is added: one that contains only the generation
timestamps clustered in a single partition. This makes it easy for CDC
clients to learn about new generations. It also enables an elegant
two-phase insertion procedure of the generation description: first we
insert the streams; only after ensuring that a quorum of replicas
contains them, we insert the timestamp. Thus, if any client observes a
timestamp in the timestamps table (even using a ONE query),
it means that a quorum of replicas must contain the list of streams.

---

Nodes automatically ensure that the latest CDC generation's list of
streams is present in the streams description table. When a new
generation appears, we only need to update the table for this
generation; old generations are already inserted.

However, we've changed the description table (from
`cdc_streams_descriptions` to `cdc_streams_descriptions_v2`). The
existing mechanism only ensures that the latest generation appears in
the new description table. We add an additional procedure that
rewrites the older generations as well, if we find that it is necessary
to do so (i.e. when some CDC log tables may contain data in these
generations).

Closes #8116

* github.com:scylladb/scylla:
  tests: add a simple CDC cql pytest
  cdc: add config option to disable streams rewriting
  cdc: rewrite streams to the new description table
  cql3: query_processor: improve internal paged query API
  cdc: introduce no_generation_data_exception exception type
  docs: cdc: mention system.cdc_local table
  cdc: coroutinize do_update_streams_description
  sys_dist_ks: split CDC streams table partitions into clustered rows
  cdc: use chunked_vector for streams in streams_version
  cdc: remove `streams_version::expired` field
  system_distributed_keyspace: use mutation API to insert CDC streams
  storage_service: don't use `sys_dist_ks` before it is started

(cherry picked from commit f0950e023d)
2021-03-09 14:08:44 +02:00
Takuya ASADA
5ce71f3a29 scylla_raid_setup: don't abort using raiddev when array_state is 'clear'
On Ubuntu 20.04 AMI, scylla_raid_setup --raiddev /dev/md0 causes
'/dev/md0 is already using' (issue #7627).
So we merged the patch to find free mdX (587b909).

However, look into /proc/mdstat of the AMI, it actually says no active md device available:

ubuntu@ip-10-0-0-43:~$ cat /proc/mdstat
Personalities :
unused devices: <none>

We currently decide mdX is used when os.path.exists('/sys/block/mdX/md/array_state') == True,
but according to kernel doc, the file may available even array is STOPPED:

    clear

        No devices, no size, no level
        Writing is equivalent to STOP_ARRAY ioctl
https://www.kernel.org/doc/html/v4.15/admin-guide/md.html

So we should also check array_state != 'clear', not just array_state
existance.

Fixes #8219

Closes #8220

(cherry picked from commit 2d9feaacea)
2021-03-08 14:28:58 +02:00
Pekka Enberg
f06f4f6ee1 Update tools/jmx submodule
* tools/jmx 2c95650...c510a56 (1):
  > APIBuilder: Unlock RW-lock in remove()
2021-03-04 14:36:45 +02:00
Hagit Segev
c2d9247574 release: prepare for 4.4.rc3 2021-03-04 13:38:43 +02:00
Raphael S. Carvalho
b4e393d215 compaction: Fix leak of expired sstable in the backlog tracker
expired sstables are skipped in the compaction setup phase, because they don't
need to be actually compacted, but rather only deleted at the end.
that is causing such sstables to not be removed from the backlog tracker,
meaning that backlog caused by expired sstables will not be removed even after
their deletion, which means shares will be higher than needed, making compaction
potentially more aggressive than it have to.

to fix this bug, let's manually register these sstables into the monitor,
such that they'll be removed from the tracker once compaction completes.

Fixes #6054.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210216203700.189362-1-raphaelsc@scylladb.com>
(cherry picked from commit 5206a97915)
2021-03-01 14:14:33 +02:00
Avi Kivity
1a2b7037cd Update seastar submodule
* seastar 74ae29bc17...2c884a7449 (1):
  > io_queue: Fix "delay" metrics

Fixes #8166.
2021-03-01 13:57:04 +02:00
Raphael S. Carvalho
048f5efe1c sstables: Fix TWCS reshape for windows with at least min_threshold sstables
TWCS reshape was silently ignoring windows which contain at least
min_threshold sstables (can happen with data segregation).
When resizing candidates, size of multi_window was incorrectly used and
it was always empty in this path, which means candidates was always
cleared.

Fixes #8147.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210224125322.637128-1-raphaelsc@scylladb.com>
(cherry picked from commit 21608bd677)
2021-02-28 17:20:26 +02:00
Takuya ASADA
056293b95f dist/debian: don't run dh_installinit for scylla-node-exporter when service name == package name
dh_installinit --name <service> is for forcing install debian/*.service
and debian/*.default that does not matches with package name.
And if we have subpackages, packager has responsibility to rename
debian/*.service to debian/<subpackage>.*service.

However, we currently mistakenly running
dh_installinit --name scylla-node-exporter for
debian/scylla-node-exporeter.service,
the packaging system tries to find destination package for the .service,
and does not find subpackage name on it, so it will pick first
subpackage ordered by name, scylla-conf.

To solve the issue, we just need to run dh_installinit without --name
when $product == 'scylla'.

Fixes #8163

Closes #8164

(cherry picked from commit aabc67e386)
2021-02-28 17:20:26 +02:00
Takuya ASADA
f96ea8e011 scylla_setup: allow running scylla_setup with strict umask setting
We currently deny running scylla_setup when umask != 0022.
To remove this limitation, run os.chmod(0o644) on every file creation
to allow reading from scylla user.

Note that perftune.yaml is not really needed to set 0644 since perftune.py is
running in root user, but setting it to align permission with other files.

Fixes #8049

Closes #8119

(cherry picked from commit f3a82f4685)
2021-02-26 08:49:59 +02:00
Hagit Segev
49cd0b87f0 release: prepare for 4.4.rc2 2021-02-24 19:15:29 +02:00
Asias He
0977a73ab2 messaging_service: Move gossip ack message verb to gossip group
Fix a scheduling group leak:

INFO [shard 0] gossip - gossiper::run sg=gossip
INFO [shard 0] gossip - gossiper::handle_ack_msg sg=statement
INFO [shard 0] gossip - gossiper::handle_syn_msg sg=gossip
INFO [shard 0] gossip - gossiper::handle_ack2_msg sg=gossip

After the fix:

INFO [shard 0] gossip - gossiper::run sg=gossip
INFO [shard 0] gossip - gossiper::handle_ack_msg sg=gossip
INFO [shard 0] gossip - gossiper::handle_syn_msg sg=gossip
INFO [shard 0] gossip - gossiper::handle_ack2_msg sg=gossip

Fixes #7986

Closes #8129

(cherry picked from commit 7018377bd7)
2021-02-24 14:11:16 +02:00
Pekka Enberg
9fc582ee83 Update seastar submodule
* seastar 572536ef...74ae29bc (3):
  > perftune.py: fix assignment after extend and add asserts
  > scripts/perftune.py: convert nic option in old perftune.yaml to list for compatibility
  > scripts/perftune.py: remove repeated items after merging options from file
Fixes #7968.
2021-02-23 15:18:00 +02:00
Avi Kivity
4be14c2249 Revert "repair: Make removenode safe by default"
This reverts commit 829b4c1438. It ended
up causing repair failures.

Fixes #7965.
2021-02-23 14:14:07 +02:00
Tomasz Grabiec
3160dd4b59 table: Fix schema mismatch between memtable reader and sstable writer
The schema used to create the sstable writer has to be the same as the
schema used by the reader, as the former is used to intrpret mutation
fragments produced by the reader.

Commit 9124a70 intorduced a deferring point between reader creation
and writer creation which can result in schema mismatch if there was a
concurrent alter.

This could lead to the sstable write to crash, or generate a corrupted
sstable.

Fixes #7994

Message-Id: <20210222153149.289308-1-tgrabiec@scylladb.com>
2021-02-23 13:48:33 +02:00
Avi Kivity
50a8eab1a2 Update seastar submodule
* seastar a287bb1a3...572536ef4 (1):
  > rpc: streaming sink: order outgoing messages

Fixes #7552.
2021-02-23 10:19:45 +02:00
Avi Kivity
04615436a0 Point seastar submodule at scylla-seastar.git
This allows us to backport Seastar fixes to this branch.
2021-02-23 10:17:22 +02:00
Takuya ASADA
d1ab37654e scylla_util.py: resolve /dev/root to get actual device on aws
When psutil.disk_paritions() reports / is /dev/root, aws_instance mistakenly
reports root partition is part of ephemeral disks, and RAID construction will
fail.
This prevents the error and reports correct free disks.

Fixes #8055

Closes #8040

(cherry picked from commit 32d4ec6b8a)
2021-02-21 16:22:51 +02:00
Nadav Har'El
b47bdb053d alternator: fix ValidationException in FilterExpression - and more
The first condition expressions we implemented in Alternator were the old
"Expected" syntax of conditional updates. That implementation had some
specific assumptions on how it handles errors: For example, in the "LT"
operator in "Expected", the second operand is always part of the query, so
an error in it (e.g., an unsupported type) resulted it a ValidationException
error.

When we implemented ConditionExpression and FilterExpression, we wrongly
used the same functions check_compare(), check_BETWEEN(), etc., to implement
them. This results in some inaccurate error handling. The worst example is
what happens when you use a FilterExpression with an expression such as
"x < y" - this filter is supposed to silently skip items whose "x" and "y"
attributes have unsupported or different types, but in our implementation
a bad type (e.g., a list) for y resulted in a ValidationException which
aborted the entire scan! Interestingly, in once case (that of BEGINS_WITH)
we actually noticed the slightly different behavior needed and implemented
the same operator twice - with ugly code duplication. But in other operators
we missed this problem completely.

This patch first adds extensive tests of how the different expressions
(Expected, QueryFilter, FilterExpression, ConditionExpression) and the
different operators handle various input errors - unsupported types,
missing items, incompatible types, etc. Importantly, the tests demonstrate
that there is often different behavior depending on whether the bad
input comes from the query, or from the item. Some of the new tests
fail before this patch, but others pass and were useful to verify that
the patch doesn't break anything that already worked correctly previously.
As usual, all the tests pass on Cassandra.

Finally, this patch *fixes* all these problems. The comparison functions
like check_compare() and check_BETWEEN() now not only take the operands,
they also take booleans saying if each of the operands came from the
query or from an item. The old-syntax caller (Expected or QueryFilter)
always say that the first operand is from the item and the second is
from the query - but in the new-syntax caller (ConditionExpression or
FilterExpression) any or all of the operands can come from the query
and need verification.

The old duplicated code for check_BEGINS_WITH() - which a TODO to remove
it - is finally removed. Instead we use the same idea of passing booleans
saying if each of its operands came from an item or from the query.

Fixes #8043

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 653610f4bc)
2021-02-21 09:25:01 +02:00
Piotr Sarna
e11ae8c58f test: fix a flaky timeout test depending on TTL
One of the USING TIMEOUT tests relied on a specific TTL value,
but that's fragile if the test runs on the boundary of 2 seconds.
Instead, the test case simply checks if the TTL value is present
and is greater than 0, which makes the test robust unless its execution
lasts for more than 1 million seconds, which is highly unlikely.

Fixes #8062

Closes #8063

(cherry picked from commit 2aa4631148)
2021-02-14 13:08:39 +02:00
Benny Halevy
e4132edef3 stream_session: prepare: fix missing string format argument
As seen in
mv_populating_from_existing_data_during_node_decommission_test dtest:
```
ERROR 2021-02-11 06:01:32,804 [shard 0] stream_session - failed to log message: fmt::v7::format_error (argument not found)
```

Fixes #8067

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210211100158.543952-1-bhalevy@scylladb.com>
(cherry picked from commit d01e7e7b58)
2021-02-14 13:08:20 +02:00
Shlomi Livne
492f0802fb scylla_io_setup did not configure pre tuned gce instances correctly
scylla_io_setup condition for nr_disks was using the bitwise operator
(&) instead of logical and operator (and) causing the io_properties
files to have incorrect values

Fixes #7341

Reviewed-by: Lubos Kosco <lubos@scylladb.com>
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>

Closes #8019

(cherry picked from commit 718976e794)
2021-02-14 13:08:00 +02:00
Takuya ASADA
34f22e1df1 dist/debian: install scylla-node-exporter.service correctly
node-exporter systemd unit name is "scylla-node-exporter.service", not
"node-exporter.service".

Fixes #8054

Closes #8053

(cherry picked from commit 856fe12e13)
2021-02-14 13:07:29 +02:00
Nadav Har'El
acb921845f cql-pytest: fix flaky timeuuid_test.py
The test timeuuid_test.py::testTimeuuid sporadically failed, and it turns out
the reason was a bug in the test - which this patch fixes.

The buggy test created a timeuuid and then compared the time stored in it
to the result of the dateOf() CQL function. The problem is that dateOf()
returns a CQL "timestamp", which has millisecond resolution, while the
timeuuid *may* have finer than millisecond resolution. The reason why this
test rarely failed is that in our implementation, the timeuuid almost
always gets a millisecond-resolution timestamp. Only if now() gets called
more than once in one millisecond, does it pick a higher time incremented
by less than a millisecond.

What this patch does is to truncate the time read from the timeuuid to
millisecond resolution, and only then compare it to the result of dateOf().
We cannot hope for more.

Fixes #8060

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210211165046.878371-1-nyh@scylladb.com>
(cherry picked from commit a03a8a89a9)
2021-02-14 13:06:59 +02:00
Botond Dénes
5b6c284281 query: use local limit for non-limited queries in mixed cluster
Since fea5067df we enforce a limit on the memory consumption of
otherwise non-limited queries like reverse and non-paged queries. This
limit is sent down to the replicas by the coordinator, ensuring that
each replica is working with the same limit. This however doesn't work
in a mixed cluster, when upgrading from a version which doesn't have
this series. This has been worked around by falling back to the old
max_result_size constant of 1MB in mixed clusters. This however resulted
in a regression when upgrading from a pre fea5067df to a post fea5067df
one. Pre fea5067df already had a limit for reverse queries, which was
generalized to also cover non-paged ones too by fea5067df.
The regression manifested in previously working reverse queries being
aborted. This happened because even though the user has set a generous
limit for them before the upgrade, in the mix cluster replicas fall back
to the much stricter 1MB limit temporarily ignoring the configured limit
if the coordinator is an old node. This patch solves this problem by
using the locally configured limit instead of the max_result_size
constant. This means that the user has to take extra care to configure
the same limit on all replicas, but at least they will have working
reverse queries during the upgrade.

Fixes: #8022

Tests: unit(release), manual test by user who reported the issue
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210209075947.1004164-1-bdenes@scylladb.com>
(cherry picked from commit 3d001b5587)
2021-02-09 18:06:43 +02:00
Yaron Kaikov
7d15319a8a release: prepare for 4.4
Update Docker parameters for the 4.4 release.

Closes #7932
2021-02-09 09:42:53 +02:00
Amnon Heiman
a06412fd24 API: Fix aggregation in column_familiy
Few method in column_familiy API were doing the aggregation wrong,
specifically, bloom filter disk size.

The issue is not always visible, it happens when there are multiple
filter files per shard.

Fixes #4513

Signed-off-by: Amnon Heiman <amnon@scylladb.com>

Closes #8007

(cherry picked from commit 4498bb0a48)
2021-02-08 17:03:45 +02:00
Avi Kivity
2500dd1dc4 Merge 'dist/offline_installer/redhat: fix umask error' from Takuya ASADA
Since makeself script changes current umask, scylla_setup causes
"scylla does not work with current umask setting (0077)" error.
To fix that we need use latest version of makeself, and specfiy --keep-umask
option.

Fixes #6243

Closes #6244

* github.com:scylladb/scylla:
  dist/offline_redhat: fix umask error
  dist/offline_installer/redhat: support cross build

(cherry picked from commit bb202db1ff)
2021-02-01 13:03:06 +02:00
Hagit Segev
fd868722dd release: prepare for 4.4.rc1 2021-01-31 14:09:44 +02:00
Pekka Enberg
f470c5d4de Update tools/python3 submodule
* tools/python3 c579207...199ac90 (1):
  > dist: debian: adjust .orig tarball name for .rc releases
2021-01-25 09:26:33 +02:00
Pekka Enberg
3677a72a21 Update tools/python3 submodule
* tools/python3 1763a1a...c579207 (1):
  > dist/debian: handle rc version correctly
2021-01-22 09:36:54 +02:00
Hagit Segev
46e6273821 release: prepare for 4.4.rc0 2021-01-18 20:29:53 +02:00
Jenkins Promoter
ce7e31013c release: prepare for 4.4 2021-01-18 15:49:55 +02:00
587 changed files with 14464 additions and 31201 deletions

View File

@@ -1,18 +1,13 @@
# Contributing to Scylla
# Contributing
## Asking questions or requesting help
Use the [Scylla Users mailing list](https://groups.google.com/g/scylladb-users) or the [Slack workspace](http://slack.scylladb.com) for general questions and help.
Join the [Scylla Developers mailing list](https://groups.google.com/g/scylladb-dev) for deeper technical discussions and to discuss your ideas for contributions.
Use the [ScyllaDB user mailing list](https://groups.google.com/forum/#!forum/scylladb-users) or the [Slack workspace](http://slack.scylladb.com) for general questions and help.
## Reporting an issue
Please use the [issue tracker](https://github.com/scylladb/scylla/issues/) to report issues or to suggest features. Fill in as much information as you can in the issue template, especially for performance problems.
Please use the [Issue Tracker](https://github.com/scylladb/scylla/issues/) to report issues. Fill in as much information as you can in the issue template, especially for performance problems.
## Contributing code to Scylla
## Contributing Code to Scylla
Before you can contribute code to Scylla for the first time, you should sign the [Contributor License Agreement](https://www.scylladb.com/open-source/contributor-agreement/) and send the signed form cla@scylladb.com. You can then submit your changes as patches to the to the [scylladb-dev mailing list](https://groups.google.com/forum/#!forum/scylladb-dev) or as a pull request to the [Scylla project on github](https://github.com/scylladb/scylla).
If you need help formatting or sending patches, [check out these instructions](https://github.com/scylladb/scylla/wiki/Formatting-and-sending-patches).
The Scylla C++ source code uses the [Seastar coding style](https://github.com/scylladb/seastar/blob/master/coding-style.md) so please adhere to that in your patches. Note that Scylla code is written with `using namespace seastar`, so should not explicitly add the `seastar::` prefix to Seastar symbols. You will usually not need to add `using namespace seastar` to new source files, because most Scylla header files have `#include "seastarx.hh"`, which does this.
To contribute code to Scylla, you need to sign the [Contributor License Agreement](https://www.scylladb.com/open-source/contributor-agreement/) and send your changes as [patches](https://github.com/scylladb/scylla/wiki/Formatting-and-sending-patches) to the [mailing list](https://groups.google.com/forum/#!forum/scylladb-dev). We don't accept pull requests on GitHub.

View File

@@ -1 +0,0 @@
Dedicated to the memory of Alberto José Araújo, a coworker and a friend.

View File

@@ -5,5 +5,3 @@ It includes files from https://github.com/antonblanchard/crc32-vpmsum (author An
These files are located in utils/arch/powerpc/crc32-vpmsum. Their license may be found in licenses/LICENSE-crc32-vpmsum.TXT.
It includes modified code from https://gitbox.apache.org/repos/asf?p=cassandra-dtest.git (owned by The Apache Software Foundation)
It includes modified tests from https://github.com/etcd-io/etcd.git (owned by The etcd Authors)

View File

@@ -42,7 +42,7 @@ For further information, please see:
* [Docker image build documentation] for information on how to build Docker images.
[developer documentation]: HACKING.md
[build documentation]: docs/guides/building.md
[build documentation]: docs/building.md
[docker image build documentation]: dist/docker/redhat/README.md
## Running Scylla
@@ -65,7 +65,7 @@ $ ./tools/toolchain/dbuild ./build/release/scylla --help
## Testing
See [test.py manual](docs/guides/testing.md).
See [test.py manual](docs/testing.md).
## Scylla APIs and compatibility
By default, Scylla is compatible with Apache Cassandra and its APIs - CQL and

View File

@@ -1,7 +1,7 @@
#!/bin/sh
PRODUCT=scylla
VERSION=4.5.7
VERSION=4.4.9
if test -f version
then

2
abseil

Submodule abseil updated: 9c6a50fdd8...1e3d25b265

View File

@@ -62,14 +62,6 @@ static std::string apply_sha256(std::string_view msg) {
return to_hex(hasher.finalize());
}
static std::string apply_sha256(const std::vector<temporary_buffer<char>>& msg) {
sha256_hasher hasher;
for (const temporary_buffer<char>& buf : msg) {
hasher.update(buf.get(), buf.size());
}
return to_hex(hasher.finalize());
}
static std::string format_time_point(db_clock::time_point tp) {
time_t time_point_repr = db_clock::to_time_t(tp);
std::string time_point_str;
@@ -99,7 +91,7 @@ void check_expiry(std::string_view signature_date) {
std::string get_signature(std::string_view access_key_id, std::string_view secret_access_key, std::string_view host, std::string_view method,
std::string_view orig_datestamp, std::string_view signed_headers_str, const std::map<std::string_view, std::string_view>& signed_headers_map,
const std::vector<temporary_buffer<char>>& body_content, std::string_view region, std::string_view service, std::string_view query_string) {
std::string_view body_content, std::string_view region, std::string_view service, std::string_view query_string) {
auto amz_date_it = signed_headers_map.find("x-amz-date");
if (amz_date_it == signed_headers_map.end()) {
throw api_error::invalid_signature("X-Amz-Date header is mandatory for signature verification");
@@ -137,7 +129,8 @@ future<std::string> get_key_from_roles(cql3::query_processor& qp, std::string us
auth::meta::roles_table::qualified_name, auth::meta::roles_table::role_col_name);
auto cl = auth::password_authenticator::consistency_for_user(username);
return qp.execute_internal(query, cl, auth::internal_distributed_query_state(), {sstring(username)}, true).then_wrapped([username = std::move(username)] (future<::shared_ptr<cql3::untyped_result_set>> f) {
auto& timeout = auth::internal_distributed_timeout_config();
return qp.execute_internal(query, cl, timeout, {sstring(username)}, true).then_wrapped([username = std::move(username)] (future<::shared_ptr<cql3::untyped_result_set>> f) {
auto res = f.get0();
auto salted_hash = std::optional<sstring>();
if (res->empty()) {

View File

@@ -39,7 +39,7 @@ using key_cache = utils::loading_cache<std::string, std::string>;
std::string get_signature(std::string_view access_key_id, std::string_view secret_access_key, std::string_view host, std::string_view method,
std::string_view orig_datestamp, std::string_view signed_headers_str, const std::map<std::string_view, std::string_view>& signed_headers_map,
const std::vector<temporary_buffer<char>>& body_content, std::string_view region, std::string_view service, std::string_view query_string);
std::string_view body_content, std::string_view region, std::string_view service, std::string_view query_string);
future<std::string> get_key_from_roles(cql3::query_processor& qp, std::string username);

View File

@@ -80,9 +80,6 @@ public:
static api_error trimmed_data_access_exception(std::string msg) {
return api_error("TrimmedDataAccessException", std::move(msg));
}
static api_error request_limit_exceeded(std::string msg) {
return api_error("RequestLimitExceeded", std::move(msg));
}
static api_error internal(std::string msg) {
return api_error("InternalServerError", std::move(msg), reply::status_type::internal_server_error);
}

View File

@@ -1998,7 +1998,7 @@ void attribute_path_map_add(const char* source, attribute_path_map<T>& map, cons
typename node::members_t& members = h->get_members();
auto it = members.find(member);
if (it == members.end()) {
it = members.insert({member, std::make_unique<node>()}).first;
it = members.insert({member, make_shared<node>()}).first;
}
h = it->second.get();
},
@@ -2013,7 +2013,7 @@ void attribute_path_map_add(const char* source, attribute_path_map<T>& map, cons
typename node::indexes_t& indexes = h->get_indexes();
auto it = indexes.find(index);
if (it == indexes.end()) {
it = indexes.insert({index, std::make_unique<node>()}).first;
it = indexes.insert({index, make_shared<node>()}).first;
}
h = it->second.get();
}
@@ -2442,8 +2442,8 @@ static bool hierarchy_actions(
if (newv) {
rjson::set_with_string_name(v, attr, std::move(*newv));
} else {
// Removing a.b when a is a map but a.b doesn't exist
// is silently ignored. It's not considered an error.
throw api_error::validation(format("Can't remove document path {} - not present in item",
subh.get_value()._path));
}
} else {
throw api_error::validation(format("UpdateExpression: document paths not valid for this item:{}", h));
@@ -2817,7 +2817,7 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
struct table_requests {
schema_ptr schema;
db::consistency_level cl;
::shared_ptr<const attrs_to_get> attrs_to_get;
attrs_to_get attrs_to_get;
struct single_request {
partition_key pk;
clustering_key ck;
@@ -2832,7 +2832,7 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
tracing::add_table_name(trace_state, sstring(executor::KEYSPACE_NAME_PREFIX) + rs.schema->cf_name(), rs.schema->cf_name());
rs.cl = get_read_consistency(it->value);
std::unordered_set<std::string> used_attribute_names;
rs.attrs_to_get = ::make_shared<const attrs_to_get>(calculate_attrs_to_get(it->value, used_attribute_names));
rs.attrs_to_get = calculate_attrs_to_get(it->value, used_attribute_names);
verify_all_are_used(request, "ExpressionAttributeNames", used_attribute_names, "GetItem");
auto& keys = (it->value)["Keys"];
for (const rjson::value& key : keys.GetArray()) {
@@ -2862,7 +2862,7 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
future<std::tuple<std::string, std::optional<rjson::value>>> f = _proxy.query(rs.schema, std::move(command), std::move(partition_ranges), rs.cl,
service::storage_proxy::coordinator_query_options(executor::default_timeout(), permit, client_state, trace_state)).then(
[schema = rs.schema, partition_slice = std::move(partition_slice), selection = std::move(selection), attrs_to_get = rs.attrs_to_get] (service::storage_proxy::coordinator_query_result qr) mutable {
std::optional<rjson::value> json = describe_single_item(schema, partition_slice, *selection, *qr.query_result, *attrs_to_get);
std::optional<rjson::value> json = describe_single_item(schema, partition_slice, *selection, *qr.query_result, std::move(attrs_to_get));
return make_ready_future<std::tuple<std::string, std::optional<rjson::value>>>(
std::make_tuple(schema->cf_name(), std::move(json)));
});
@@ -3214,7 +3214,7 @@ static future<executor::request_return_type> do_query(service::storage_proxy& pr
auto query_state_ptr = std::make_unique<service::query_state>(client_state, trace_state, std::move(permit));
command->slice.options.set<query::partition_slice::option::allow_short_read>();
auto query_options = std::make_unique<cql3::query_options>(cl, std::vector<cql3::raw_value>{});
auto query_options = std::make_unique<cql3::query_options>(cl, infinite_timeout_config, std::vector<cql3::raw_value>{});
query_options = std::make_unique<cql3::query_options>(std::move(query_options), std::move(paging_state));
auto p = service::pager::query_pagers::pager(schema, selection, *query_state_ptr, *query_options, command, std::move(partition_ranges), nullptr);
@@ -3909,6 +3909,22 @@ future<> executor::create_keyspace(std::string_view keyspace_name) {
});
}
static tracing::trace_state_ptr create_tracing_session() {
tracing::trace_state_props_set props;
props.set<tracing::trace_state_props::full_tracing>();
return tracing::tracing::get_local_tracing_instance().create_session(tracing::trace_type::QUERY, props);
}
tracing::trace_state_ptr executor::maybe_trace_query(client_state& client_state, sstring_view op, sstring_view query) {
tracing::trace_state_ptr trace_state;
if (tracing::tracing::get_local_tracing_instance().trace_next_query()) {
trace_state = create_tracing_session();
tracing::add_query(trace_state, query);
tracing::begin(trace_state, format("Alternator {}", op), client_state.get_client_address());
}
return trace_state;
}
future<> executor::start() {
// Currently, nothing to do on initialization. We delay the keyspace
// creation (create_keyspace()) until a table is actually created.

View File

@@ -53,10 +53,6 @@ namespace service {
class storage_service;
}
namespace cdc {
class metadata;
}
namespace alternator {
class rmw_operation;
@@ -109,12 +105,17 @@ template<typename T>
class attribute_path_map_node {
public:
using data_t = T;
// We need the extra unique_ptr<> here because libstdc++ unordered_map
// doesn't work with incomplete types :-(
using members_t = std::unordered_map<std::string, std::unique_ptr<attribute_path_map_node<T>>>;
// We need the extra shared_ptr<> here because libstdc++ unordered_map
// doesn't work with incomplete types :-( We couldn't use lw_shared_ptr<>
// because it doesn't work for incomplete types either. We couldn't use
// std::unique_ptr<> because it makes the entire object uncopyable. We
// don't often need to copy such a map, but we do have some code that
// copies an attrs_to_get object, and is hard to find and remove.
// The shared_ptr should never be null.
using members_t = std::unordered_map<std::string, seastar::shared_ptr<attribute_path_map_node<T>>>;
// The indexes list is sorted because DynamoDB requires handling writes
// beyond the end of a list in index order.
using indexes_t = std::map<unsigned, std::unique_ptr<attribute_path_map_node<T>>>;
using indexes_t = std::map<unsigned, seastar::shared_ptr<attribute_path_map_node<T>>>;
// The prohibition on "overlap" and "conflict" explained above means
// That only one of data, members or indexes is non-empty.
std::optional<std::variant<data_t, members_t, indexes_t>> _content;
@@ -144,7 +145,6 @@ class executor : public peering_sharded_service<executor> {
service::migration_manager& _mm;
db::system_distributed_keyspace& _sdks;
service::storage_service& _ss;
cdc::metadata& _cdc_metadata;
// An smp_service_group to be used for limiting the concurrency when
// forwarding Alternator request between shards - if necessary for LWT.
smp_service_group _ssg;
@@ -157,8 +157,8 @@ public:
static constexpr auto KEYSPACE_NAME_PREFIX = "alternator_";
static constexpr std::string_view INTERNAL_TABLE_PREFIX = ".scylla.alternator.";
executor(service::storage_proxy& proxy, service::migration_manager& mm, db::system_distributed_keyspace& sdks, service::storage_service& ss, cdc::metadata& cdc_metadata, smp_service_group ssg)
: _proxy(proxy), _mm(mm), _sdks(sdks), _ss(ss), _cdc_metadata(cdc_metadata), _ssg(ssg) {}
executor(service::storage_proxy& proxy, service::migration_manager& mm, db::system_distributed_keyspace& sdks, service::storage_service& ss, smp_service_group ssg)
: _proxy(proxy), _mm(mm), _sdks(sdks), _ss(ss), _ssg(ssg) {}
future<request_return_type> create_table(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
future<request_return_type> describe_table(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
@@ -187,6 +187,8 @@ public:
future<> create_keyspace(std::string_view keyspace_name);
static tracing::trace_state_ptr maybe_trace_query(client_state& client_state, sstring_view op, sstring_view query);
static sstring table_name(const schema&);
static db::timeout_clock::time_point default_timeout();
static void set_default_timeout(db::timeout_clock::duration timeout);

View File

@@ -22,8 +22,6 @@
#include "alternator/server.hh"
#include "log.hh"
#include <seastar/http/function_handlers.hh>
#include <seastar/http/short_streams.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/json/json_elements.hh>
#include "seastarx.hh"
#include "error.hh"
@@ -61,40 +59,6 @@ inline std::vector<std::string_view> split(std::string_view text, char separator
return tokens;
}
// Handle CORS (Cross-origin resource sharing) in the HTTP request:
// If the request has the "Origin" header specifying where the script which
// makes this request comes from, we need to reply with the header
// "Access-Control-Allow-Origin: *" saying that this (and any) origin is fine.
// Additionally, if preflight==true (i.e., this is an OPTIONS request),
// the script can also "request" in headers that the server allows it to use
// some HTTP methods and headers in the followup request, and the server
// should respond by "allowing" them in the response headers.
// We also add the header "Access-Control-Expose-Headers" to let the script
// access additional headers in the response.
// This handle_CORS() should be used when handling any HTTP method - both the
// usual GET and POST, and also the "preflight" OPTIONS method.
static void handle_CORS(const request& req, reply& rep, bool preflight) {
if (!req.get_header("origin").empty()) {
rep.add_header("Access-Control-Allow-Origin", "*");
// This is the list that DynamoDB returns for expose headers. I am
// not sure why not just return "*" here, what's the risk?
rep.add_header("Access-Control-Expose-Headers", "x-amzn-RequestId,x-amzn-ErrorType,x-amzn-ErrorMessage,Date");
if (preflight) {
sstring s = req.get_header("Access-Control-Request-Headers");
if (!s.empty()) {
rep.add_header("Access-Control-Allow-Headers", std::move(s));
}
s = req.get_header("Access-Control-Request-Method");
if (!s.empty()) {
rep.add_header("Access-Control-Allow-Methods", std::move(s));
}
// Our CORS response never change anyway, let the browser cache it
// for two hours (Chrome's maximum):
rep.add_header("Access-Control-Max-Age", "7200");
}
}
}
// DynamoDB HTTP error responses are structured as follows
// https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Programming.Errors.html
// Our handlers throw an exception to report an error. If the exception
@@ -150,7 +114,6 @@ public:
api_handler(const api_handler&) = default;
future<std::unique_ptr<reply>> handle(const sstring& path,
std::unique_ptr<request> req, std::unique_ptr<reply> rep) override {
handle_CORS(*req, *rep, false);
return _f_handle(std::move(req), std::move(rep)).then(
[this](std::unique_ptr<reply> rep) {
rep->set_mime_type("application/x-amz-json-1.0");
@@ -187,7 +150,6 @@ public:
health_handler(seastar::gate& pending_requests) : gated_handler(pending_requests) {}
protected:
virtual future<std::unique_ptr<reply>> do_handle(const sstring& path, std::unique_ptr<request> req, std::unique_ptr<reply> rep) override {
handle_CORS(*req, *rep, false);
rep->set_status(reply::status_type::ok);
rep->write_body("txt", format("healthy: {}", req->get_header("Host")));
return make_ready_future<std::unique_ptr<reply>>(std::move(rep));
@@ -220,23 +182,7 @@ protected:
}
};
// The CORS (Cross-origin resource sharing) protocol can send an OPTIONS
// request before ("pre-flight") the main request. The response to this
// request can be empty, but needs to have the right headers (which we
// fill with handle_CORS())
class options_handler : public gated_handler {
public:
options_handler(seastar::gate& pending_requests) : gated_handler(pending_requests) {}
protected:
virtual future<std::unique_ptr<reply>> do_handle(const sstring& path, std::unique_ptr<request> req, std::unique_ptr<reply> rep) override {
handle_CORS(*req, *rep, true);
rep->set_status(reply::status_type::ok);
rep->write_body("txt", sstring(""));
return make_ready_future<std::unique_ptr<reply>>(std::move(rep));
}
};
future<> server::verify_signature(const request& req, const chunked_content& content) {
future<> server::verify_signature(const request& req) {
if (!_enforce_authorization) {
slogger.debug("Skipping authorization");
return make_ready_future<>();
@@ -312,7 +258,7 @@ future<> server::verify_signature(const request& req, const chunked_content& con
auto cache_getter = [&qp = _qp] (std::string username) {
return get_key_from_roles(qp, std::move(username));
};
return _key_cache.get_ptr(user, cache_getter).then([this, &req, &content,
return _key_cache.get_ptr(user, cache_getter).then([this, &req,
user = std::move(user),
host = std::move(host),
datestamp = std::move(datestamp),
@@ -322,7 +268,7 @@ future<> server::verify_signature(const request& req, const chunked_content& con
service = std::move(service),
user_signature = std::move(user_signature)] (key_cache::value_ptr key_ptr) {
std::string signature = get_signature(user, *key_ptr, std::string_view(host), req._method,
datestamp, signed_headers_str, signed_headers_map, content, region, service, "");
datestamp, signed_headers_str, signed_headers_map, req.content, region, service, "");
if (signature != std::string_view(user_signature)) {
_key_cache.remove(user);
@@ -331,91 +277,43 @@ future<> server::verify_signature(const request& req, const chunked_content& con
});
}
static tracing::trace_state_ptr create_tracing_session(tracing::tracing& tracing_instance) {
tracing::trace_state_props_set props;
props.set<tracing::trace_state_props::full_tracing>();
props.set_if<tracing::trace_state_props::log_slow_query>(tracing_instance.slow_query_tracing_enabled());
return tracing_instance.create_session(tracing::trace_type::QUERY, props);
}
// truncated_content_view() prints a potentially long chunked_content for
// debugging purposes. In the common case when the content is not excessively
// long, it just returns a view into the given content, without any copying.
// But when the content is very long, it is truncated after some arbitrary
// max_len (or one chunk, whichever comes first), with "<truncated>" added at
// the end. To do this modification to the string, we need to create a new
// std::string, so the caller must pass us a reference to one, "buf", where
// we can store the content. The returned view is only alive for as long this
// buf is kept alive.
static std::string_view truncated_content_view(const chunked_content& content, std::string& buf) {
constexpr size_t max_len = 1024;
if (content.empty()) {
return std::string_view();
} else if (content.size() == 1 && content.begin()->size() <= max_len) {
return std::string_view(content.begin()->get(), content.begin()->size());
} else {
buf = std::string(content.begin()->get(), std::min(content.begin()->size(), max_len)) + "<truncated>";
return std::string_view(buf);
}
}
static tracing::trace_state_ptr maybe_trace_query(service::client_state& client_state, sstring_view op, const chunked_content& query) {
tracing::trace_state_ptr trace_state;
tracing::tracing& tracing_instance = tracing::tracing::get_local_tracing_instance();
if (tracing_instance.trace_next_query() || tracing_instance.slow_query_tracing_enabled()) {
trace_state = create_tracing_session(tracing_instance);
std::string buf;
tracing::add_session_param(trace_state, "alternator_op", op);
tracing::add_query(trace_state, truncated_content_view(query, buf));
tracing::begin(trace_state, format("Alternator {}", op), client_state.get_client_address());
}
return trace_state;
}
future<executor::request_return_type> server::handle_api_request(std::unique_ptr<request> req) {
future<executor::request_return_type> server::handle_api_request(std::unique_ptr<request>&& req) {
_executor._stats.total_operations++;
sstring target = req->get_header(TARGET);
std::vector<std::string_view> split_target = split(target, '.');
//NOTICE(sarna): Target consists of Dynamo API version followed by a dot '.' and operation type (e.g. CreateTable)
std::string op = split_target.empty() ? std::string() : std::string(split_target.back());
// JSON parsing can allocate up to roughly 2x the size of the raw
// document, + a couple of bytes for maintenance.
// TODO: consider the case where req->content_length is missing. Maybe
// we need to take the content_length_limit and return some of the units
// when we finish read_content_and_verify_signature?
size_t mem_estimate = req->content_length * 2 + 8000;
auto units_fut = get_units(*_memory_limiter, mem_estimate);
if (_memory_limiter->waiters()) {
++_executor._stats.requests_blocked_memory;
}
auto units = co_await std::move(units_fut);
assert(req->content_stream);
chunked_content content = co_await httpd::read_entire_stream(*req->content_stream);
co_await verify_signature(*req, content);
if (slogger.is_enabled(log_level::trace)) {
std::string buf;
slogger.trace("Request: {} {} {}", op, truncated_content_view(content, buf), req->_headers);
}
auto callback_it = _callbacks.find(op);
if (callback_it == _callbacks.end()) {
_executor._stats.unsupported_operations++;
co_return api_error::unknown_operation(format("Unsupported operation {}", op));
}
if (_pending_requests.get_count() >= _max_concurrent_requests) {
_executor._stats.requests_shed++;
co_return api_error::request_limit_exceeded(format("too many in-flight requests (configured via max_concurrent_requests_per_shard): {}", _pending_requests.get_count()));
}
_pending_requests.enter();
auto leave = defer([this] { _pending_requests.leave(); });
//FIXME: Client state can provide more context, e.g. client's endpoint address
// We use unique_ptr because client_state cannot be moved or copied
executor::client_state client_state{executor::client_state::internal_tag()};
tracing::trace_state_ptr trace_state = maybe_trace_query(client_state, op, content);
tracing::trace(trace_state, op);
rjson::value json_request = co_await _json_parser.parse(std::move(content));
co_return co_await callback_it->second(_executor, client_state, trace_state,
make_service_permit(std::move(units)), std::move(json_request), std::move(req));
slogger.trace("Request: {} {} {}", op, req->content, req->_headers);
return verify_signature(*req).then([this, op, req = std::move(req)] () mutable {
auto callback_it = _callbacks.find(op);
if (callback_it == _callbacks.end()) {
_executor._stats.unsupported_operations++;
throw api_error::unknown_operation(format("Unsupported operation {}", op));
}
return with_gate(_pending_requests, [this, callback_it = std::move(callback_it), op = std::move(op), req = std::move(req)] () mutable {
//FIXME: Client state can provide more context, e.g. client's endpoint address
// We use unique_ptr because client_state cannot be moved or copied
return do_with(std::make_unique<executor::client_state>(executor::client_state::internal_tag()),
[this, callback_it = std::move(callback_it), op = std::move(op), req = std::move(req)] (std::unique_ptr<executor::client_state>& client_state) mutable {
tracing::trace_state_ptr trace_state = executor::maybe_trace_query(*client_state, op, req->content);
tracing::trace(trace_state, op);
// JSON parsing can allocate up to roughly 2x the size of the raw document, + a couple of bytes for maintenance.
// FIXME: by this time, the whole HTTP request was already read, so some memory is already occupied.
// Once HTTP allows working on streams, we should grab the permit *before* reading the HTTP payload.
size_t mem_estimate = req->content.size() * 3 + 8000;
auto units_fut = get_units(*_memory_limiter, mem_estimate);
if (_memory_limiter->waiters()) {
++_executor._stats.requests_blocked_memory;
}
return units_fut.then([this, callback_it = std::move(callback_it), &client_state, trace_state, req = std::move(req)] (semaphore_units<> units) mutable {
return _json_parser.parse(req->content).then([this, callback_it = std::move(callback_it), &client_state, trace_state,
units = std::move(units), req = std::move(req)] (rjson::value json_request) mutable {
return callback_it->second(_executor, *client_state, trace_state, make_service_permit(std::move(units)), std::move(json_request), std::move(req)).finally([trace_state] {});
});
});
});
});
});
}
void server::set_routes(routes& r) {
@@ -437,7 +335,6 @@ void server::set_routes(routes& r) {
// scan an entire subnet for nodes responding to the health request,
// or even just scan for open ports.
r.put(operation_type::GET, "/localnodes", new local_nodelist_handler(_pending_requests));
r.put(operation_type::OPTIONS, "/", new options_handler(_pending_requests));
}
//FIXME: A way to immediately invalidate the cache should be considered,
@@ -520,10 +417,9 @@ server::server(executor& exec, cql3::query_processor& qp)
}
future<> server::init(net::inet_address addr, std::optional<uint16_t> port, std::optional<uint16_t> https_port, std::optional<tls::credentials_builder> creds,
bool enforce_authorization, semaphore* memory_limiter, utils::updateable_value<uint32_t> max_concurrent_requests) {
bool enforce_authorization, semaphore* memory_limiter) {
_memory_limiter = memory_limiter;
_enforce_authorization = enforce_authorization;
_max_concurrent_requests = std::move(max_concurrent_requests);
if (!port && !https_port) {
return make_exception_future<>(std::runtime_error("Either regular port or TLS port"
" must be specified in order to init an alternator HTTP server instance"));
@@ -535,14 +431,12 @@ future<> server::init(net::inet_address addr, std::optional<uint16_t> port, std:
if (port) {
set_routes(_http_server._routes);
_http_server.set_content_length_limit(server::content_length_limit);
_http_server.set_content_streaming(true);
_http_server.listen(socket_address{addr, *port}).get();
_enabled_servers.push_back(std::ref(_http_server));
}
if (https_port) {
set_routes(_https_server._routes);
_https_server.set_content_length_limit(server::content_length_limit);
_https_server.set_content_streaming(true);
_https_server.set_tls_credentials(creds->build_reloadable_server_credentials([](const std::unordered_set<sstring>& files, std::exception_ptr ep) {
if (ep) {
slogger.warn("Exception loading {}: {}", files, ep);
@@ -580,7 +474,7 @@ server::json_parser::json_parser() : _run_parse_json_thread(async([this] {
return;
}
try {
_parsed_document = rjson::parse_yieldable(std::move(_raw_document));
_parsed_document = rjson::parse_yieldable(_raw_document);
_current_exception = nullptr;
} catch (...) {
_current_exception = std::current_exception();
@@ -590,12 +484,12 @@ server::json_parser::json_parser() : _run_parse_json_thread(async([this] {
})) {
}
future<rjson::value> server::json_parser::parse(chunked_content&& content) {
future<rjson::value> server::json_parser::parse(std::string_view content) {
if (content.size() < yieldable_parsing_threshold) {
return make_ready_future<rjson::value>(rjson::parse(std::move(content)));
return make_ready_future<rjson::value>(rjson::parse(content));
}
return with_semaphore(_parsing_sem, 1, [this, content = std::move(content)] () mutable {
_raw_document = std::move(content);
return with_semaphore(_parsing_sem, 1, [this, content] {
_raw_document = content;
_document_waiting.signal();
return _document_parsed.wait().then([this] {
if (_current_exception) {

View File

@@ -28,13 +28,10 @@
#include <optional>
#include "alternator/auth.hh"
#include "utils/small_vector.hh"
#include "utils/updateable_value.hh"
#include <seastar/core/units.hh>
namespace alternator {
using chunked_content = rjson::chunked_content;
class server {
static constexpr size_t content_length_limit = 16*MB;
using alternator_callback = std::function<future<executor::request_return_type>(executor&, executor::client_state&,
@@ -53,11 +50,10 @@ class server {
alternator_callbacks_map _callbacks;
semaphore* _memory_limiter;
utils::updateable_value<uint32_t> _max_concurrent_requests;
class json_parser {
static constexpr size_t yieldable_parsing_threshold = 16*KB;
chunked_content _raw_document;
std::string_view _raw_document;
rjson::value _parsed_document;
std::exception_ptr _current_exception;
semaphore _parsing_sem{1};
@@ -67,10 +63,7 @@ class server {
future<> _run_parse_json_thread;
public:
json_parser();
// Moving a chunked_content into parse() allows parse() to free each
// chunk as soon as it is parsed, so when chunks are relatively small,
// we don't need to store the sum of unparsed and parsed sizes.
future<rjson::value> parse(chunked_content&& content);
future<rjson::value> parse(std::string_view content);
future<> stop();
};
json_parser _json_parser;
@@ -79,12 +72,12 @@ public:
server(executor& executor, cql3::query_processor& qp);
future<> init(net::inet_address addr, std::optional<uint16_t> port, std::optional<uint16_t> https_port, std::optional<tls::credentials_builder> creds,
bool enforce_authorization, semaphore* memory_limiter, utils::updateable_value<uint32_t> max_concurrent_requests);
bool enforce_authorization, semaphore* memory_limiter);
future<> stop();
private:
void set_routes(seastar::httpd::routes& r);
future<> verify_signature(const seastar::httpd::request&, const chunked_content&);
future<executor::request_return_type> handle_api_request(std::unique_ptr<request> req);
future<> verify_signature(const seastar::httpd::request& r);
future<executor::request_return_type> handle_api_request(std::unique_ptr<request>&& req);
};
}

View File

@@ -97,8 +97,6 @@ stats::stats() : api_operations{} {
seastar::metrics::description("number writes that had to be bounced from this shard because of LWT requirements")),
seastar::metrics::make_total_operations("requests_blocked_memory", requests_blocked_memory,
seastar::metrics::description("Counts a number of requests blocked due to memory pressure.")),
seastar::metrics::make_total_operations("requests_shed", requests_shed,
seastar::metrics::description("Counts a number of requests shed due to overload.")),
seastar::metrics::make_total_operations("filtered_rows_read_total", cql_stats.filtered_rows_read_total,
seastar::metrics::description("number of rows read during filtering operations")),
seastar::metrics::make_total_operations("filtered_rows_matched_total", cql_stats.filtered_rows_matched_total,

View File

@@ -92,7 +92,6 @@ public:
uint64_t write_using_lwt = 0;
uint64_t shard_bounce_for_lwt = 0;
uint64_t requests_blocked_memory = 0;
uint64_t requests_shed = 0;
// CQL-derived stats
cql3::cql_stats cql_stats;
private:

View File

@@ -34,7 +34,6 @@
#include "cdc/log.hh"
#include "cdc/generation.hh"
#include "cdc/cdc_options.hh"
#include "cdc/metadata.hh"
#include "db/system_distributed_keyspace.hh"
#include "utils/UUID_gen.hh"
#include "cql3/selection/selection.hh"
@@ -471,7 +470,8 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
auto status = "DISABLED";
if (opts.enabled()) {
if (!_cdc_metadata.streams_available()) {
auto& metadata = _ss.get_cdc_metadata();
if (!metadata.streams_available()) {
status = "ENABLING";
} else {
status = "ENABLED";

View File

@@ -76,7 +76,7 @@
"items":{
"type":"message_counter"
},
"nickname":"get_replied_messages",
"nickname":"get_completed_messages",
"produces":[
"application/json"
],

View File

@@ -104,68 +104,6 @@
}
]
},
{
"path":"/storage_service/toppartitions/",
"operations":[
{
"method":"GET",
"summary":"Toppartitions query",
"type":"toppartitions_query_results",
"nickname":"toppartitions_generic",
"produces":[
"application/json"
],
"parameters":[
{
"name":"table_filters",
"description":"Optional list of table name filters in keyspace:name format",
"required":false,
"allowMultiple":false,
"type":"array",
"items":{
"type":"string"
},
"paramType":"query"
},
{
"name":"keyspace_filters",
"description":"Optional list of keyspace filters",
"required":false,
"allowMultiple":false,
"type":"array",
"items":{
"type":"string"
},
"paramType":"query"
},
{
"name":"duration",
"description":"Duration (in milliseconds) of monitoring operation",
"required":true,
"allowMultiple":false,
"type": "long",
"paramType":"query"
},
{
"name":"list_size",
"description":"number of the top partitions to list",
"required":false,
"allowMultiple":false,
"type": "long",
"paramType":"query"
},
{
"name":"capacity",
"description":"capacity of stream summary: determines amount of resources used in query processing",
"required":false,
"allowMultiple":false,
"type": "long",
"paramType":"query"
}
]
}
]
},
{
"path":"/storage_service/nodes/leaving",
"operations":[
@@ -1032,14 +970,6 @@
"type":"string",
"paramType":"query"
},
{
"name":"ignore_nodes",
"description":"Which hosts are to ignore in this repair. Multiple hosts can be listed separated by commas.",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"trace",
"description":"If the value is the string 'true' with any capitalization, enable tracing of the repair.",
@@ -1175,14 +1105,6 @@
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"ignore_nodes",
"description":"List of dead nodes to ingore in removenode operation",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}
@@ -1834,22 +1756,6 @@
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"load_and_stream",
"description":"Load the sstables and stream to all replica nodes that owns the data",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"primary_replica_only",
"description":"Load the sstables and stream to primary replica node that owns the data. Repair is needed after the load and stream process",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}
@@ -1960,14 +1866,6 @@
"allowMultiple":false,
"type":"long",
"paramType":"query"
},
{
"name":"fast",
"description":"Lightweight tracing mode: if true, slow queries tracing records only session headers",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
}
]
},
@@ -2466,10 +2364,6 @@
"threshold":{
"type":"long",
"description":"The slow query logging threshold in microseconds. Queries that takes longer, will be logged"
},
"fast":{
"type":"boolean",
"description":"Is lightweight tracing mode enabled. In that mode tracing ignore events and tracks only sessions."
}
}
},

View File

@@ -52,22 +52,6 @@
}
]
},
{
"path":"/system/drop_sstable_caches",
"operations":[
{
"method":"POST",
"summary":"Drop in-memory caches for data which is in sstables",
"type":"void",
"nickname":"drop_sstable_caches",
"produces":[
"application/json"
],
"parameters":[
]
}
]
},
{
"path":"/system/uptime_ms",
"operations":[

View File

@@ -28,7 +28,6 @@
#include <algorithm>
#include "db/system_keyspace_view_types.hh"
#include "db/data_listeners.hh"
#include "storage_service.hh"
extern logging::logger apilog;
@@ -181,7 +180,7 @@ static future<json::json_return_type> get_cf_unleveled_sstables(http_context& ct
static int64_t min_partition_size(column_family& cf) {
int64_t res = INT64_MAX;
for (auto sstables = cf.get_sstables(); auto& i : *sstables) {
for (auto i: *cf.get_sstables() ) {
res = std::min(res, i->get_stats_metadata().estimated_partition_size.min());
}
return (res == INT64_MAX) ? 0 : res;
@@ -189,7 +188,7 @@ static int64_t min_partition_size(column_family& cf) {
static int64_t max_partition_size(column_family& cf) {
int64_t res = 0;
for (auto sstables = cf.get_sstables(); auto& i : *sstables) {
for (auto i: *cf.get_sstables() ) {
res = std::max(i->get_stats_metadata().estimated_partition_size.max(), res);
}
return res;
@@ -197,7 +196,7 @@ static int64_t max_partition_size(column_family& cf) {
static integral_ratio_holder mean_partition_size(column_family& cf) {
integral_ratio_holder res;
for (auto sstables = cf.get_sstables(); auto& i : *sstables) {
for (auto i: *cf.get_sstables() ) {
auto c = i->get_stats_metadata().estimated_partition_size.count();
res.sub += i->get_stats_metadata().estimated_partition_size.mean() * c;
res.total += c;
@@ -275,7 +274,7 @@ public:
static double get_compression_ratio(column_family& cf) {
sum_ratio<double> result;
for (auto sstables = cf.get_sstables(); auto& i : *sstables) {
for (auto i : *cf.get_sstables()) {
auto compression_ratio = i->get_compression_ratio();
if (compression_ratio != sstables::metadata_collector::NO_COMPRESSION_RATIO) {
result(compression_ratio);
@@ -312,7 +311,7 @@ void set_column_family(http_context& ctx, routes& r) {
});
cf::get_column_family.set(r, [&ctx] (std::unique_ptr<request> req){
std::list<cf::column_family_info> res;
vector<cf::column_family_info> res;
for (auto i: ctx.db.local().get_column_families_mapping()) {
cf::column_family_info info;
info.ks = i.first.first;
@@ -320,7 +319,7 @@ void set_column_family(http_context& ctx, routes& r) {
info.type = "ColumnFamilies";
res.push_back(info);
}
return make_ready_future<json::json_return_type>(json::stream_range_as_array(std::move(res), std::identity()));
return make_ready_future<json::json_return_type>(json::stream_object(std::move(res)));
});
cf::get_column_family_name_keyspace.set(r, [&ctx] (const_req req){
@@ -425,7 +424,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_estimated_row_size_histogram.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], utils::estimated_histogram(0), [](column_family& cf) {
utils::estimated_histogram res(0);
for (auto sstables = cf.get_sstables(); auto& i : *sstables) {
for (auto i: *cf.get_sstables() ) {
res.merge(i->get_stats_metadata().estimated_partition_size);
}
return res;
@@ -437,7 +436,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_estimated_row_count.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], int64_t(0), [](column_family& cf) {
uint64_t res = 0;
for (auto sstables = cf.get_sstables(); auto& i : *sstables) {
for (auto i: *cf.get_sstables() ) {
res += i->get_stats_metadata().estimated_partition_size.count();
}
return res;
@@ -448,7 +447,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_estimated_column_count_histogram.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], utils::estimated_histogram(0), [](column_family& cf) {
utils::estimated_histogram res(0);
for (auto sstables = cf.get_sstables(); auto& i : *sstables) {
for (auto i: *cf.get_sstables() ) {
res.merge(i->get_stats_metadata().estimated_cells_count);
}
return res;
@@ -600,8 +599,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_bloom_filter_false_positives.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->filter_get_false_positive();
});
}, std::plus<uint64_t>());
@@ -609,8 +607,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_all_bloom_filter_false_positives.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->filter_get_false_positive();
});
}, std::plus<uint64_t>());
@@ -618,8 +615,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_recent_bloom_filter_false_positives.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->filter_get_recent_false_positive();
});
}, std::plus<uint64_t>());
@@ -627,8 +623,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_all_recent_bloom_filter_false_positives.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->filter_get_recent_false_positive();
});
}, std::plus<uint64_t>());
@@ -660,8 +655,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_bloom_filter_disk_space_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->filter_size();
});
}, std::plus<uint64_t>());
@@ -669,8 +663,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_all_bloom_filter_disk_space_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->filter_size();
});
}, std::plus<uint64_t>());
@@ -678,8 +671,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_bloom_filter_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->filter_memory_size();
});
}, std::plus<uint64_t>());
@@ -687,8 +679,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_all_bloom_filter_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->filter_memory_size();
});
}, std::plus<uint64_t>());
@@ -696,8 +687,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_index_summary_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->get_summary().memory_footprint();
});
}, std::plus<uint64_t>());
@@ -705,8 +695,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_all_index_summary_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->get_summary().memory_footprint();
});
}, std::plus<uint64_t>());
@@ -984,20 +973,45 @@ void set_column_family(http_context& ctx, routes& r) {
});
});
cf::toppartitions.set(r, [&ctx] (std::unique_ptr<request> req) {
auto name = req->param["name"];
auto [ks, cf] = parse_fully_qualified_cf_name(name);
auto name_param = req->param["name"];
auto [ks, cf] = parse_fully_qualified_cf_name(name_param);
api::req_param<std::chrono::milliseconds, unsigned> duration{*req, "duration", 1000ms};
api::req_param<unsigned> capacity(*req, "capacity", 256);
api::req_param<unsigned> list_size(*req, "list_size", 10);
apilog.info("toppartitions query: name={} duration={} list_size={} capacity={}",
name, duration.param, list_size.param, capacity.param);
name_param, duration.param, list_size.param, capacity.param);
return seastar::do_with(db::toppartitions_query(ctx.db, {{ks, cf}}, {}, duration.value, list_size, capacity), [&ctx] (db::toppartitions_query& q) {
return run_toppartitions_query(q, ctx, true);
return seastar::do_with(db::toppartitions_query(ctx.db, ks, cf, duration.value, list_size, capacity), [&ctx](auto& q) {
return q.scatter().then([&q] {
return sleep(q.duration()).then([&q] {
return q.gather(q.capacity()).then([&q] (auto topk_results) {
apilog.debug("toppartitions query: processing results");
cf::toppartitions_query_results results;
results.read_cardinality = topk_results.read.size();
results.write_cardinality = topk_results.write.size();
for (auto& d: topk_results.read.top(q.list_size())) {
cf::toppartitions_record r;
r.partition = sstring(d.item);
r.count = d.count;
r.error = d.error;
results.read.push(r);
}
for (auto& d: topk_results.write.top(q.list_size())) {
cf::toppartitions_record r;
r.partition = sstring(d.item);
r.count = d.count;
r.error = d.error;
results.write.push(r);
}
return make_ready_future<json::json_return_type>(results);
});
});
});
});
});

View File

@@ -116,7 +116,4 @@ future<json::json_return_type> get_cf_stats(http_context& ctx, const sstring& n
future<json::json_return_type> get_cf_stats(http_context& ctx,
int64_t column_family_stats::*f);
std::tuple<sstring, sstring> parse_fully_qualified_cf_name(sstring name);
}

View File

@@ -58,7 +58,6 @@ void set_compaction_manager(http_context& ctx, routes& r) {
for (const auto& c : cm.get_compactions()) {
cm::summary s;
s.id = c->compaction_uuid.to_sstring();
s.ks = c->ks_name;
s.cf = c->cf_name;
s.unit = "keys";

View File

@@ -26,7 +26,6 @@
#include <seastar/http/exception.hh>
#include "utils/logalloc.hh"
#include "log.hh"
#include "database.hh"
namespace api {

View File

@@ -96,10 +96,6 @@ void set_messaging_service(http_context& ctx, routes& r, sharded<netw::messaging
return c.get_stats().sent_messages;
}));
get_replied_messages.set(r, get_client_getter(ms, [](const shard_info& c) {
return c.get_stats().replied;
}));
get_dropped_messages.set(r, get_client_getter(ms, [](const shard_info& c) {
// We don't have the same drop message mechanism
// as origin has.
@@ -159,7 +155,6 @@ void set_messaging_service(http_context& ctx, routes& r, sharded<netw::messaging
void unset_messaging_service(http_context& ctx, routes& r) {
get_timeout_messages.unset(r);
get_sent_messages.unset(r);
get_replied_messages.unset(r);
get_dropped_messages.unset(r);
get_exception_messages.unset(r);
get_pending_messages.unset(r);

View File

@@ -23,13 +23,10 @@
#include "api/api-doc/storage_service.json.hh"
#include "db/config.hh"
#include "db/schema_tables.hh"
#include "utils/hash.hh"
#include <sstream>
#include <optional>
#include <time.h>
#include <boost/range/adaptor/map.hpp>
#include <boost/range/adaptor/filtered.hpp>
#include <boost/algorithm/string/trim_all.hpp>
#include <boost/functional/hash.hpp>
#include "service/storage_service.hh"
#include "service/load_meter.hh"
#include "db/commitlog/commitlog.hh"
@@ -49,9 +46,6 @@
#include "transport/controller.hh"
#include "thrift/controller.hh"
#include "locator/token_metadata.hh"
#include "cdc/generation_service.hh"
extern logging::logger apilog;
namespace api {
@@ -100,37 +94,6 @@ static auto wrap_ks_cf(http_context &ctx, ks_cf_func f) {
};
}
seastar::future<json::json_return_type> run_toppartitions_query(db::toppartitions_query& q, http_context &ctx, bool legacy_request) {
namespace cf = httpd::column_family_json;
return q.scatter().then([&q, legacy_request] {
return sleep(q.duration()).then([&q, legacy_request] {
return q.gather(q.capacity()).then([&q, legacy_request] (auto topk_results) {
apilog.debug("toppartitions query: processing results");
cf::toppartitions_query_results results;
results.read_cardinality = topk_results.read.size();
results.write_cardinality = topk_results.write.size();
for (auto& d: topk_results.read.top(q.list_size())) {
cf::toppartitions_record r;
r.partition = (legacy_request ? "" : "(" + d.item.schema->ks_name() + ":" + d.item.schema->cf_name() + ") ") + sstring(d.item);
r.count = d.count;
r.error = d.error;
results.read.push(r);
}
for (auto& d: topk_results.write.top(q.list_size())) {
cf::toppartitions_record r;
r.partition = (legacy_request ? "" : "(" + d.item.schema->ks_name() + ":" + d.item.schema->cf_name() + ") ") + sstring(d.item);
r.count = d.count;
r.error = d.error;
results.write.push(r);
}
return make_ready_future<json::json_return_type>(results);
});
});
});
}
future<json::json_return_type> set_tables_autocompaction(http_context& ctx, const sstring &keyspace, std::vector<sstring> tables, bool enabled) {
if (tables.empty()) {
tables = map_keys(ctx.db.local().find_keyspace(keyspace).metadata().get()->cf_meta_data());
@@ -196,7 +159,7 @@ void unset_rpc_controller(http_context& ctx, routes& r) {
void set_repair(http_context& ctx, routes& r, sharded<netw::messaging_service>& ms) {
ss::repair_async.set(r, [&ctx, &ms](std::unique_ptr<request> req) {
static std::vector<sstring> options = {"primaryRange", "parallelism", "incremental",
"jobThreads", "ranges", "columnFamilies", "dataCenters", "hosts", "ignore_nodes", "trace",
"jobThreads", "ranges", "columnFamilies", "dataCenters", "hosts", "trace",
"startToken", "endToken" };
std::unordered_map<sstring, sstring> options_map;
for (auto o : options) {
@@ -324,56 +287,6 @@ void set_storage_service(http_context& ctx, routes& r) {
}));
});
ss::toppartitions_generic.set(r, [&ctx] (std::unique_ptr<request> req) {
bool filters_provided = false;
std::unordered_set<std::tuple<sstring, sstring>, utils::tuple_hash> table_filters {};
if (req->query_parameters.contains("table_filters")) {
filters_provided = true;
auto filters = req->get_query_param("table_filters");
std::stringstream ss { filters };
std::string filter;
while (!filters.empty() && ss.good()) {
std::getline(ss, filter, ',');
table_filters.emplace(parse_fully_qualified_cf_name(filter));
}
}
std::unordered_set<sstring> keyspace_filters {};
if (req->query_parameters.contains("keyspace_filters")) {
filters_provided = true;
auto filters = req->get_query_param("keyspace_filters");
std::stringstream ss { filters };
std::string filter;
while (!filters.empty() && ss.good()) {
std::getline(ss, filter, ',');
keyspace_filters.emplace(std::move(filter));
}
}
// when the query is empty return immediately
if (filters_provided && table_filters.empty() && keyspace_filters.empty()) {
apilog.debug("toppartitions query: processing results");
httpd::column_family_json::toppartitions_query_results results;
results.read_cardinality = 0;
results.write_cardinality = 0;
return make_ready_future<json::json_return_type>(results);
}
api::req_param<std::chrono::milliseconds, unsigned> duration{*req, "duration", 1000ms};
api::req_param<unsigned> capacity(*req, "capacity", 256);
api::req_param<unsigned> list_size(*req, "list_size", 10);
apilog.info("toppartitions query: #table_filters={} #keyspace_filters={} duration={} list_size={} capacity={}",
!table_filters.empty() ? std::to_string(table_filters.size()) : "all", !keyspace_filters.empty() ? std::to_string(keyspace_filters.size()) : "all", duration.param, list_size.param, capacity.param);
return seastar::do_with(db::toppartitions_query(ctx.db, std::move(table_filters), std::move(keyspace_filters), duration.value, list_size, capacity), [&ctx] (db::toppartitions_query& q) {
return run_toppartitions_query(q, ctx);
});
});
ss::get_leaving_nodes.set(r, [&ctx](const_req req) {
return container_to_vec(ctx.get_token_metadata().get_leaving_endpoints());
});
@@ -487,7 +400,7 @@ void set_storage_service(http_context& ctx, routes& r) {
});
ss::cdc_streams_check_and_repair.set(r, [&ctx] (std::unique_ptr<request> req) {
return service::get_local_storage_service().get_cdc_generation_service().check_and_repair_cdc_streams().then([] {
return service::get_local_storage_service().check_and_repair_cdc_streams().then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
@@ -583,22 +496,7 @@ void set_storage_service(http_context& ctx, routes& r) {
ss::remove_node.set(r, [](std::unique_ptr<request> req) {
auto host_id = req->get_query_param("host_id");
std::vector<sstring> ignore_nodes_strs= split(req->get_query_param("ignore_nodes"), ",");
auto ignore_nodes = std::list<gms::inet_address>();
for (std::string n : ignore_nodes_strs) {
try {
std::replace(n.begin(), n.end(), '\"', ' ');
std::replace(n.begin(), n.end(), '\'', ' ');
boost::trim_all(n);
if (!n.empty()) {
auto node = gms::inet_address(n);
ignore_nodes.push_back(node);
}
} catch (...) {
throw std::runtime_error(format("Failed to parse ignore_nodes parameter: ignore_nodes={}, node={}", ignore_nodes_strs, n));
}
}
return service::get_local_storage_service().removenode(host_id, std::move(ignore_nodes)).then([] {
return service::get_local_storage_service().removenode(host_id).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
@@ -818,19 +716,11 @@ void set_storage_service(http_context& ctx, routes& r) {
ss::load_new_ss_tables.set(r, [&ctx](std::unique_ptr<request> req) {
auto ks = validate_keyspace(ctx, req->param);
auto cf = req->get_query_param("cf");
auto stream = req->get_query_param("load_and_stream");
auto primary_replica = req->get_query_param("primary_replica_only");
boost::algorithm::to_lower(stream);
boost::algorithm::to_lower(primary_replica);
bool load_and_stream = stream == "true" || stream == "1";
bool primary_replica_only = primary_replica == "true" || primary_replica == "1";
// No need to add the keyspace, since all we want is to avoid always sending this to the same
// CPU. Even then I am being overzealous here. This is not something that happens all the time.
auto coordinator = std::hash<sstring>()(cf) % smp::count;
return service::get_storage_service().invoke_on(coordinator,
[ks = std::move(ks), cf = std::move(cf),
load_and_stream, primary_replica_only] (service::storage_service& s) {
return s.load_new_sstables(ks, cf, load_and_stream, primary_replica_only);
return service::get_storage_service().invoke_on(coordinator, [ks = std::move(ks), cf = std::move(cf)] (service::storage_service& s) {
return s.load_new_sstables(ks, cf);
}).then_wrapped([] (auto&& f) {
if (f.failed()) {
auto msg = fmt::format("Failed to load new sstables: {}", f.get_exception());
@@ -886,7 +776,6 @@ void set_storage_service(http_context& ctx, routes& r) {
res.enable = tracing::tracing::get_local_tracing_instance().slow_query_tracing_enabled();
res.ttl = tracing::tracing::get_local_tracing_instance().slow_query_record_ttl().count() ;
res.threshold = tracing::tracing::get_local_tracing_instance().slow_query_threshold().count();
res.fast = tracing::tracing::get_local_tracing_instance().ignore_trace_events_enabled();
return res;
});
@@ -894,9 +783,8 @@ void set_storage_service(http_context& ctx, routes& r) {
auto enable = req->get_query_param("enable");
auto ttl = req->get_query_param("ttl");
auto threshold = req->get_query_param("threshold");
auto fast = req->get_query_param("fast");
try {
return tracing::tracing::tracing_instance().invoke_on_all([enable, ttl, threshold, fast] (auto& local_tracing) {
return tracing::tracing::tracing_instance().invoke_on_all([enable, ttl, threshold] (auto& local_tracing) {
if (threshold != "") {
local_tracing.set_slow_query_threshold(std::chrono::microseconds(std::stol(threshold.c_str())));
}
@@ -906,9 +794,6 @@ void set_storage_service(http_context& ctx, routes& r) {
if (enable != "") {
local_tracing.set_slow_query_enabled(strcasecmp(enable.c_str(), "true") == 0);
}
if (fast != "") {
local_tracing.set_ignore_trace_events(strcasecmp(fast.c_str(), "true") == 0);
}
}).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
@@ -1078,7 +963,7 @@ void set_storage_service(http_context& ctx, routes& r) {
tst.keyspace = schema->ks_name();
tst.table = schema->cf_name();
for (auto sstables = t->get_sstables_including_compacted_undeleted(); auto sstable : *sstables) {
for (auto sstable : *t->get_sstables_including_compacted_undeleted()) {
auto ts = db_clock::to_time_t(sstable->data_file_write_time());
::tm t;
::gmtime_r(&ts, &t);

View File

@@ -23,7 +23,6 @@
#include <seastar/core/sharded.hh>
#include "api.hh"
#include "db/data_listeners.hh"
namespace cql_transport { class controller; }
class thrift_controller;
@@ -41,6 +40,5 @@ void set_rpc_controller(http_context& ctx, routes& r, thrift_controller& ctl);
void unset_rpc_controller(http_context& ctx, routes& r);
void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_ctl);
void unset_snapshot(http_context& ctx, routes& r);
seastar::future<json::json_return_type> run_toppartitions_query(db::toppartitions_query& q, http_context &ctx, bool legacy_request = false);
}

View File

@@ -25,9 +25,6 @@
#include <seastar/core/reactor.hh>
#include <seastar/http/exception.hh>
#include "log.hh"
#include "database.hh"
extern logging::logger apilog;
namespace api {
@@ -73,16 +70,6 @@ void set_system(http_context& ctx, routes& r) {
}
return json::json_void();
});
hs::drop_sstable_caches.set(r, [&ctx](std::unique_ptr<request> req) {
apilog.info("Dropping sstable caches");
return ctx.db.invoke_on_all([] (database& db) {
return db.drop_caches();
}).then([] {
apilog.info("Caches dropped");
return json::json_return_type(json::json_void());
});
});
}
}

View File

@@ -24,130 +24,142 @@
#include "counters.hh"
#include "types.hh"
/// LSA mirator for cells with irrelevant type
///
///
const data::type_imr_descriptor& no_type_imr_descriptor() {
static thread_local data::type_imr_descriptor state(data::type_info::make_variable_size());
return state;
}
atomic_cell atomic_cell::make_dead(api::timestamp_type timestamp, gc_clock::time_point deletion_time) {
return atomic_cell_type::make_dead(timestamp, deletion_time);
auto& imr_data = no_type_imr_descriptor();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_dead(timestamp, deletion_time), &imr_data.lsa_migrator())
);
}
atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, bytes_view value, atomic_cell::collection_member cm) {
return atomic_cell_type::make_live(timestamp, single_fragment_range(value));
}
atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, managed_bytes_view value, atomic_cell::collection_member cm) {
return atomic_cell_type::make_live(timestamp, fragment_range(value));
auto& imr_data = type.imr_state();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live(imr_data.type_info(), timestamp, value, bool(cm)), &imr_data.lsa_migrator())
);
}
atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, ser::buffer_view<bytes_ostream::fragment_iterator> value, atomic_cell::collection_member cm) {
return atomic_cell_type::make_live(timestamp, value);
auto& imr_data = type.imr_state();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live(imr_data.type_info(), timestamp, value, bool(cm)), &imr_data.lsa_migrator())
);
}
atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, const fragmented_temporary_buffer::view& value, collection_member cm)
{
return atomic_cell_type::make_live(timestamp, value);
auto& imr_data = type.imr_state();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live(imr_data.type_info(), timestamp, value, bool(cm)), &imr_data.lsa_migrator())
);
}
atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, bytes_view value,
gc_clock::time_point expiry, gc_clock::duration ttl, atomic_cell::collection_member cm) {
return atomic_cell_type::make_live(timestamp, single_fragment_range(value), expiry, ttl);
}
atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, managed_bytes_view value,
gc_clock::time_point expiry, gc_clock::duration ttl, atomic_cell::collection_member cm) {
return atomic_cell_type::make_live(timestamp, fragment_range(value), expiry, ttl);
auto& imr_data = type.imr_state();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live(imr_data.type_info(), timestamp, value, expiry, ttl, bool(cm)), &imr_data.lsa_migrator())
);
}
atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, ser::buffer_view<bytes_ostream::fragment_iterator> value,
gc_clock::time_point expiry, gc_clock::duration ttl, atomic_cell::collection_member cm) {
return atomic_cell_type::make_live(timestamp, value, expiry, ttl);
auto& imr_data = type.imr_state();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live(imr_data.type_info(), timestamp, value, expiry, ttl, bool(cm)), &imr_data.lsa_migrator())
);
}
atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, const fragmented_temporary_buffer::view& value,
gc_clock::time_point expiry, gc_clock::duration ttl, collection_member cm)
{
return atomic_cell_type::make_live(timestamp, value, expiry, ttl);
auto& imr_data = type.imr_state();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live(imr_data.type_info(), timestamp, value, expiry, ttl, bool(cm)), &imr_data.lsa_migrator())
);
}
atomic_cell atomic_cell::make_live_counter_update(api::timestamp_type timestamp, int64_t value) {
return atomic_cell_type::make_live_counter_update(timestamp, value);
auto& imr_data = no_type_imr_descriptor();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live_counter_update(timestamp, value), &imr_data.lsa_migrator())
);
}
atomic_cell atomic_cell::make_live_uninitialized(const abstract_type& type, api::timestamp_type timestamp, size_t size) {
return atomic_cell_type::make_live_uninitialized(timestamp, size);
auto& imr_data = no_type_imr_descriptor();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live_uninitialized(imr_data.type_info(), timestamp, size), &imr_data.lsa_migrator())
);
}
static imr::utils::object<data::cell::structure> copy_cell(const data::type_imr_descriptor& imr_data, const uint8_t* ptr)
{
using imr_object_type = imr::utils::object<data::cell::structure>;
// If the cell doesn't own any memory it is trivial and can be copied with
// memcpy.
auto f = data::cell::structure::get_member<data::cell::tags::flags>(ptr);
if (!f.template get<data::cell::tags::external_data>()) {
data::cell::context ctx(f, imr_data.type_info());
// XXX: We may be better off storing the total cell size in memory. Measure!
auto size = data::cell::structure::serialized_object_size(ptr, ctx);
return imr_object_type::make_raw(size, [&] (uint8_t* dst) noexcept {
std::copy_n(ptr, size, dst);
}, &imr_data.lsa_migrator());
}
return imr_object_type::make(data::cell::copy_fn(imr_data.type_info(), ptr), &imr_data.lsa_migrator());
}
atomic_cell::atomic_cell(const abstract_type& type, atomic_cell_view other)
: _data(other._view) {
set_view(_data);
}
// Based on:
// - org.apache.cassandra.db.AbstractCell#reconcile()
// - org.apache.cassandra.db.BufferExpiringCell#reconcile()
// - org.apache.cassandra.db.BufferDeletedCell#reconcile()
int
compare_atomic_cell_for_merge(atomic_cell_view left, atomic_cell_view right) {
if (left.timestamp() != right.timestamp()) {
return left.timestamp() > right.timestamp() ? 1 : -1;
}
if (left.is_live() != right.is_live()) {
return left.is_live() ? -1 : 1;
}
if (left.is_live()) {
auto c = compare_unsigned(left.value(), right.value());
if (c != 0) {
return c;
}
if (left.is_live_and_has_ttl() != right.is_live_and_has_ttl()) {
// prefer expiring cells.
return left.is_live_and_has_ttl() ? 1 : -1;
}
if (left.is_live_and_has_ttl()) {
if (left.expiry() != right.expiry()) {
return left.expiry() < right.expiry() ? -1 : 1;
} else {
// prefer the cell that was written later,
// so it survives longer after it expires, until purged.
if (left.ttl() != right.ttl()) {
return left.ttl() < right.ttl() ? 1 : -1;
} else {
return 0;
}
}
}
} else {
// Both are deleted
if (left.deletion_time() != right.deletion_time()) {
// Origin compares big-endian serialized deletion time. That's because it
// delegates to AbstractCell.reconcile() which compares values after
// comparing timestamps, which in case of deleted cells will hold
// serialized expiry.
return (uint64_t) left.deletion_time().time_since_epoch().count()
< (uint64_t) right.deletion_time().time_since_epoch().count() ? -1 : 1;
}
}
return 0;
}
: atomic_cell(type.imr_state().type_info(),
copy_cell(type.imr_state(), other._view.raw_pointer()))
{ }
atomic_cell_or_collection atomic_cell_or_collection::copy(const abstract_type& type) const {
if (_data.empty()) {
if (!_data.get()) {
return atomic_cell_or_collection();
}
return atomic_cell_or_collection(managed_bytes(_data));
auto& imr_data = type.imr_state();
return atomic_cell_or_collection(
copy_cell(imr_data, _data.get())
);
}
atomic_cell_or_collection::atomic_cell_or_collection(const abstract_type& type, atomic_cell_view acv)
: _data(acv._view)
: _data(copy_cell(type.imr_state(), acv._view.raw_pointer()))
{
}
bool atomic_cell_or_collection::equals(const abstract_type& type, const atomic_cell_or_collection& other) const
{
if (_data.empty() || other._data.empty()) {
return _data.empty() && other._data.empty();
auto ptr_a = _data.get();
auto ptr_b = other._data.get();
if (!ptr_a || !ptr_b) {
return !ptr_a && !ptr_b;
}
if (type.is_atomic()) {
auto a = atomic_cell_view::from_bytes(type, _data);
auto b = atomic_cell_view::from_bytes(type, other._data);
auto a = atomic_cell_view::from_bytes(type.imr_state().type_info(), _data);
auto b = atomic_cell_view::from_bytes(type.imr_state().type_info(), other._data);
if (a.timestamp() != b.timestamp()) {
return false;
}
@@ -179,7 +191,28 @@ bool atomic_cell_or_collection::equals(const abstract_type& type, const atomic_c
size_t atomic_cell_or_collection::external_memory_usage(const abstract_type& t) const
{
return _data.external_memory_usage();
if (!_data.get()) {
return 0;
}
auto ctx = data::cell::context(_data.get(), t.imr_state().type_info());
auto view = data::cell::structure::make_view(_data.get(), ctx);
auto flags = view.get<data::cell::tags::flags>();
size_t external_value_size = 0;
if (flags.get<data::cell::tags::external_data>()) {
if (flags.get<data::cell::tags::collection>()) {
external_value_size = as_collection_mutation().data.size_bytes();
} else {
auto cell_view = data::cell::atomic_cell_view(t.imr_state().type_info(), view);
external_value_size = cell_view.value_size();
}
// Add overhead of chunk headers. The last one is a special case.
external_value_size += (external_value_size - 1) / data::cell::effective_external_chunk_length * data::cell::external_chunk_overhead;
external_value_size += data::cell::external_last_chunk_overhead;
}
return data::cell::structure::serialized_object_size(_data.get(), ctx)
+ imr_object_type::size_overhead + external_value_size;
}
std::ostream&
@@ -188,7 +221,7 @@ operator<<(std::ostream& os, const atomic_cell_view& acv) {
return fmt_print(os, "atomic_cell{{{},ts={:d},expiry={:d},ttl={:d}}}",
acv.is_counter_update()
? "counter_update_value=" + to_sstring(acv.counter_update_value())
: to_hex(to_bytes(acv.value())),
: to_hex(acv.value().linearize()),
acv.timestamp(),
acv.is_live_and_has_ttl() ? acv.expiry().time_since_epoch().count() : -1,
acv.is_live_and_has_ttl() ? acv.ttl().count() : 0);
@@ -214,11 +247,12 @@ operator<<(std::ostream& os, const atomic_cell_view::printer& acvp) {
cell_value_string_builder << "counter_update_value=" << acv.counter_update_value();
} else {
cell_value_string_builder << "shards: ";
auto ccv = counter_cell_view(acv);
cell_value_string_builder << ::join(", ", ccv.shards());
counter_cell_view::with_linearized(acv, [&cell_value_string_builder] (counter_cell_view& ccv) {
cell_value_string_builder << ::join(", ", ccv.shards());
});
}
} else {
cell_value_string_builder << type.to_string(to_bytes(acv.value()));
cell_value_string_builder << type.to_string(acv.value().linearize());
}
return fmt_print(os, "atomic_cell{{{},ts={:d},expiry={:d},ttl={:d}}}",
cell_value_string_builder.str(),
@@ -237,11 +271,12 @@ operator<<(std::ostream& os, const atomic_cell::printer& acp) {
}
std::ostream& operator<<(std::ostream& os, const atomic_cell_or_collection::printer& p) {
if (p._cell._data.empty()) {
if (!p._cell._data.get()) {
return os << "{ null atomic_cell_or_collection }";
}
using dc = data::cell;
os << "{ ";
if (p._cdef.type->is_multi_cell()) {
if (dc::structure::get_member<dc::tags::flags>(p._cell._data.get()).get<dc::tags::collection>()) {
os << "collection ";
auto cmv = p._cell.as_collection_mutation();
os << collection_mutation_view::printer(*p._cdef.type, cmv);

View File

@@ -26,12 +26,12 @@
#include "tombstone.hh"
#include "gc_clock.hh"
#include "utils/managed_bytes.hh"
#include "utils/fragment_range.hh"
#include <seastar/net//byteorder.hh>
#include <seastar/util/bool_class.hh>
#include <cstdint>
#include <iosfwd>
#include <concepts>
#include "data/cell.hh"
#include "data/schema_info.hh"
#include "imr/utils.hh"
#include "utils/fragmented_temporary_buffer.hh"
#include "serializer.hh"
@@ -40,191 +40,41 @@ class abstract_type;
class collection_type_impl;
class atomic_cell_or_collection;
using atomic_cell_value = managed_bytes;
template <mutable_view is_mutable>
using atomic_cell_value_basic_view = managed_bytes_basic_view<is_mutable>;
using atomic_cell_value_view = atomic_cell_value_basic_view<mutable_view::no>;
using atomic_cell_value_mutable_view = atomic_cell_value_basic_view<mutable_view::yes>;
template <typename T>
requires std::is_trivial_v<T>
static void set_field(atomic_cell_value_mutable_view& out, unsigned offset, T val) {
auto out_view = managed_bytes_mutable_view(out);
out_view.remove_prefix(offset);
write<T>(out_view, val);
}
template <typename T>
requires std::is_trivial_v<T>
static void set_field(atomic_cell_value& out, unsigned offset, T val) {
auto out_view = atomic_cell_value_mutable_view(out);
set_field(out_view, offset, val);
}
template <FragmentRange Buffer>
static void set_value(managed_bytes& b, unsigned value_offset, const Buffer& value) {
auto v = managed_bytes_mutable_view(b).substr(value_offset, value.size_bytes());
for (auto frag : value) {
write_fragmented(v, single_fragmented_view(frag));
}
}
template <typename T, FragmentedView Input>
requires std::is_trivial_v<T>
static T get_field(Input in, unsigned offset = 0) {
in.remove_prefix(offset);
return read_simple<T>(in);
}
/*
* Represents atomic cell layout. Works on serialized form.
*
* Layout:
*
* <live> := <int8_t:flags><int64_t:timestamp>(<int64_t:expiry><int32_t:ttl>)?<value>
* <dead> := <int8_t: 0><int64_t:timestamp><int64_t:deletion_time>
*/
class atomic_cell_type final {
private:
static constexpr int8_t LIVE_FLAG = 0x01;
static constexpr int8_t EXPIRY_FLAG = 0x02; // When present, expiry field is present. Set only for live cells
static constexpr int8_t COUNTER_UPDATE_FLAG = 0x08; // Cell is a counter update.
static constexpr unsigned flags_size = 1;
static constexpr unsigned timestamp_offset = flags_size;
static constexpr unsigned timestamp_size = 8;
static constexpr unsigned expiry_offset = timestamp_offset + timestamp_size;
static constexpr unsigned expiry_size = 8;
static constexpr unsigned deletion_time_offset = timestamp_offset + timestamp_size;
static constexpr unsigned deletion_time_size = 8;
static constexpr unsigned ttl_offset = expiry_offset + expiry_size;
static constexpr unsigned ttl_size = 4;
friend class counter_cell_builder;
private:
static bool is_counter_update(atomic_cell_value_view cell) {
return cell.front() & COUNTER_UPDATE_FLAG;
}
static bool is_live(atomic_cell_value_view cell) {
return cell.front() & LIVE_FLAG;
}
static bool is_live_and_has_ttl(atomic_cell_value_view cell) {
return cell.front() & EXPIRY_FLAG;
}
static bool is_dead(atomic_cell_value_view cell) {
return !is_live(cell);
}
// Can be called on live and dead cells
static api::timestamp_type timestamp(atomic_cell_value_view cell) {
return get_field<api::timestamp_type>(cell, timestamp_offset);
}
static void set_timestamp(atomic_cell_value_mutable_view& cell, api::timestamp_type ts) {
set_field(cell, timestamp_offset, ts);
}
// Can be called on live cells only
private:
template <mutable_view is_mutable>
static managed_bytes_basic_view<is_mutable> do_get_value(managed_bytes_basic_view<is_mutable> cell) {
auto expiry_field_size = bool(cell.front() & EXPIRY_FLAG) * (expiry_size + ttl_size);
auto value_offset = flags_size + timestamp_size + expiry_field_size;
cell.remove_prefix(value_offset);
return cell;
}
public:
static atomic_cell_value_view value(managed_bytes_view cell) {
return do_get_value(cell);
}
static atomic_cell_value_mutable_view value(managed_bytes_mutable_view cell) {
return do_get_value(cell);
}
// Can be called on live counter update cells only
static int64_t counter_update_value(atomic_cell_value_view cell) {
return get_field<int64_t>(cell, flags_size + timestamp_size);
}
// Can be called only when is_dead() is true.
static gc_clock::time_point deletion_time(atomic_cell_value_view cell) {
assert(is_dead(cell));
return gc_clock::time_point(gc_clock::duration(get_field<int64_t>(cell, deletion_time_offset)));
}
// Can be called only when is_live_and_has_ttl() is true.
static gc_clock::time_point expiry(atomic_cell_value_view cell) {
assert(is_live_and_has_ttl(cell));
auto expiry = get_field<int64_t>(cell, expiry_offset);
return gc_clock::time_point(gc_clock::duration(expiry));
}
// Can be called only when is_live_and_has_ttl() is true.
static gc_clock::duration ttl(atomic_cell_value_view cell) {
assert(is_live_and_has_ttl(cell));
return gc_clock::duration(get_field<int32_t>(cell, ttl_offset));
}
static managed_bytes make_dead(api::timestamp_type timestamp, gc_clock::time_point deletion_time) {
managed_bytes b(managed_bytes::initialized_later(), flags_size + timestamp_size + deletion_time_size);
b[0] = 0;
set_field(b, timestamp_offset, timestamp);
set_field(b, deletion_time_offset, static_cast<int64_t>(deletion_time.time_since_epoch().count()));
return b;
}
template <FragmentRange Buffer>
static managed_bytes make_live(api::timestamp_type timestamp, const Buffer& value) {
auto value_offset = flags_size + timestamp_size;
managed_bytes b(managed_bytes::initialized_later(), value_offset + value.size_bytes());
b[0] = LIVE_FLAG;
set_field(b, timestamp_offset, timestamp);
set_value(b, value_offset, value);
return b;
}
static managed_bytes make_live_counter_update(api::timestamp_type timestamp, int64_t value) {
auto value_offset = flags_size + timestamp_size;
managed_bytes b(managed_bytes::initialized_later(), value_offset + sizeof(value));
b[0] = LIVE_FLAG | COUNTER_UPDATE_FLAG;
set_field(b, timestamp_offset, timestamp);
set_field(b, value_offset, value);
return b;
}
template <FragmentRange Buffer>
static managed_bytes make_live(api::timestamp_type timestamp, const Buffer& value, gc_clock::time_point expiry, gc_clock::duration ttl) {
auto value_offset = flags_size + timestamp_size + expiry_size + ttl_size;
managed_bytes b(managed_bytes::initialized_later(), value_offset + value.size_bytes());
b[0] = EXPIRY_FLAG | LIVE_FLAG;
set_field(b, timestamp_offset, timestamp);
set_field(b, expiry_offset, static_cast<int64_t>(expiry.time_since_epoch().count()));
set_field(b, ttl_offset, static_cast<int32_t>(ttl.count()));
set_value(b, value_offset, value);
return b;
}
static managed_bytes make_live_uninitialized(api::timestamp_type timestamp, size_t size) {
auto value_offset = flags_size + timestamp_size;
managed_bytes b(managed_bytes::initialized_later(), value_offset + size);
b[0] = LIVE_FLAG;
set_field(b, timestamp_offset, timestamp);
return b;
}
template <mutable_view is_mutable>
friend class basic_atomic_cell_view;
friend class atomic_cell;
};
using atomic_cell_value_view = data::value_view;
using atomic_cell_value_mutable_view = data::value_mutable_view;
/// View of an atomic cell
template<mutable_view is_mutable>
class basic_atomic_cell_view {
protected:
managed_bytes_basic_view<is_mutable> _view;
friend class atomic_cell;
data::cell::basic_atomic_cell_view<is_mutable> _view;
friend class atomic_cell;
public:
using pointer_type = std::conditional_t<is_mutable == mutable_view::no, const uint8_t*, uint8_t*>;
protected:
void set_view(managed_bytes_basic_view<is_mutable> v) {
_view = v;
}
basic_atomic_cell_view() = default;
explicit basic_atomic_cell_view(managed_bytes_basic_view<is_mutable> v) : _view(std::move(v)) { }
explicit basic_atomic_cell_view(data::cell::basic_atomic_cell_view<is_mutable> v)
: _view(std::move(v)) { }
basic_atomic_cell_view(const data::type_info& ti, pointer_type ptr)
: _view(data::cell::make_atomic_cell_view(ti, ptr))
{ }
friend class atomic_cell_or_collection;
public:
operator basic_atomic_cell_view<mutable_view::no>() const noexcept {
return basic_atomic_cell_view<mutable_view::no>(_view);
}
void swap(basic_atomic_cell_view& other) noexcept {
using std::swap;
swap(_view, other._view);
}
bool is_counter_update() const {
return atomic_cell_type::is_counter_update(_view);
return _view.is_counter_update();
}
bool is_live() const {
return atomic_cell_type::is_live(_view);
return _view.is_live();
}
bool is_live(tombstone t, bool is_counter) const {
return is_live() && !is_covered_by(t, is_counter);
@@ -233,72 +83,73 @@ public:
return is_live() && !is_covered_by(t, is_counter) && !has_expired(now);
}
bool is_live_and_has_ttl() const {
return atomic_cell_type::is_live_and_has_ttl(_view);
return _view.is_expiring();
}
bool is_dead(gc_clock::time_point now) const {
return atomic_cell_type::is_dead(_view) || has_expired(now);
return !is_live() || has_expired(now);
}
bool is_covered_by(tombstone t, bool is_counter) const {
return timestamp() <= t.timestamp || (is_counter && t.timestamp != api::missing_timestamp);
}
// Can be called on live and dead cells
api::timestamp_type timestamp() const {
return atomic_cell_type::timestamp(_view);
return _view.timestamp();
}
void set_timestamp(api::timestamp_type ts) {
atomic_cell_type::set_timestamp(_view, ts);
_view.set_timestamp(ts);
}
// Can be called on live cells only
atomic_cell_value_basic_view<is_mutable> value() const {
return atomic_cell_type::value(_view);
data::basic_value_view<is_mutable> value() const {
return _view.value();
}
// Can be called on live cells only
size_t value_size() const {
return atomic_cell_type::value(_view).size();
return _view.value_size();
}
bool is_value_fragmented() const {
return _view.is_fragmented();
return _view.is_value_fragmented();
}
// Can be called on live counter update cells only
int64_t counter_update_value() const {
return atomic_cell_type::counter_update_value(_view);
return _view.counter_update_value();
}
// Can be called only when is_dead(gc_clock::time_point)
gc_clock::time_point deletion_time() const {
return !is_live() ? atomic_cell_type::deletion_time(_view) : expiry() - ttl();
return !is_live() ? _view.deletion_time() : expiry() - ttl();
}
// Can be called only when is_live_and_has_ttl()
gc_clock::time_point expiry() const {
return atomic_cell_type::expiry(_view);
return _view.expiry();
}
// Can be called only when is_live_and_has_ttl()
gc_clock::duration ttl() const {
return atomic_cell_type::ttl(_view);
return _view.ttl();
}
// Can be called on live and dead cells
bool has_expired(gc_clock::time_point now) const {
return is_live_and_has_ttl() && expiry() <= now;
}
managed_bytes_view serialize() const {
return _view;
bytes_view serialize() const {
return _view.serialize();
}
};
class atomic_cell_view final : public basic_atomic_cell_view<mutable_view::no> {
atomic_cell_view(managed_bytes_view v)
: basic_atomic_cell_view(v) {}
atomic_cell_view(const data::type_info& ti, const uint8_t* data)
: basic_atomic_cell_view<mutable_view::no>(ti, data) {}
template<mutable_view is_mutable>
atomic_cell_view(basic_atomic_cell_view<is_mutable> view)
: basic_atomic_cell_view<mutable_view::no>(view) {}
atomic_cell_view(data::cell::basic_atomic_cell_view<is_mutable> view)
: basic_atomic_cell_view<mutable_view::no>(view) { }
friend class atomic_cell;
public:
static atomic_cell_view from_bytes(const abstract_type& t, managed_bytes_view v) {
return atomic_cell_view(v);
static atomic_cell_view from_bytes(const data::type_info& ti, const imr::utils::object<data::cell::structure>& data) {
return atomic_cell_view(ti, data.get());
}
static atomic_cell_view from_bytes(const abstract_type& t, bytes_view v) {
return atomic_cell_view(managed_bytes_view(v));
static atomic_cell_view from_bytes(const data::type_info& ti, bytes_view bv) {
return atomic_cell_view(ti, reinterpret_cast<const uint8_t*>(bv.begin()));
}
friend std::ostream& operator<<(std::ostream& os, const atomic_cell_view& acv);
@@ -313,11 +164,11 @@ public:
};
class atomic_cell_mutable_view final : public basic_atomic_cell_view<mutable_view::yes> {
atomic_cell_mutable_view(managed_bytes_mutable_view data)
: basic_atomic_cell_view(data) {}
atomic_cell_mutable_view(const data::type_info& ti, uint8_t* data)
: basic_atomic_cell_view<mutable_view::yes>(ti, data) {}
public:
static atomic_cell_mutable_view from_bytes(const abstract_type& t, managed_bytes_mutable_view v) {
return atomic_cell_mutable_view(v);
static atomic_cell_mutable_view from_bytes(const data::type_info& ti, imr::utils::object<data::cell::structure>& data) {
return atomic_cell_mutable_view(ti, data.get());
}
friend class atomic_cell;
@@ -326,31 +177,26 @@ public:
using atomic_cell_ref = atomic_cell_mutable_view;
class atomic_cell final : public basic_atomic_cell_view<mutable_view::yes> {
managed_bytes _data;
atomic_cell(managed_bytes b) : _data(std::move(b)) {
set_view(_data);
}
using imr_object_type = imr::utils::object<data::cell::structure>;
imr_object_type _data;
atomic_cell(const data::type_info& ti, imr::utils::object<data::cell::structure>&& data)
: basic_atomic_cell_view<mutable_view::yes>(ti, data.get()), _data(std::move(data)) {}
public:
class collection_member_tag;
using collection_member = bool_class<collection_member_tag>;
atomic_cell(atomic_cell&& o) noexcept : _data(std::move(o._data)) {
set_view(_data);
}
atomic_cell(atomic_cell&&) = default;
atomic_cell& operator=(const atomic_cell&) = delete;
atomic_cell& operator=(atomic_cell&& o) {
_data = std::move(o._data);
set_view(_data);
return *this;
atomic_cell& operator=(atomic_cell&&) = default;
void swap(atomic_cell& other) noexcept {
basic_atomic_cell_view<mutable_view::yes>::swap(other);
_data.swap(other._data);
}
operator atomic_cell_view() const { return atomic_cell_view(managed_bytes_view(_data)); }
operator atomic_cell_view() const { return atomic_cell_view(_view); }
atomic_cell(const abstract_type& t, atomic_cell_view other);
static atomic_cell make_dead(api::timestamp_type timestamp, gc_clock::time_point deletion_time);
static atomic_cell make_live(const abstract_type& type, api::timestamp_type timestamp, bytes_view value,
collection_member = collection_member::no);
static atomic_cell make_live(const abstract_type& type, api::timestamp_type timestamp, managed_bytes_view value,
collection_member = collection_member::no);
static atomic_cell make_live(const abstract_type& type, api::timestamp_type timestamp, ser::buffer_view<bytes_ostream::fragment_iterator> value,
collection_member = collection_member::no);
static atomic_cell make_live(const abstract_type& type, api::timestamp_type timestamp, const fragmented_temporary_buffer::view& value,
@@ -362,8 +208,6 @@ public:
static atomic_cell make_live_counter_update(api::timestamp_type timestamp, int64_t value);
static atomic_cell make_live(const abstract_type&, api::timestamp_type timestamp, bytes_view value,
gc_clock::time_point expiry, gc_clock::duration ttl, collection_member = collection_member::no);
static atomic_cell make_live(const abstract_type&, api::timestamp_type timestamp, managed_bytes_view value,
gc_clock::time_point expiry, gc_clock::duration ttl, collection_member = collection_member::no);
static atomic_cell make_live(const abstract_type&, api::timestamp_type timestamp, ser::buffer_view<bytes_ostream::fragment_iterator> value,
gc_clock::time_point expiry, gc_clock::duration ttl, collection_member = collection_member::no);
static atomic_cell make_live(const abstract_type&, api::timestamp_type timestamp, const fragmented_temporary_buffer::view& value,

View File

@@ -52,7 +52,9 @@ struct appending_hash<atomic_cell_view> {
feed_hash(h, cell.timestamp());
if (cell.is_live()) {
if (cdef.is_counter()) {
::feed_hash(h, counter_cell_view(cell));
counter_cell_view::with_linearized(cell, [&] (counter_cell_view ccv) {
::feed_hash(h, ccv);
});
return;
}
if (cell.is_live_and_has_ttl()) {

View File

@@ -26,14 +26,20 @@
#include "schema.hh"
#include "hashing.hh"
#include "imr/utils.hh"
// A variant type that can hold either an atomic_cell, or a serialized collection.
// Which type is stored is determined by the schema.
// Has an "empty" state.
// Objects moved-from are left in an empty state.
class atomic_cell_or_collection final {
managed_bytes _data;
// FIXME: This has made us lose small-buffer optimisation. Unfortunately,
// due to the changed cell format it would be less effective now, anyway.
// Measure the actual impact because any attempts to fix this will become
// irrelevant once rows are converted to the IMR as well, so maybe we can
// live with this like that.
using imr_object_type = imr::utils::object<data::cell::structure>;
imr_object_type _data;
private:
atomic_cell_or_collection(managed_bytes&& data) : _data(std::move(data)) {}
atomic_cell_or_collection(imr::utils::object<data::cell::structure>&& data) : _data(std::move(data)) {}
public:
atomic_cell_or_collection() = default;
atomic_cell_or_collection(atomic_cell_or_collection&&) = default;
@@ -43,16 +49,20 @@ public:
atomic_cell_or_collection(atomic_cell ac) : _data(std::move(ac._data)) {}
atomic_cell_or_collection(const abstract_type& at, atomic_cell_view acv);
static atomic_cell_or_collection from_atomic_cell(atomic_cell data) { return { std::move(data._data) }; }
atomic_cell_view as_atomic_cell(const column_definition& cdef) const { return atomic_cell_view::from_bytes(*cdef.type, _data); }
atomic_cell_mutable_view as_mutable_atomic_cell(const column_definition& cdef) { return atomic_cell_mutable_view::from_bytes(*cdef.type, _data); }
atomic_cell_view as_atomic_cell(const column_definition& cdef) const { return atomic_cell_view::from_bytes(cdef.type->imr_state().type_info(), _data); }
atomic_cell_ref as_atomic_cell_ref(const column_definition& cdef) { return atomic_cell_mutable_view::from_bytes(cdef.type->imr_state().type_info(), _data); }
atomic_cell_mutable_view as_mutable_atomic_cell(const column_definition& cdef) { return atomic_cell_mutable_view::from_bytes(cdef.type->imr_state().type_info(), _data); }
atomic_cell_or_collection(collection_mutation cm) : _data(std::move(cm._data)) { }
atomic_cell_or_collection copy(const abstract_type&) const;
explicit operator bool() const {
return !_data.empty();
return bool(_data);
}
static constexpr bool can_use_mutable_view() {
return true;
}
void swap(atomic_cell_or_collection& other) noexcept {
_data.swap(other._data);
}
static atomic_cell_or_collection from_collection_mutation(collection_mutation data) { return std::move(data._data); }
collection_mutation_view as_collection_mutation() const;
bytes_view serialize() const;
@@ -72,3 +82,12 @@ public:
};
friend std::ostream& operator<<(std::ostream&, const printer&);
};
namespace std {
inline void swap(atomic_cell_or_collection& a, atomic_cell_or_collection& b) noexcept
{
a.swap(b);
}
}

View File

@@ -108,7 +108,7 @@ future<> wait_for_schema_agreement(::service::migration_manager& mm, const datab
});
}
::service::query_state& internal_distributed_query_state() noexcept {
const timeout_config& internal_distributed_timeout_config() noexcept {
#ifdef DEBUG
// Give the much slower debug tests more headroom for completing auth queries.
static const auto t = 30s;
@@ -116,9 +116,7 @@ future<> wait_for_schema_agreement(::service::migration_manager& mm, const datab
static const auto t = 5s;
#endif
static const timeout_config tc{t, t, t, t, t, t, t};
static thread_local ::service::client_state cs(::service::client_state::internal_tag{}, tc);
static thread_local ::service::query_state qs(cs, empty_service_permit());
return qs;
return tc;
}
}

View File

@@ -35,7 +35,6 @@
#include "log.hh"
#include "seastarx.hh"
#include "utils/exponential_backoff_retry.hh"
#include "service/query_state.hh"
using namespace std::chrono_literals;
@@ -88,6 +87,6 @@ future<> wait_for_schema_agreement(::service::migration_manager&, const database
///
/// Time-outs for internal, non-local CQL queries.
///
::service::query_state& internal_distributed_query_state() noexcept;
const timeout_config& internal_distributed_timeout_config() noexcept;
}

View File

@@ -103,6 +103,7 @@ future<bool> default_authorizer::any_granted() const {
return _qp.execute_internal(
query,
db::consistency_level::LOCAL_ONE,
infinite_timeout_config,
{},
true).then([this](::shared_ptr<cql3::untyped_result_set> results) {
return !results->empty();
@@ -115,7 +116,8 @@ future<> default_authorizer::migrate_legacy_metadata() const {
return _qp.execute_internal(
query,
db::consistency_level::LOCAL_ONE).then([this](::shared_ptr<cql3::untyped_result_set> results) {
db::consistency_level::LOCAL_ONE,
infinite_timeout_config).then([this](::shared_ptr<cql3::untyped_result_set> results) {
return do_for_each(*results, [this](const cql3::untyped_result_set_row& row) {
return do_with(
row.get_as<sstring>("username"),
@@ -195,6 +197,7 @@ default_authorizer::authorize(const role_or_anonymous& maybe_role, const resourc
return _qp.execute_internal(
query,
db::consistency_level::LOCAL_ONE,
infinite_timeout_config,
{*maybe_role.name, r.name()}).then([](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {
return permissions::NONE;
@@ -223,7 +226,7 @@ default_authorizer::modify(
return _qp.execute_internal(
query,
db::consistency_level::ONE,
internal_distributed_query_state(),
internal_distributed_timeout_config(),
{permissions::to_strings(set), sstring(role_name), resource.name()}).discard_result();
});
}
@@ -248,7 +251,7 @@ future<std::vector<permission_details>> default_authorizer::list_all() const {
return _qp.execute_internal(
query,
db::consistency_level::ONE,
internal_distributed_query_state(),
internal_distributed_timeout_config(),
{},
true).then([](::shared_ptr<cql3::untyped_result_set> results) {
std::vector<permission_details> all_details;
@@ -275,7 +278,7 @@ future<> default_authorizer::revoke_all(std::string_view role_name) const {
return _qp.execute_internal(
query,
db::consistency_level::ONE,
internal_distributed_query_state(),
internal_distributed_timeout_config(),
{sstring(role_name)}).discard_result().handle_exception([role_name](auto ep) {
try {
std::rethrow_exception(ep);
@@ -295,6 +298,7 @@ future<> default_authorizer::revoke_all(const resource& resource) const {
return _qp.execute_internal(
query,
db::consistency_level::LOCAL_ONE,
infinite_timeout_config,
{resource.name()}).then_wrapped([this, resource](future<::shared_ptr<cql3::untyped_result_set>> f) {
try {
auto res = f.get0();
@@ -311,6 +315,7 @@ future<> default_authorizer::revoke_all(const resource& resource) const {
return _qp.execute_internal(
query,
db::consistency_level::LOCAL_ONE,
infinite_timeout_config,
{r.get_as<sstring>(ROLE_NAME), resource.name()}).discard_result().handle_exception(
[resource](auto ep) {
try {

View File

@@ -114,7 +114,7 @@ future<> password_authenticator::migrate_legacy_metadata() const {
return _qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_query_state()).then([this](::shared_ptr<cql3::untyped_result_set> results) {
internal_distributed_timeout_config()).then([this](::shared_ptr<cql3::untyped_result_set> results) {
return do_for_each(*results, [this](const cql3::untyped_result_set_row& row) {
auto username = row.get_as<sstring>("username");
auto salted_hash = row.get_as<sstring>(SALTED_HASH);
@@ -122,7 +122,7 @@ future<> password_authenticator::migrate_legacy_metadata() const {
return _qp.execute_internal(
update_row_query(),
consistency_for_user(username),
internal_distributed_query_state(),
internal_distributed_timeout_config(),
{std::move(salted_hash), username}).discard_result();
}).finally([results] {});
}).then([] {
@@ -139,7 +139,7 @@ future<> password_authenticator::create_default_if_missing() const {
return _qp.execute_internal(
update_row_query(),
db::consistency_level::QUORUM,
internal_distributed_query_state(),
internal_distributed_timeout_config(),
{passwords::hash(DEFAULT_USER_PASSWORD, rng_for_salt), DEFAULT_USER_NAME}).then([](auto&&) {
plogger.info("Created default superuser authentication record.");
});
@@ -236,7 +236,7 @@ future<authenticated_user> password_authenticator::authenticate(
return _qp.execute_internal(
query,
consistency_for_user(username),
internal_distributed_query_state(),
internal_distributed_timeout_config(),
{username},
true);
}).then_wrapped([=](future<::shared_ptr<cql3::untyped_result_set>> f) {
@@ -270,7 +270,7 @@ future<> password_authenticator::create(std::string_view role_name, const authen
return _qp.execute_internal(
update_row_query(),
consistency_for_user(role_name),
internal_distributed_query_state(),
internal_distributed_timeout_config(),
{passwords::hash(*options.password, rng_for_salt), sstring(role_name)}).discard_result();
}
@@ -287,7 +287,7 @@ future<> password_authenticator::alter(std::string_view role_name, const authent
return _qp.execute_internal(
query,
consistency_for_user(role_name),
internal_distributed_query_state(),
internal_distributed_timeout_config(),
{passwords::hash(*options.password, rng_for_salt), sstring(role_name)}).discard_result();
}
@@ -299,7 +299,7 @@ future<> password_authenticator::drop(std::string_view name) const {
return _qp.execute_internal(
query, consistency_for_user(name),
internal_distributed_query_state(),
internal_distributed_timeout_config(),
{sstring(name)}).discard_result();
}

View File

@@ -68,13 +68,14 @@ future<bool> default_role_row_satisfies(
return qp.execute_internal(
query,
db::consistency_level::ONE,
infinite_timeout_config,
{meta::DEFAULT_SUPERUSER_NAME},
true).then([&qp, &p](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {
return qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_query_state(),
internal_distributed_timeout_config(),
{meta::DEFAULT_SUPERUSER_NAME},
true).then([&p](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {
@@ -99,7 +100,7 @@ future<bool> any_nondefault_role_row_satisfies(
return qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_query_state()).then([&p](::shared_ptr<cql3::untyped_result_set> results) {
internal_distributed_timeout_config()).then([&p](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {
return false;
}

View File

@@ -210,6 +210,7 @@ future<bool> service::has_existing_legacy_users() const {
return _qp.execute_internal(
default_user_query,
db::consistency_level::ONE,
infinite_timeout_config,
{meta::DEFAULT_SUPERUSER_NAME},
true).then([this](auto results) {
if (!results->empty()) {
@@ -219,6 +220,7 @@ future<bool> service::has_existing_legacy_users() const {
return _qp.execute_internal(
default_user_query,
db::consistency_level::QUORUM,
infinite_timeout_config,
{meta::DEFAULT_SUPERUSER_NAME},
true).then([this](auto results) {
if (!results->empty()) {
@@ -227,7 +229,8 @@ future<bool> service::has_existing_legacy_users() const {
return _qp.execute_internal(
all_users_query,
db::consistency_level::QUORUM).then([](auto results) {
db::consistency_level::QUORUM,
infinite_timeout_config).then([](auto results) {
return make_ready_future<bool>(!results->empty());
});
});

View File

@@ -86,7 +86,7 @@ static future<std::optional<record>> find_record(cql3::query_processor& qp, std:
return qp.execute_internal(
query,
consistency_for_role(role_name),
internal_distributed_query_state(),
internal_distributed_timeout_config(),
{sstring(role_name)},
true).then([](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {
@@ -165,7 +165,7 @@ future<> standard_role_manager::create_default_role_if_missing() const {
return _qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_query_state(),
internal_distributed_timeout_config(),
{meta::DEFAULT_SUPERUSER_NAME}).then([](auto&&) {
log.info("Created default superuser role '{}'.", meta::DEFAULT_SUPERUSER_NAME);
return make_ready_future<>();
@@ -192,7 +192,7 @@ future<> standard_role_manager::migrate_legacy_metadata() const {
return _qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_query_state()).then([this](::shared_ptr<cql3::untyped_result_set> results) {
internal_distributed_timeout_config()).then([this](::shared_ptr<cql3::untyped_result_set> results) {
return do_for_each(*results, [this](const cql3::untyped_result_set_row& row) {
role_config config;
config.is_superuser = row.get_or<bool>("super", false);
@@ -253,7 +253,7 @@ future<> standard_role_manager::create_or_replace(std::string_view role_name, co
return _qp.execute_internal(
query,
consistency_for_role(role_name),
internal_distributed_query_state(),
internal_distributed_timeout_config(),
{sstring(role_name), c.is_superuser, c.can_login},
true).discard_result();
}
@@ -296,7 +296,7 @@ standard_role_manager::alter(std::string_view role_name, const role_config_updat
build_column_assignments(u),
meta::roles_table::role_col_name),
consistency_for_role(role_name),
internal_distributed_query_state(),
internal_distributed_timeout_config(),
{sstring(role_name)}).discard_result();
});
}
@@ -315,7 +315,7 @@ future<> standard_role_manager::drop(std::string_view role_name) const {
return _qp.execute_internal(
query,
consistency_for_role(role_name),
internal_distributed_query_state(),
internal_distributed_timeout_config(),
{sstring(role_name)}).then([this, role_name](::shared_ptr<cql3::untyped_result_set> members) {
return parallel_for_each(
members->begin(),
@@ -354,7 +354,7 @@ future<> standard_role_manager::drop(std::string_view role_name) const {
return _qp.execute_internal(
query,
consistency_for_role(role_name),
internal_distributed_query_state(),
internal_distributed_timeout_config(),
{sstring(role_name)}).discard_result();
};
@@ -381,7 +381,7 @@ standard_role_manager::modify_membership(
return _qp.execute_internal(
query,
consistency_for_role(grantee_name),
internal_distributed_query_state(),
internal_distributed_timeout_config(),
{role_set{sstring(role_name)}, sstring(grantee_name)}).discard_result();
};
@@ -392,7 +392,7 @@ standard_role_manager::modify_membership(
format("INSERT INTO {} (role, member) VALUES (?, ?)",
meta::role_members_table::qualified_name),
consistency_for_role(role_name),
internal_distributed_query_state(),
internal_distributed_timeout_config(),
{sstring(role_name), sstring(grantee_name)}).discard_result();
case membership_change::remove:
@@ -400,7 +400,7 @@ standard_role_manager::modify_membership(
format("DELETE FROM {} WHERE role = ? AND member = ?",
meta::role_members_table::qualified_name),
consistency_for_role(role_name),
internal_distributed_query_state(),
internal_distributed_timeout_config(),
{sstring(role_name), sstring(grantee_name)}).discard_result();
}
@@ -503,7 +503,7 @@ future<role_set> standard_role_manager::query_all() const {
return _qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_query_state()).then([](::shared_ptr<cql3::untyped_result_set> results) {
internal_distributed_timeout_config()).then([](::shared_ptr<cql3::untyped_result_set> results) {
role_set roles;
std::transform(

View File

@@ -24,10 +24,9 @@
#include <boost/range/iterator_range.hpp>
#include "bytes.hh"
#include <seastar/core/unaligned.hh>
#include "hashing.hh"
#include <seastar/core/simple-stream.hh>
#include <concepts>
/**
* Utility for writing data into a buffer when its final size is not known up front.
*
@@ -234,9 +233,9 @@ public:
};
// Returns a place holder for a value to be written later.
template <std::integral T>
template <typename T>
inline
place_holder<T>
std::enable_if_t<std::is_fundamental<T>::value, place_holder<T>>
write_place_holder() {
return place_holder<T>{alloc(sizeof(T))};
}

View File

@@ -102,7 +102,7 @@ class cache_flat_mutation_reader final : public flat_mutation_reader::impl {
// Points to the underlying reader conforming to _schema,
// either to *_underlying_holder or _read_context->underlying().underlying().
flat_mutation_reader* _underlying = nullptr;
flat_mutation_reader_opt _underlying_holder;
std::optional<flat_mutation_reader> _underlying_holder;
future<> do_fill_buffer(db::timeout_clock::time_point);
future<> ensure_underlying(db::timeout_clock::time_point);
@@ -112,7 +112,6 @@ class cache_flat_mutation_reader final : public flat_mutation_reader::impl {
void move_to_next_range();
void move_to_range(query::clustering_row_ranges::const_iterator);
void move_to_next_entry();
void maybe_drop_last_entry() noexcept;
void add_to_buffer(const partition_snapshot_row_cursor&);
void add_clustering_row_to_buffer(mutation_fragment&&);
void add_to_buffer(range_tombstone&&);
@@ -123,7 +122,6 @@ class cache_flat_mutation_reader final : public flat_mutation_reader::impl {
bool can_populate() const;
// Marks the range between _last_row (exclusive) and _next_row (exclusive) as continuous,
// provided that the underlying reader still matches the latest version of the partition.
// Invalidates _last_row.
void maybe_update_continuity();
// Tries to ensure that the lower bound of the current population range exists.
// Returns false if it failed and range cannot be populated.
@@ -165,12 +163,11 @@ public:
cache_flat_mutation_reader(const cache_flat_mutation_reader&) = delete;
cache_flat_mutation_reader(cache_flat_mutation_reader&&) = delete;
virtual future<> fill_buffer(db::timeout_clock::time_point timeout) override;
virtual future<> next_partition() override {
virtual void next_partition() override {
clear_buffer_to_next_partition();
if (is_buffer_empty()) {
_end_of_stream = true;
}
return make_ready_future<>();
}
virtual future<> fast_forward_to(const dht::partition_range&, db::timeout_clock::time_point timeout) override {
clear_buffer();
@@ -334,6 +331,7 @@ future<> cache_flat_mutation_reader::read_from_underlying(db::timeout_clock::tim
}
if (_next_row_in_range) {
maybe_update_continuity();
_last_row = _next_row;
add_to_buffer(_next_row);
try {
move_to_next_entry();
@@ -346,14 +344,14 @@ future<> cache_flat_mutation_reader::read_from_underlying(db::timeout_clock::tim
if (no_clustering_row_between(*_schema, _upper_bound, _next_row.position())) {
this->maybe_update_continuity();
} else if (can_populate()) {
rows_entry::tri_compare cmp(*_schema);
rows_entry::compare less(*_schema);
auto& rows = _snp->version()->partition().clustered_rows();
if (query::is_single_row(*_schema, *_ck_ranges_curr)) {
with_allocator(_snp->region().allocator(), [&] {
auto e = alloc_strategy_unique_ptr<rows_entry>(
current_allocator().construct<rows_entry>(_ck_ranges_curr->start()->value()));
// Use _next_row iterator only as a hint, because there could be insertions after _upper_bound.
auto insert_result = rows.insert_before_hint(_next_row.get_iterator_in_latest_version(), *e, cmp);
auto insert_result = rows.insert_check(_next_row.get_iterator_in_latest_version(), *e, less);
auto inserted = insert_result.second;
auto it = insert_result.first;
if (inserted) {
@@ -369,7 +367,7 @@ future<> cache_flat_mutation_reader::read_from_underlying(db::timeout_clock::tim
auto e = alloc_strategy_unique_ptr<rows_entry>(
current_allocator().construct<rows_entry>(*_schema, _upper_bound, is_dummy::yes, is_continuous::yes));
// Use _next_row iterator only as a hint, because there could be insertions after _upper_bound.
auto insert_result = rows.insert_before_hint(_next_row.get_iterator_in_latest_version(), *e, cmp);
auto insert_result = rows.insert_check(_next_row.get_iterator_in_latest_version(), *e, less);
auto inserted = insert_result.second;
if (inserted) {
clogger.trace("csm {}: inserted dummy at {}", fmt::ptr(this), _upper_bound);
@@ -379,7 +377,6 @@ future<> cache_flat_mutation_reader::read_from_underlying(db::timeout_clock::tim
clogger.trace("csm {}: mark {} as continuous", fmt::ptr(this), insert_result.first->position());
insert_result.first->set_continuous(true);
}
maybe_drop_last_entry();
});
}
} else {
@@ -410,12 +407,12 @@ bool cache_flat_mutation_reader::ensure_population_lower_bound() {
if (!_last_row.is_in_latest_version()) {
with_allocator(_snp->region().allocator(), [&] {
auto& rows = _snp->version()->partition().clustered_rows();
rows_entry::tri_compare cmp(*_schema);
rows_entry::compare less(*_schema);
// FIXME: Avoid the copy by inserting an incomplete clustering row
auto e = alloc_strategy_unique_ptr<rows_entry>(
current_allocator().construct<rows_entry>(*_schema, *_last_row));
e->set_continuous(false);
auto insert_result = rows.insert_before_hint(rows.end(), *e, cmp);
auto insert_result = rows.insert_check(rows.end(), *e, less);
auto inserted = insert_result.second;
if (inserted) {
clogger.trace("csm {}: inserted lower bound dummy at {}", fmt::ptr(this), e->position());
@@ -433,7 +430,6 @@ void cache_flat_mutation_reader::maybe_update_continuity() {
with_allocator(_snp->region().allocator(), [&] {
rows_entry& e = _next_row.ensure_entry_in_latest().row;
e.set_continuous(true);
maybe_drop_last_entry();
});
} else {
_read_context->cache().on_mispopulate();
@@ -462,7 +458,7 @@ void cache_flat_mutation_reader::maybe_add_to_cache(const clustering_row& cr) {
clogger.trace("csm {}: populate({})", fmt::ptr(this), clustering_row::printer(*_schema, cr));
_lsa_manager.run_in_update_section_with_allocator([this, &cr] {
mutation_partition& mp = _snp->version()->partition();
rows_entry::tri_compare cmp(*_schema);
rows_entry::compare less(*_schema);
if (_read_context->digest_requested()) {
cr.cells().prepare_hash(*_schema, column_kind::regular_column);
@@ -471,8 +467,8 @@ void cache_flat_mutation_reader::maybe_add_to_cache(const clustering_row& cr) {
current_allocator().construct<rows_entry>(*_schema, cr.key(), cr.as_deletable_row()));
new_entry->set_continuous(false);
auto it = _next_row.iterators_valid() ? _next_row.get_iterator_in_latest_version()
: mp.clustered_rows().lower_bound(cr.key(), cmp);
auto insert_result = mp.clustered_rows().insert_before_hint(it, *new_entry, cmp);
: mp.clustered_rows().lower_bound(cr.key(), less);
auto insert_result = mp.clustered_rows().insert_check(it, *new_entry, less);
if (insert_result.second) {
_snp->tracker()->insert(*new_entry);
new_entry.release();
@@ -528,6 +524,7 @@ void cache_flat_mutation_reader::copy_from_cache_to_buffer() {
// We add the row to the buffer even when it's full.
// This simplifies the code. For more info see #3139.
if (_next_row_in_range) {
_last_row = _next_row;
add_to_buffer(_next_row);
move_to_next_entry();
} else {
@@ -576,8 +573,8 @@ void cache_flat_mutation_reader::move_to_range(query::clustering_row_ranges::con
clogger.trace("csm {}: insert dummy at {}", fmt::ptr(this), _lower_bound);
auto it = with_allocator(_lsa_manager.region().allocator(), [&] {
auto& rows = _snp->version()->partition().clustered_rows();
auto new_entry = alloc_strategy_unique_ptr<rows_entry>(current_allocator().construct<rows_entry>(*_schema, _lower_bound, is_dummy::yes, is_continuous::no));
return rows.insert_before(_next_row.get_iterator_in_latest_version(), std::move(new_entry));
auto new_entry = current_allocator().construct<rows_entry>(*_schema, _lower_bound, is_dummy::yes, is_continuous::no);
return rows.insert_before(_next_row.get_iterator_in_latest_version(), *new_entry);
});
_snp->tracker()->insert(*it);
_last_row = partition_snapshot_row_weakref(*_snp, it, true);
@@ -589,38 +586,6 @@ void cache_flat_mutation_reader::move_to_range(query::clustering_row_ranges::con
}
}
// Drops _last_row entry when possible without changing logical contents of the partition.
// Call only when _last_row and _next_row are valid.
// Calling after ensure_population_lower_bound() is ok.
// _next_row must have a greater position than _last_row.
// Invalidates references but keeps the _next_row valid.
inline
void cache_flat_mutation_reader::maybe_drop_last_entry() noexcept {
// Drop dummy entry if it falls inside a continuous range.
// This prevents unnecessary dummy entries from accumulating in cache and slowing down scans.
//
// Eviction can happen only from oldest versions to preserve the continuity non-overlapping rule
// (See docs/design-notes/row_cache.md)
//
if (_last_row
&& _last_row->dummy()
&& _last_row->continuous()
&& _snp->at_latest_version()
&& _snp->at_oldest_version()) {
with_allocator(_snp->region().allocator(), [&] {
_last_row->on_evicted(_read_context->cache()._tracker);
});
_last_row = nullptr;
// There could be iterators pointing to _last_row, invalidate them
_snp->region().allocator().invalidate_references();
// Don't invalidate _next_row, move_to_next_entry() expects it to be still valid.
_next_row.force_valid();
}
}
// _next_row must be inside the range.
inline
void cache_flat_mutation_reader::move_to_next_entry() {
@@ -628,18 +593,14 @@ void cache_flat_mutation_reader::move_to_next_entry() {
if (no_clustering_row_between(*_schema, _next_row.position(), _upper_bound)) {
move_to_next_range();
} else {
auto new_last_row = partition_snapshot_row_weakref(_next_row);
if (!_next_row.next()) {
move_to_end();
return;
}
_last_row = std::move(new_last_row);
_next_row_in_range = !after_current_range(_next_row.position());
clogger.trace("csm {}: next={}, cont={}, in_range={}", fmt::ptr(this), _next_row.position(), _next_row.continuous(), _next_row_in_range);
if (!_next_row.continuous()) {
start_reading_from_underlying();
} else {
maybe_drop_last_entry();
}
}
}
@@ -660,13 +621,6 @@ void cache_flat_mutation_reader::add_to_buffer(const partition_snapshot_row_curs
if (!row.dummy()) {
_read_context->cache().on_row_hit();
add_clustering_row_to_buffer(mutation_fragment(*_schema, _permit, row.row(_read_context->digest_requested())));
} else {
position_in_partition::less_compare less(*_schema);
if (less(_lower_bound, row.position())) {
_lower_bound = row.position();
_lower_bound_changed = true;
}
_read_context->cache()._tracker.on_dummy_row_hit();
}
}

View File

@@ -37,7 +37,7 @@
#include "idl/mutation.dist.impl.hh"
#include <iostream>
canonical_mutation::canonical_mutation(bytes_ostream data)
canonical_mutation::canonical_mutation(bytes data)
: _data(std::move(data))
{ }
@@ -45,7 +45,8 @@ canonical_mutation::canonical_mutation(const mutation& m)
{
mutation_partition_serializer part_ser(*m.schema(), m.partition());
ser::writer_of_canonical_mutation<bytes_ostream> wr(_data);
bytes_ostream out;
ser::writer_of_canonical_mutation<bytes_ostream> wr(out);
std::move(wr).write_table_id(m.schema()->id())
.write_schema_version(m.schema()->version())
.write_key(m.key())
@@ -53,6 +54,7 @@ canonical_mutation::canonical_mutation(const mutation& m)
.partition([&] (auto wr) {
part_ser.write(std::move(wr));
}).end_canonical_mutation();
_data = to_bytes(out.linearize());
}
utils::UUID canonical_mutation::column_family_id() const {

View File

@@ -32,9 +32,9 @@
// Safe to access from other shards via const&.
// Safe to pass serialized across nodes.
class canonical_mutation {
bytes_ostream _data;
bytes _data;
public:
explicit canonical_mutation(bytes_ostream);
explicit canonical_mutation(bytes);
explicit canonical_mutation(const mutation&);
canonical_mutation(canonical_mutation&&) = default;
@@ -51,7 +51,7 @@ public:
utils::UUID column_family_id() const;
const bytes_ostream& representation() const { return _data; }
const bytes& representation() const { return _data; }
friend std::ostream& operator<<(std::ostream& os, const canonical_mutation& cm);
};

View File

@@ -24,6 +24,7 @@
#include <unordered_set>
#include <algorithm>
#include <seastar/core/sleep.hh>
#include <algorithm>
#include <seastar/core/coroutine.hh>
#include "keys.hh"
@@ -40,7 +41,6 @@
#include "cdc/generation.hh"
#include "cdc/cdc_options.hh"
#include "cdc/generation_service.hh"
extern logging::logger cdc_log;
@@ -341,7 +341,8 @@ topology_description limit_number_of_streams_if_needed(topology_description&& de
return topology_description(std::move(entries));
}
future<db_clock::time_point> make_new_cdc_generation(
// Run inside seastar::async context.
db_clock::time_point make_new_cdc_generation(
const db::config& cfg,
const std::unordered_set<dht::token>& bootstrap_tokens,
const locator::token_metadata_ptr tmptr,
@@ -368,9 +369,9 @@ future<db_clock::time_point> make_new_cdc_generation(
auto ts = db_clock::now() + (
(!add_delay || ring_delay == milliseconds(0)) ? milliseconds(0) : (
2 * ring_delay + duration_cast<milliseconds>(generation_leeway)));
co_await sys_dist_ks.insert_cdc_topology_description(ts, std::move(gen), { tmptr->count_normal_token_owners() });
sys_dist_ks.insert_cdc_topology_description(ts, std::move(gen), { tmptr->count_normal_token_owners() }).get();
co_return ts;
return ts;
}
std::optional<db_clock::time_point> get_streams_timestamp_for(const gms::inet_address& endpoint, const gms::gossiper& g) {
@@ -399,35 +400,34 @@ static future<> do_update_streams_description(
cdc_log.info("CDC description table successfully updated with generation {}.", streams_ts);
}
future<> update_streams_description(
void update_streams_description(
db_clock::time_point streams_ts,
shared_ptr<db::system_distributed_keyspace> sys_dist_ks,
noncopyable_function<unsigned()> get_num_token_owners,
abort_source& abort_src) {
try {
co_await do_update_streams_description(streams_ts, *sys_dist_ks, { get_num_token_owners() });
} catch (...) {
do_update_streams_description(streams_ts, *sys_dist_ks, { get_num_token_owners() }).get();
} catch(...) {
cdc_log.warn(
"Could not update CDC description table with generation {}: {}. Will retry in the background.",
streams_ts, std::current_exception());
// It is safe to discard this future: we keep system distributed keyspace alive.
(void)(([] (db_clock::time_point streams_ts,
shared_ptr<db::system_distributed_keyspace> sys_dist_ks,
noncopyable_function<unsigned()> get_num_token_owners,
abort_source& abort_src) -> future<> {
(void)seastar::async([
streams_ts, sys_dist_ks, get_num_token_owners = std::move(get_num_token_owners), &abort_src
] {
while (true) {
co_await sleep_abortable(std::chrono::seconds(60), abort_src);
sleep_abortable(std::chrono::seconds(60), abort_src).get();
try {
co_await do_update_streams_description(streams_ts, *sys_dist_ks, { get_num_token_owners() });
co_return;
do_update_streams_description(streams_ts, *sys_dist_ks, { get_num_token_owners() }).get();
return;
} catch (...) {
cdc_log.warn(
"Could not update CDC description table with generation {}: {}. Will try again.",
streams_ts, std::current_exception());
}
}
})(streams_ts, std::move(sys_dist_ks), std::move(get_num_token_owners), abort_src));
});
}
}
@@ -603,357 +603,4 @@ future<> maybe_rewrite_streams_descriptions(
})());
}
static void assert_shard_zero(const sstring& where) {
if (this_shard_id() != 0) {
on_internal_error(cdc_log, format("`{}`: must be run on shard 0", where));
}
}
class and_reducer {
private:
bool _result = true;
public:
future<> operator()(bool value) {
_result = value && _result;
return make_ready_future<>();
}
bool get() {
return _result;
}
};
class or_reducer {
private:
bool _result = false;
public:
future<> operator()(bool value) {
_result = value || _result;
return make_ready_future<>();
}
bool get() {
return _result;
}
};
class generation_handling_nonfatal_exception : public std::runtime_error {
using std::runtime_error::runtime_error;
};
constexpr char could_not_retrieve_msg_template[]
= "Could not retrieve CDC streams with timestamp {} upon gossip event. Reason: \"{}\". Action: {}.";
generation_service::generation_service(
const db::config& cfg, gms::gossiper& g, sharded<db::system_distributed_keyspace>& sys_dist_ks,
abort_source& abort_src, const locator::shared_token_metadata& stm)
: _cfg(cfg), _gossiper(g), _sys_dist_ks(sys_dist_ks), _abort_src(abort_src), _token_metadata(stm) {
}
future<> generation_service::stop() {
if (this_shard_id() == 0) {
co_await _gossiper.unregister_(shared_from_this());
}
_stopped = true;
}
generation_service::~generation_service() {
assert(_stopped);
}
future<> generation_service::after_join(std::optional<db_clock::time_point>&& startup_gen_ts) {
assert_shard_zero(__PRETTY_FUNCTION__);
assert(db::system_keyspace::bootstrap_complete());
_gen_ts = std::move(startup_gen_ts);
_gossiper.register_(shared_from_this());
_joined = true;
// Retrieve the latest CDC generation seen in gossip (if any).
co_await scan_cdc_generations();
}
void generation_service::on_join(gms::inet_address ep, gms::endpoint_state ep_state) {
assert_shard_zero(__PRETTY_FUNCTION__);
auto val = ep_state.get_application_state_ptr(gms::application_state::CDC_STREAMS_TIMESTAMP);
if (!val) {
return;
}
on_change(ep, gms::application_state::CDC_STREAMS_TIMESTAMP, *val);
}
void generation_service::on_change(gms::inet_address ep, gms::application_state app_state, const gms::versioned_value& v) {
assert_shard_zero(__PRETTY_FUNCTION__);
if (app_state != gms::application_state::CDC_STREAMS_TIMESTAMP) {
return;
}
auto ts = gms::versioned_value::cdc_streams_timestamp_from_string(v.value);
cdc_log.debug("Endpoint: {}, CDC generation timestamp change: {}", ep, ts);
handle_cdc_generation(ts).get();
}
future<> generation_service::check_and_repair_cdc_streams() {
if (!_joined) {
throw std::runtime_error("check_and_repair_cdc_streams: node not initialized yet");
}
auto latest = _gen_ts;
const auto& endpoint_states = _gossiper.get_endpoint_states();
for (const auto& [addr, state] : endpoint_states) {
if (!_gossiper.is_normal(addr)) {
throw std::runtime_error(format("All nodes must be in NORMAL state while performing check_and_repair_cdc_streams"
" ({} is in state {})", addr, _gossiper.get_gossip_status(state)));
}
const auto ts = get_streams_timestamp_for(addr, _gossiper);
if (!latest || (ts && *ts > *latest)) {
latest = ts;
}
}
bool should_regenerate = false;
std::optional<topology_description> gen;
static const auto timeout_msg = "Timeout while fetching CDC topology description";
static const auto topology_read_error_note = "Note: this is likely caused by"
" node(s) being down or unreachable. It is recommended to check the network and"
" restart/remove the failed node(s), then retry checkAndRepairCdcStreams command";
static const auto exception_translating_msg = "Translating the exception to `request_execution_exception`";
const auto tmptr = _token_metadata.get();
auto sys_dist_ks = get_sys_dist_ks();
try {
gen = co_await sys_dist_ks->read_cdc_topology_description(
*latest, { tmptr->count_normal_token_owners() });
} catch (exceptions::request_timeout_exception& e) {
cdc_log.error("{}: \"{}\". {}.", timeout_msg, e.what(), exception_translating_msg);
throw exceptions::request_execution_exception(exceptions::exception_code::READ_TIMEOUT,
format("{}. {}.", timeout_msg, topology_read_error_note));
} catch (exceptions::unavailable_exception& e) {
static const auto unavailable_msg = "Node(s) unavailable while fetching CDC topology description";
cdc_log.error("{}: \"{}\". {}.", unavailable_msg, e.what(), exception_translating_msg);
throw exceptions::request_execution_exception(exceptions::exception_code::UNAVAILABLE,
format("{}. {}.", unavailable_msg, topology_read_error_note));
} catch (...) {
const auto ep = std::current_exception();
if (is_timeout_exception(ep)) {
cdc_log.error("{}: \"{}\". {}.", timeout_msg, ep, exception_translating_msg);
throw exceptions::request_execution_exception(exceptions::exception_code::READ_TIMEOUT,
format("{}. {}.", timeout_msg, topology_read_error_note));
}
// On exotic errors proceed with regeneration
cdc_log.error("Exception while reading CDC topology description: \"{}\". Regenerating streams anyway.", ep);
should_regenerate = true;
}
if (!gen) {
cdc_log.error(
"Could not find CDC generation with timestamp {} in distributed system tables (current time: {}),"
" even though some node gossiped about it.",
latest, db_clock::now());
should_regenerate = true;
} else {
std::unordered_set<dht::token> gen_ends;
for (const auto& entry : gen->entries()) {
gen_ends.insert(entry.token_range_end);
}
for (const auto& metadata_token : tmptr->sorted_tokens()) {
if (!gen_ends.contains(metadata_token)) {
cdc_log.warn("CDC generation {} missing token {}. Regenerating.", latest, metadata_token);
should_regenerate = true;
break;
}
}
}
if (!should_regenerate) {
if (latest != _gen_ts) {
co_await do_handle_cdc_generation(*latest);
}
cdc_log.info("CDC generation {} does not need repair", latest);
co_return;
}
const auto new_gen_ts = co_await make_new_cdc_generation(_cfg,
{}, std::move(tmptr), _gossiper, *sys_dist_ks,
std::chrono::milliseconds(_cfg.ring_delay_ms()), true /* add delay */);
// Need to artificially update our STATUS so other nodes handle the timestamp change
auto status = _gossiper.get_application_state_ptr(
utils::fb_utilities::get_broadcast_address(), gms::application_state::STATUS);
if (!status) {
cdc_log.error("Our STATUS is missing");
cdc_log.error("Aborting CDC generation repair due to missing STATUS");
co_return;
}
// Update _gen_ts first, so that do_handle_cdc_generation (which will get called due to the status update)
// won't try to update the gossiper, which would result in a deadlock inside add_local_application_state
_gen_ts = new_gen_ts;
co_await _gossiper.add_local_application_state({
{ gms::application_state::CDC_STREAMS_TIMESTAMP, gms::versioned_value::cdc_streams_timestamp(new_gen_ts) },
{ gms::application_state::STATUS, *status }
});
co_await db::system_keyspace::update_cdc_streams_timestamp(new_gen_ts);
}
future<> generation_service::handle_cdc_generation(std::optional<db_clock::time_point> ts) {
assert_shard_zero(__PRETTY_FUNCTION__);
if (!ts) {
co_return;
}
if (!db::system_keyspace::bootstrap_complete() || !_sys_dist_ks.local_is_initialized()
|| !_sys_dist_ks.local().started()) {
// The service should not be listening for generation changes until after the node
// is bootstrapped. Therefore we would previously assume that this condition
// can never become true and call on_internal_error here, but it turns out that
// it may become true on decommission: the node enters NEEDS_BOOTSTRAP
// state before leaving the token ring, so bootstrap_complete() becomes false.
// In that case we can simply return.
co_return;
}
if (co_await container().map_reduce(and_reducer(), [ts = *ts] (generation_service& svc) {
return !svc._cdc_metadata.prepare(ts);
})) {
co_return;
}
bool using_this_gen = false;
try {
using_this_gen = co_await do_handle_cdc_generation_intercept_nonfatal_errors(*ts);
} catch (generation_handling_nonfatal_exception& e) {
cdc_log.warn(could_not_retrieve_msg_template, ts, e.what(), "retrying in the background");
async_handle_cdc_generation(*ts);
co_return;
} catch (...) {
cdc_log.error(could_not_retrieve_msg_template, ts, std::current_exception(), "not retrying");
co_return; // Exotic ("fatal") exception => do not retry
}
if (using_this_gen) {
cdc_log.info("Starting to use generation {}", *ts);
co_await update_streams_description(*ts, get_sys_dist_ks(),
[tmptr = _token_metadata.get()] { return tmptr->count_normal_token_owners(); },
_abort_src);
}
}
void generation_service::async_handle_cdc_generation(db_clock::time_point ts) {
assert_shard_zero(__PRETTY_FUNCTION__);
(void)(([] (db_clock::time_point ts, shared_ptr<generation_service> svc) -> future<> {
while (true) {
co_await sleep_abortable(std::chrono::seconds(5), svc->_abort_src);
try {
bool using_this_gen = co_await svc->do_handle_cdc_generation_intercept_nonfatal_errors(ts);
if (using_this_gen) {
cdc_log.info("Starting to use generation {}", ts);
co_await update_streams_description(ts, svc->get_sys_dist_ks(),
[tmptr = svc->_token_metadata.get()] { return tmptr->count_normal_token_owners(); },
svc->_abort_src);
}
co_return;
} catch (generation_handling_nonfatal_exception& e) {
cdc_log.warn(could_not_retrieve_msg_template, ts, e.what(), "continuing to retry in the background");
} catch (...) {
cdc_log.error(could_not_retrieve_msg_template, ts, std::current_exception(), "not retrying anymore");
co_return; // Exotic ("fatal") exception => do not retry
}
if (co_await svc->container().map_reduce(and_reducer(), [ts] (generation_service& svc) {
return svc._cdc_metadata.known_or_obsolete(ts);
})) {
co_return;
}
}
})(ts, shared_from_this()));
}
future<> generation_service::scan_cdc_generations() {
assert_shard_zero(__PRETTY_FUNCTION__);
std::optional<db_clock::time_point> latest;
for (const auto& ep: _gossiper.get_endpoint_states()) {
auto ts = get_streams_timestamp_for(ep.first, _gossiper);
if (!latest || (ts && *ts > *latest)) {
latest = ts;
}
}
if (latest) {
cdc_log.info("Latest generation seen during startup: {}", *latest);
co_await handle_cdc_generation(latest);
} else {
cdc_log.info("No generation seen during startup.");
}
}
future<bool> generation_service::do_handle_cdc_generation_intercept_nonfatal_errors(db_clock::time_point ts) {
assert_shard_zero(__PRETTY_FUNCTION__);
try {
co_return co_await do_handle_cdc_generation(ts);
} catch (exceptions::request_timeout_exception& e) {
throw generation_handling_nonfatal_exception(e.what());
} catch (exceptions::unavailable_exception& e) {
throw generation_handling_nonfatal_exception(e.what());
} catch (exceptions::read_failure_exception& e) {
throw generation_handling_nonfatal_exception(e.what());
} catch (...) {
const auto ep = std::current_exception();
if (is_timeout_exception(ep)) {
throw generation_handling_nonfatal_exception(format("{}", ep));
}
throw;
}
}
future<bool> generation_service::do_handle_cdc_generation(db_clock::time_point ts) {
assert_shard_zero(__PRETTY_FUNCTION__);
auto sys_dist_ks = get_sys_dist_ks();
auto gen = co_await sys_dist_ks->read_cdc_topology_description(
ts, { _token_metadata.get()->count_normal_token_owners() });
if (!gen) {
throw std::runtime_error(format(
"Could not find CDC generation with timestamp {} in distributed system tables (current time: {}),"
" even though some node gossiped about it.",
ts, db_clock::now()));
}
// If we're not gossiping our own generation timestamp (because we've upgraded from a non-CDC/old version,
// or we somehow lost it due to a byzantine failure), start gossiping someone else's timestamp.
// This is to avoid the upgrade check on every restart (see `should_propose_first_cdc_generation`).
// And if we notice that `ts` is higher than our timestamp, we will start gossiping it instead,
// so if the node that initially gossiped `ts` leaves the cluster while `ts` is still the latest generation,
// the cluster will remember.
if (!_gen_ts || *_gen_ts < ts) {
_gen_ts = ts;
co_await db::system_keyspace::update_cdc_streams_timestamp(ts);
co_await _gossiper.add_local_application_state(
gms::application_state::CDC_STREAMS_TIMESTAMP, gms::versioned_value::cdc_streams_timestamp(ts));
}
// Return `true` iff the generation was inserted on any of our shards.
co_return co_await container().map_reduce(or_reducer(), [ts, &gen] (generation_service& svc) {
auto gen_ = *gen;
return svc._cdc_metadata.insert(ts, std::move(gen_));
});
}
shared_ptr<db::system_distributed_keyspace> generation_service::get_sys_dist_ks() {
assert_shard_zero(__PRETTY_FUNCTION__);
if (!_sys_dist_ks.local_is_initialized()) {
throw std::runtime_error("system distributed keyspace not initialized");
}
return _sys_dist_ks.local_shared();
}
} // namespace cdc

View File

@@ -167,7 +167,7 @@ future<db_clock::time_point> get_local_streams_timestamp();
* (not guaranteed in the current implementation, but expected to be the common case;
* we assume that `ring_delay` is enough for other nodes to learn about the new generation).
*/
future<db_clock::time_point> make_new_cdc_generation(
db_clock::time_point make_new_cdc_generation(
const db::config& cfg,
const std::unordered_set<dht::token>& bootstrap_tokens,
const locator::token_metadata_ptr tmptr,
@@ -190,8 +190,10 @@ std::optional<db_clock::time_point> get_streams_timestamp_for(const gms::inet_ad
*
* Returning from this function does not mean that the table update was successful: the function
* might run an asynchronous task in the background.
*
* Run inside seastar::async context.
*/
future<> update_streams_description(
void update_streams_description(
db_clock::time_point,
shared_ptr<db::system_distributed_keyspace>,
noncopyable_function<unsigned()> get_num_token_owners,

View File

@@ -1,138 +0,0 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*
* Modified by ScyllaDB
* Copyright (C) 2021 ScyllaDB
*
*/
#pragma once
#include "cdc/metadata.hh"
#include "gms/i_endpoint_state_change_subscriber.hh"
namespace db {
class system_distributed_keyspace;
}
namespace gms {
class gossiper;
}
namespace cdc {
class generation_service : public peering_sharded_service<generation_service>
, public async_sharded_service<generation_service>
, public gms::i_endpoint_state_change_subscriber {
bool _stopped = false;
// The node has joined the token ring. Set to `true` on `after_join` call.
bool _joined = false;
const db::config& _cfg;
gms::gossiper& _gossiper;
sharded<db::system_distributed_keyspace>& _sys_dist_ks;
abort_source& _abort_src;
const locator::shared_token_metadata& _token_metadata;
/* Maintains the set of known CDC generations used to pick streams for log writes (i.e., the partition keys of these log writes).
* Updated in response to certain gossip events (see the handle_cdc_generation function).
*/
cdc::metadata _cdc_metadata;
/* The latest known generation timestamp and the timestamp that we're currently gossiping
* (as CDC_STREAMS_TIMESTAMP application state).
*
* Only shard 0 manages this, hence it will be std::nullopt on all shards other than 0.
* This timestamp is also persisted in the system.cdc_local table.
*
* On shard 0 this may be nullopt only in one special case: rolling upgrade, when we upgrade
* from an old version of Scylla that didn't support CDC. In that case one node in the cluster
* will create the first generation and start gossiping it; it may be us, or it may be some
* different node. In any case, eventually - after one of the nodes gossips the first timestamp
* - we'll catch on and this variable will be updated with that generation.
*/
std::optional<db_clock::time_point> _gen_ts;
public:
generation_service(const db::config&, gms::gossiper&,
sharded<db::system_distributed_keyspace>&, abort_source&, const locator::shared_token_metadata&);
future<> stop();
~generation_service();
/* After the node bootstraps and creates a new CDC generation, or restarts and loads the last
* known generation timestamp from persistent storage, this function should be called with
* that generation timestamp moved in as the `startup_gen_ts` parameter.
* This passes the responsibility of managing generations from the node startup code to this service;
* until then, the service remains dormant.
* At the time of writing this comment, the startup code is in `storage_service::join_token_ring`, hence
* `after_join` should be called at the end of that function.
* Precondition: the node has completed bootstrapping and system_distributed_keyspace is initialized.
* Must be called on shard 0 - that's where the generation management happens.
*/
future<> after_join(std::optional<db_clock::time_point>&& startup_gen_ts);
cdc::metadata& get_cdc_metadata() {
return _cdc_metadata;
}
virtual void before_change(gms::inet_address, gms::endpoint_state, gms::application_state, const gms::versioned_value&) override {}
virtual void on_alive(gms::inet_address, gms::endpoint_state) override {}
virtual void on_dead(gms::inet_address, gms::endpoint_state) override {}
virtual void on_remove(gms::inet_address) override {}
virtual void on_restart(gms::inet_address, gms::endpoint_state) override {}
virtual void on_join(gms::inet_address, gms::endpoint_state) override;
virtual void on_change(gms::inet_address, gms::application_state, const gms::versioned_value&) override;
future<> check_and_repair_cdc_streams();
private:
/* Retrieve the CDC generation which starts at the given timestamp (from a distributed table created for this purpose)
* and start using it for CDC log writes if it's not obsolete.
*/
future<> handle_cdc_generation(std::optional<db_clock::time_point>);
/* If `handle_cdc_generation` fails, it schedules an asynchronous retry in the background
* using `async_handle_cdc_generation`.
*/
void async_handle_cdc_generation(db_clock::time_point);
/* Wrapper around `do_handle_cdc_generation` which intercepts timeout/unavailability exceptions.
* Returns: do_handle_cdc_generation(ts). */
future<bool> do_handle_cdc_generation_intercept_nonfatal_errors(db_clock::time_point);
/* Returns `true` iff we started using the generation (it was not obsolete or already known),
* which means that this node might write some CDC log entries using streams from this generation. */
future<bool> do_handle_cdc_generation(db_clock::time_point);
/* Scan CDC generation timestamps gossiped by other nodes and retrieve the latest one.
* This function should be called once at the end of the node startup procedure
* (after the node is started and running normally, it will retrieve generations on gossip events instead).
*/
future<> scan_cdc_generations();
/* generation_service code might be racing with system_distributed_keyspace deinitialization
* (the deinitialization order is broken).
* Therefore, whenever we want to access sys_dist_ks in a background task,
* we need to check if the instance is still there. Storing the shared pointer will keep it alive.
*/
shared_ptr<db::system_distributed_keyspace> get_sys_dist_ks();
};
} // namespace cdc

View File

@@ -32,7 +32,6 @@
#include "cdc/split.hh"
#include "cdc/cdc_options.hh"
#include "cdc/change_visitor.hh"
#include "cdc/metadata.hh"
#include "bytes.hh"
#include "database.hh"
#include "db/config.hh"
@@ -49,9 +48,6 @@
#include "cql3/untyped_result_set.hh"
#include "log.hh"
#include "utils/rjson.hh"
#include "utils/UUID_gen.hh"
#include "utils/managed_bytes.hh"
#include "utils/fragment_range.hh"
#include "types.hh"
#include "concrete_types.hh"
#include "types/listlike_partial_deserializing_iterator.hh"
@@ -74,7 +70,7 @@ using namespace std::chrono_literals;
logging::logger cdc_log("cdc");
namespace cdc {
static schema_ptr create_log_schema(const schema&, std::optional<utils::UUID> = {}, schema_ptr = nullptr);
static schema_ptr create_log_schema(const schema&, std::optional<utils::UUID> = {});
}
static constexpr auto cdc_group_name = "cdc";
@@ -221,7 +217,7 @@ public:
return;
}
auto new_log_schema = create_log_schema(new_schema, log_schema ? std::make_optional(log_schema->id()) : std::nullopt, log_schema);
auto new_log_schema = create_log_schema(new_schema, log_schema ? std::make_optional(log_schema->id()) : std::nullopt);
auto log_mut = log_schema
? db::schema_tables::make_update_table_mutations(db, keyspace.metadata(), log_schema, new_log_schema, timestamp, false)
@@ -281,8 +277,8 @@ private:
}
};
cdc::cdc_service::cdc_service(service::storage_proxy& proxy, cdc::metadata& cdc_metadata)
: cdc_service(db_context::builder(proxy, cdc_metadata).build())
cdc::cdc_service::cdc_service(service::storage_proxy& proxy)
: cdc_service(db_context::builder(proxy).build())
{}
cdc::cdc_service::cdc_service(db_context ctxt)
@@ -490,7 +486,7 @@ bytes log_data_column_deleted_elements_name_bytes(const bytes& column_name) {
return to_bytes(cdc_deleted_elements_column_prefix) + column_name;
}
static schema_ptr create_log_schema(const schema& s, std::optional<utils::UUID> uuid, schema_ptr old) {
static schema_ptr create_log_schema(const schema& s, std::optional<utils::UUID> uuid) {
schema_builder b(s.ks_name(), log_name(s.cf_name()));
b.with_partitioner("com.scylladb.dht.CDCPartitioner");
b.set_compaction_strategy(sstables::compaction_strategy_type::time_window);
@@ -571,25 +567,11 @@ static schema_ptr create_log_schema(const schema& s, std::optional<utils::UUID>
b.set_uuid(*uuid);
}
/**
* #10473 - if we are redefining the log table, we need to ensure any dropped
* columns are registered in "dropped_columns" table, otherwise clients will not
* be able to read data older than now.
*/
if (old) {
// not super efficient, but we don't do this often.
for (auto& col : old->all_columns()) {
if (!b.has_column({col.name(), col.name_as_text() })) {
b.without_column(col.name_as_text(), col.type, api::new_timestamp());
}
}
}
return b.build();
}
db_context::builder::builder(service::storage_proxy& proxy, cdc::metadata& cdc_metadata)
: _proxy(proxy), _cdc_metadata(cdc_metadata)
db_context::builder::builder(service::storage_proxy& proxy)
: _proxy(proxy)
{}
db_context::builder& db_context::builder::with_migration_notifier(service::migration_notifier& migration_notifier) {
@@ -597,11 +579,16 @@ db_context::builder& db_context::builder::with_migration_notifier(service::migra
return *this;
}
db_context::builder& db_context::builder::with_cdc_metadata(cdc::metadata& cdc_metadata) {
_cdc_metadata = cdc_metadata;
return *this;
}
db_context db_context::builder::build() {
return db_context{
_proxy,
_migration_notifier ? _migration_notifier->get() : service::get_local_storage_service().get_migration_notifier(),
_cdc_metadata,
_cdc_metadata ? _cdc_metadata->get() : service::get_local_storage_service().get_cdc_metadata(),
};
}
@@ -679,14 +666,6 @@ void collection_iterator<bytes_view>::parse() {
_current = k;
}
template<>
void collection_iterator<managed_bytes_view>::parse() {
assert(_rem > 0);
_next = _v;
auto k = read_collection_value(_next, cql_serialization_format::internal());
_current = k;
}
template<typename Container, typename T>
class maybe_back_insert_iterator : public std::back_insert_iterator<Container> {
const abstract_type& _type;
@@ -787,18 +766,18 @@ static bytes merge(const set_type_impl& ctype, const bytes_opt& prev, const byte
return set_type_impl::serialize_partially_deserialized_form(res, cql_serialization_format::internal());
}
static bytes merge(const user_type_impl& type, const bytes_opt& prev, const bytes_opt& next, const bytes_opt& deleted) {
std::vector<managed_bytes_view_opt> res(type.size());
udt_for_each(prev, [&res, i = res.begin()](managed_bytes_view_opt k) mutable {
std::vector<bytes_view_opt> res(type.size());
udt_for_each(prev, [&res, i = res.begin()](bytes_view_opt k) mutable {
*i++ = k;
});
udt_for_each(next, [&res, i = res.begin()](managed_bytes_view_opt k) mutable {
udt_for_each(next, [&res, i = res.begin()](bytes_view_opt k) mutable {
if (k) {
*i = k;
}
++i;
});
collection_iterator<managed_bytes_view> e, d(deleted);
std::for_each(d, e, [&res](managed_bytes_view k) {
collection_iterator<bytes_view> e, d(deleted);
std::for_each(d, e, [&res](bytes_view k) {
auto index = deserialize_field_index(k);
res[index] = std::nullopt;
});
@@ -837,13 +816,13 @@ static bytes_opt get_preimage_col_value(const column_definition& cdef, const cql
auto v = pirow->get_view(cdef.name_as_text());
auto f = cql_serialization_format::internal();
auto n = read_collection_size(v, f);
std::vector<bytes> tmp;
std::vector<bytes_view> tmp;
tmp.reserve(n);
while (n--) {
tmp.emplace_back(read_collection_value(v, f).linearize()); // key
tmp.emplace_back(read_collection_value(v, f)); // key
read_collection_value(v, f); // value. ignore.
}
return set_type_impl::serialize_partially_deserialized_form({tmp.begin(), tmp.end()}, f);
return set_type_impl::serialize_partially_deserialized_form(tmp, f);
},
[&] (const abstract_type& o) -> bytes {
return pirow->get_blob(cdef.name_as_text());
@@ -999,13 +978,13 @@ private:
};
static bytes get_bytes(const atomic_cell_view& acv) {
return to_bytes(acv.value());
return acv.value().linearize();
}
static bytes_view get_bytes_view(const atomic_cell_view& acv, std::forward_list<bytes>& buf) {
return acv.value().is_fragmented()
? bytes_view{buf.emplace_front(to_bytes(acv.value()))}
: acv.value().current_fragment();
? bytes_view{buf.emplace_front(acv.value().linearize())}
: acv.value().first_fragment();
}
static ttl_opt get_ttl(const atomic_cell_view& acv) {
@@ -1158,7 +1137,7 @@ struct process_row_visitor {
_touched_parts.set<stats::part_type::UDT>();
struct udt_visitor : public collection_visitor {
std::vector<bytes_view_opt> _added_cells;
std::vector<bytes_opt> _added_cells;
std::forward_list<bytes>& _buf;
udt_visitor(ttl_opt& ttl_column, size_t num_keys, std::forward_list<bytes>& buf)
@@ -1670,7 +1649,13 @@ public:
try {
return _ctx._proxy.query(_schema, std::move(command), std::move(partition_ranges), select_cl, service::storage_proxy::coordinator_query_options(default_timeout(), empty_service_permit(), client_state)).then(
[s = _schema, partition_slice = std::move(partition_slice), selection = std::move(selection)] (service::storage_proxy::coordinator_query_result qr) -> lw_shared_ptr<cql3::untyped_result_set> {
return make_lw_shared<cql3::untyped_result_set>(*s, std::move(qr.query_result), *selection, partition_slice);
cql3::selection::result_set_builder builder(*selection, gc_clock::now(), cql_serialization_format::latest());
query::result_view::consume(*qr.query_result, partition_slice, cql3::selection::result_set_builder::visitor(builder, *s, *selection));
auto result_set = builder.build();
if (!result_set || result_set->empty()) {
return {};
}
return make_lw_shared<cql3::untyped_result_set>(*result_set);
});
} catch (exceptions::unavailable_exception& e) {
// `query` can throw `unavailable_exception`, which is seen by clients as ~ "NoHostAvailable".
@@ -1714,7 +1699,7 @@ public:
// as there will be no clustering row data to load into the state.
return;
}
ck_parts.emplace_back(v->linearize());
ck_parts.emplace_back(*v);
}
auto ck = clustering_key::from_exploded(std::move(ck_parts));

View File

@@ -80,7 +80,7 @@ class cdc_service final : public async_sharded_service<cdc::cdc_service> {
std::unique_ptr<impl> _impl;
public:
future<> stop();
cdc_service(service::storage_proxy&, cdc::metadata&);
cdc_service(service::storage_proxy&);
cdc_service(db_context);
~cdc_service();
@@ -104,12 +104,13 @@ struct db_context final {
class builder final {
service::storage_proxy& _proxy;
cdc::metadata& _cdc_metadata;
std::optional<std::reference_wrapper<service::migration_notifier>> _migration_notifier;
std::optional<std::reference_wrapper<cdc::metadata>> _cdc_metadata;
public:
builder(service::storage_proxy& proxy, cdc::metadata&);
builder(service::storage_proxy& proxy);
builder& with_migration_notifier(service::migration_notifier& migration_notifier);
builder& with_cdc_metadata(cdc::metadata&);
db_context build();
};

View File

@@ -31,7 +31,10 @@ class checked_file_impl : public file_impl {
public:
checked_file_impl(const io_error_handler& error_handler, file f)
: file_impl(*get_file_impl(f)), _error_handler(error_handler), _file(f) {
: _error_handler(error_handler), _file(f) {
_memory_dma_alignment = f.memory_dma_alignment();
_disk_read_dma_alignment = f.disk_read_dma_alignment();
_disk_write_dma_alignment = f.disk_write_dma_alignment();
}
virtual future<size_t> write_dma(uint64_t pos, const void* buffer, size_t len, const io_priority_class& pc) override {

View File

@@ -65,11 +65,6 @@ private:
_current_start = position_in_partition_view::for_range_start(_current_range.front());
_current_end = position_in_partition_view::for_range_end(_current_range.front());
}
} else {
// If the first range is contiguous with the static row, then advance _current_end as much as we can
if (_current_range && !_current_range.front().start()) {
_current_end = position_in_partition_view::for_range_end(_current_range.front());
}
}
}

View File

@@ -22,6 +22,7 @@
#include "types/collection.hh"
#include "types/user.hh"
#include "concrete_types.hh"
#include "atomic_cell_or_collection.hh"
#include "mutation_partition.hh"
#include "compaction_garbage_collector.hh"
#include "combine.hh"
@@ -29,28 +30,40 @@
#include "collection_mutation.hh"
collection_mutation::collection_mutation(const abstract_type& type, collection_mutation_view v)
: _data(v.data) {}
: _data(imr_object_type::make(data::cell::make_collection(v.data), &type.imr_state().lsa_migrator())) {}
collection_mutation::collection_mutation(const abstract_type& type, managed_bytes data)
: _data(std::move(data)) {}
collection_mutation::collection_mutation(const abstract_type& type, const bytes_ostream& data)
: _data(imr_object_type::make(data::cell::make_collection(fragment_range_view(data)), &type.imr_state().lsa_migrator())) {}
static collection_mutation_view get_collection_mutation_view(const uint8_t* ptr)
{
auto f = data::cell::structure::get_member<data::cell::tags::flags>(ptr);
auto ti = data::type_info::make_collection();
data::cell::context ctx(f, ti);
auto view = data::cell::structure::get_member<data::cell::tags::cell>(ptr).as<data::cell::tags::collection>(ctx);
auto dv = data::cell::variable_value::make_view(view, f.get<data::cell::tags::external_data>());
return collection_mutation_view { dv };
}
collection_mutation::operator collection_mutation_view() const
{
return collection_mutation_view{managed_bytes_view(_data)};
return get_collection_mutation_view(_data.get());
}
collection_mutation_view atomic_cell_or_collection::as_collection_mutation() const {
return collection_mutation_view{managed_bytes_view(_data)};
return get_collection_mutation_view(_data.get());
}
bool collection_mutation_view::is_empty() const {
auto in = collection_mutation_input_stream(fragment_range(data));
auto in = collection_mutation_input_stream(data);
auto has_tomb = in.read_trivial<bool>();
return !has_tomb && in.read_trivial<uint32_t>() == 0;
}
bool collection_mutation_view::is_any_live(const abstract_type& type, tombstone tomb, gc_clock::time_point now) const {
auto in = collection_mutation_input_stream(fragment_range(data));
template <typename F>
requires std::is_invocable_r_v<const data::type_info&, F, collection_mutation_input_stream&>
static bool is_any_live(const atomic_cell_value_view& data, tombstone tomb, gc_clock::time_point now, F&& read_cell_type_info) {
auto in = collection_mutation_input_stream(data);
auto has_tomb = in.read_trivial<bool>();
if (has_tomb) {
auto ts = in.read_trivial<api::timestamp_type>();
@@ -60,10 +73,9 @@ bool collection_mutation_view::is_any_live(const abstract_type& type, tombstone
auto nr = in.read_trivial<uint32_t>();
for (uint32_t i = 0; i != nr; ++i) {
auto key_size = in.read_trivial<uint32_t>();
in.skip(key_size);
auto& type_info = read_cell_type_info(in);
auto vsize = in.read_trivial<uint32_t>();
auto value = atomic_cell_view::from_bytes(type, in.read(vsize));
auto value = atomic_cell_view::from_bytes(type_info, in.read(vsize));
if (value.is_live(tomb, now, false)) {
return true;
}
@@ -72,8 +84,33 @@ bool collection_mutation_view::is_any_live(const abstract_type& type, tombstone
return false;
}
api::timestamp_type collection_mutation_view::last_update(const abstract_type& type) const {
auto in = collection_mutation_input_stream(fragment_range(data));
bool collection_mutation_view::is_any_live(const abstract_type& type, tombstone tomb, gc_clock::time_point now) const {
return visit(type, make_visitor(
[&] (const collection_type_impl& ctype) {
auto& type_info = ctype.value_comparator()->imr_state().type_info();
return ::is_any_live(data, tomb, now, [&type_info] (collection_mutation_input_stream& in) -> const data::type_info& {
auto key_size = in.read_trivial<uint32_t>();
in.skip(key_size);
return type_info;
});
},
[&] (const user_type_impl& utype) {
return ::is_any_live(data, tomb, now, [&utype] (collection_mutation_input_stream& in) -> const data::type_info& {
auto key_size = in.read_trivial<uint32_t>();
auto key = in.read(key_size);
return utype.type(deserialize_field_index(key))->imr_state().type_info();
});
},
[&] (const abstract_type& o) -> bool {
throw std::runtime_error(format("collection_mutation_view::is_any_live: unknown type {}", o.name()));
}
));
}
template <typename F>
requires std::is_invocable_r_v<const data::type_info&, F, collection_mutation_input_stream&>
static api::timestamp_type last_update(const atomic_cell_value_view& data, F&& read_cell_type_info) {
auto in = collection_mutation_input_stream(data);
api::timestamp_type max = api::missing_timestamp;
auto has_tomb = in.read_trivial<bool>();
if (has_tomb) {
@@ -83,16 +120,39 @@ api::timestamp_type collection_mutation_view::last_update(const abstract_type& t
auto nr = in.read_trivial<uint32_t>();
for (uint32_t i = 0; i != nr; ++i) {
const auto key_size = in.read_trivial<uint32_t>();
in.skip(key_size);
auto& type_info = read_cell_type_info(in);
auto vsize = in.read_trivial<uint32_t>();
auto value = atomic_cell_view::from_bytes(type, in.read(vsize));
auto value = atomic_cell_view::from_bytes(type_info, in.read(vsize));
max = std::max(value.timestamp(), max);
}
return max;
}
api::timestamp_type collection_mutation_view::last_update(const abstract_type& type) const {
return visit(type, make_visitor(
[&] (const collection_type_impl& ctype) {
auto& type_info = ctype.value_comparator()->imr_state().type_info();
return ::last_update(data, [&type_info] (collection_mutation_input_stream& in) -> const data::type_info& {
auto key_size = in.read_trivial<uint32_t>();
in.skip(key_size);
return type_info;
});
},
[&] (const user_type_impl& utype) {
return ::last_update(data, [&utype] (collection_mutation_input_stream& in) -> const data::type_info& {
auto key_size = in.read_trivial<uint32_t>();
auto key = in.read(key_size);
return utype.type(deserialize_field_index(key))->imr_state().type_info();
});
},
[&] (const abstract_type& o) -> api::timestamp_type {
throw std::runtime_error(format("collection_mutation_view::last_update: unknown type {}", o.name()));
}
));
}
std::ostream& operator<<(std::ostream& os, const collection_mutation_view::printer& cmvp) {
fmt::print(os, "{{collection_mutation_view ");
cmvp._cmv.with_deserialized(cmvp._type, [&os, &type = cmvp._type] (const collection_mutation_view_description& cmvd) {
@@ -218,31 +278,28 @@ static collection_mutation serialize_collection_mutation(
auto size = accumulate(cells, (size_t)4, element_size);
size += 1;
if (tomb) {
size += sizeof(int64_t) + sizeof(int64_t);
size += sizeof(tomb.timestamp) + sizeof(tomb.deletion_time);
}
managed_bytes ret(managed_bytes::initialized_later(), size);
managed_bytes_mutable_view out(ret);
write<uint8_t>(out, uint8_t(bool(tomb)));
bytes_ostream ret;
ret.reserve(size);
auto out = ret.write_begin();
*out++ = bool(tomb);
if (tomb) {
write<int64_t>(out, tomb.timestamp);
write<int64_t>(out, tomb.deletion_time.time_since_epoch().count());
write(out, tomb.timestamp);
write(out, tomb.deletion_time.time_since_epoch().count());
}
auto writek = [&out] (bytes_view v) {
write<int32_t>(out, v.size());
write_fragmented(out, single_fragmented_view(v));
};
auto writev = [&out] (managed_bytes_view v) {
write<int32_t>(out, v.size());
write_fragmented(out, v);
auto writeb = [&out] (bytes_view v) {
serialize_int32(out, v.size());
out = std::copy_n(v.begin(), v.size(), out);
};
// FIXME: overflow?
write<int32_t>(out, boost::distance(cells));
serialize_int32(out, boost::distance(cells));
for (auto&& kv : cells) {
auto&& k = kv.first;
auto&& v = kv.second;
writek(k);
writeb(k);
writev(v.serialize());
writeb(v.serialize());
}
return collection_mutation(type, ret);
}
@@ -391,12 +448,13 @@ deserialize_collection_mutation(const abstract_type& type, collection_mutation_i
return visit(type, make_visitor(
[&] (const collection_type_impl& ctype) {
// value_comparator(), ugh
return deserialize_collection_mutation(in, [&ctype] (collection_mutation_input_stream& in) {
auto& type_info = ctype.value_comparator()->imr_state().type_info();
return deserialize_collection_mutation(in, [&type_info] (collection_mutation_input_stream& in) {
// FIXME: we could probably avoid the need for size
auto ksize = in.read_trivial<uint32_t>();
auto key = in.read(ksize);
auto vsize = in.read_trivial<uint32_t>();
auto value = atomic_cell_view::from_bytes(*ctype.value_comparator(), in.read(vsize));
auto value = atomic_cell_view::from_bytes(type_info, in.read(vsize));
return std::make_pair(key, value);
});
},
@@ -406,7 +464,8 @@ deserialize_collection_mutation(const abstract_type& type, collection_mutation_i
auto ksize = in.read_trivial<uint32_t>();
auto key = in.read(ksize);
auto vsize = in.read_trivial<uint32_t>();
auto value = atomic_cell_view::from_bytes(*utype.type(deserialize_field_index(key)), in.read(vsize));
auto value = atomic_cell_view::from_bytes(
utype.type(deserialize_field_index(key))->imr_state().type_info(), in.read(vsize));
return std::make_pair(key, value);
});
},

View File

@@ -31,6 +31,7 @@
#include <iosfwd>
class abstract_type;
class bytes_ostream;
class compaction_garbage_collector;
class row_tombstone;
@@ -69,7 +70,7 @@ struct collection_mutation_view_description {
collection_mutation serialize(const abstract_type&) const;
};
using collection_mutation_input_stream = utils::linearizing_input_stream<fragment_range<managed_bytes_view>, marshal_exception>;
using collection_mutation_input_stream = utils::linearizing_input_stream<atomic_cell_value_view, marshal_exception>;
// Given a linearized collection_mutation_view, returns an auxiliary struct allowing the inspection of each cell.
// The struct is an observer of the data given by the collection_mutation_view and is only valid while the
@@ -79,7 +80,7 @@ collection_mutation_view_description deserialize_collection_mutation(const abstr
class collection_mutation_view {
public:
managed_bytes_view data;
atomic_cell_value_view data;
// Is this a noop mutation?
bool is_empty() const;
@@ -96,7 +97,7 @@ public:
// calls it on the corresponding description of `this`.
template <typename F>
inline decltype(auto) with_deserialized(const abstract_type& type, F f) const {
auto stream = collection_mutation_input_stream(fragment_range(data));
auto stream = collection_mutation_input_stream(data);
return f(deserialize_collection_mutation(type, stream));
}
@@ -121,11 +122,12 @@ public:
// The mutation may also contain a collection-wide tombstone.
class collection_mutation {
public:
managed_bytes _data;
using imr_object_type = imr::utils::object<data::cell::structure>;
imr_object_type _data;
collection_mutation() {}
collection_mutation(const abstract_type&, collection_mutation_view);
collection_mutation(const abstract_type&, managed_bytes);
collection_mutation(const abstract_type& type, const bytes_ostream& data);
operator collection_mutation_view() const;
};

View File

@@ -43,7 +43,7 @@ public:
const ::schema& schema() const {
return *_schema;
}
friend std::strong_ordering tri_compare(const compatible_ring_position_view& x, const compatible_ring_position_view& y) {
friend int tri_compare(const compatible_ring_position_view& x, const compatible_ring_position_view& y) {
return dht::ring_position_tri_compare(*x._schema, *x._rpv, *y._rpv);
}
friend bool operator<(const compatible_ring_position_view& x, const compatible_ring_position_view& y) {
@@ -83,7 +83,7 @@ public:
const ::schema& schema() const {
return *_schema;
}
friend std::strong_ordering tri_compare(const compatible_ring_position& x, const compatible_ring_position& y) {
friend int tri_compare(const compatible_ring_position& x, const compatible_ring_position& y) {
return dht::ring_position_tri_compare(*x._schema, *x._rp, *y._rp);
}
friend bool operator<(const compatible_ring_position& x, const compatible_ring_position& y) {
@@ -133,7 +133,7 @@ public:
};
return std::visit(rpv_accessor{}, *_crp_or_view);
}
friend std::strong_ordering tri_compare(const compatible_ring_position_or_view& x, const compatible_ring_position_or_view& y) {
friend int tri_compare(const compatible_ring_position_or_view& x, const compatible_ring_position_or_view& y) {
struct schema_accessor {
const ::schema& operator()(const compatible_ring_position& crp) {
return crp.schema();

View File

@@ -126,21 +126,18 @@ def ensure_tmp_dir_exists():
os.makedirs(tempfile.tempdir)
def try_compile_and_link(compiler, source='', flags=[], verbose=False):
def try_compile_and_link(compiler, source='', flags=[]):
ensure_tmp_dir_exists()
with tempfile.NamedTemporaryFile() as sfile:
ofile = tempfile.mktemp()
try:
sfile.file.write(bytes(source, 'utf-8'))
sfile.file.flush()
ret = subprocess.run([compiler, '-x', 'c++', '-o', ofile, sfile.name] + args.user_cflags.split() + flags,
capture_output=True)
if verbose:
print(f"Compilation failed: {compiler} -x c++ -o {ofile} {sfile.name} {args.user_cflags} {flags}")
print(source)
print(ret.stdout.decode('utf-8'))
print(ret.stderr.decode('utf-8'))
return ret.returncode == 0
# We can't write to /dev/null, since in some cases (-ftest-coverage) gcc will create an auxiliary
# output file based on the name of the output file, and "/dev/null.gcsa" is not a good name
return subprocess.call([compiler, '-x', 'c++', '-o', ofile, sfile.name] + args.user_cflags.split() + flags,
stdout=subprocess.DEVNULL,
stderr=subprocess.DEVNULL) == 0
finally:
if os.path.exists(ofile):
os.unlink(ofile)
@@ -167,21 +164,7 @@ def linker_flags(compiler):
link_flags.append(threads_flag)
return ' '.join(link_flags)
else:
linker = ''
try:
subprocess.call(["gold", "-v"], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
linker = 'gold'
except:
pass
try:
subprocess.call(["lld", "-v"], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
linker = 'lld'
except:
pass
if linker:
print(f'Linker {linker} found, but the compilation attempt failed, defaulting to default system linker')
else:
print('Note: neither lld nor gold found; using default system linker')
print('Note: neither lld nor gold found; using default system linker')
return ''
@@ -275,25 +258,21 @@ modes = {
'cxxflags': '-DDEBUG -DSANITIZE -DDEBUG_LSA_SANITIZER -DSCYLLA_ENABLE_ERROR_INJECTION',
'cxx_ld_flags': '',
'stack-usage-threshold': 1024*40,
'optimization-level': 'g',
},
'release': {
'cxxflags': '-ffunction-sections -fdata-sections ',
'cxxflags': '-O3 -ffunction-sections -fdata-sections ',
'cxx_ld_flags': '-Wl,--gc-sections',
'stack-usage-threshold': 1024*13,
'optimization-level': '3',
},
'dev': {
'cxxflags': '-DDEVEL -DSEASTAR_ENABLE_ALLOC_FAILURE_INJECTION -DSCYLLA_ENABLE_ERROR_INJECTION',
'cxxflags': '-O1 -DDEVEL -DSEASTAR_ENABLE_ALLOC_FAILURE_INJECTION -DSCYLLA_ENABLE_ERROR_INJECTION',
'cxx_ld_flags': '',
'stack-usage-threshold': 1024*21,
'optimization-level': '2',
},
'sanitize': {
'cxxflags': '-DDEBUG -DSANITIZE -DDEBUG_LSA_SANITIZER -DSCYLLA_ENABLE_ERROR_INJECTION',
'cxxflags': '-Os -DDEBUG -DSANITIZE -DDEBUG_LSA_SANITIZER -DSCYLLA_ENABLE_ERROR_INJECTION',
'cxx_ld_flags': '',
'stack-usage-threshold': 1024*50,
'optimization-level': 's',
}
}
@@ -356,6 +335,7 @@ scylla_tests = set([
'test/boost/hash_test',
'test/boost/hashers_test',
'test/boost/idl_test',
'test/boost/imr_test',
'test/boost/input_stream_test',
'test/boost/json_cql_query_test',
'test/boost/json_test',
@@ -373,6 +353,7 @@ scylla_tests = set([
'test/boost/intrusive_array_test',
'test/boost/map_difference_test',
'test/boost/memtable_test',
'test/boost/meta_test',
'test/boost/multishard_mutation_query_test',
'test/boost/murmur_hash_test',
'test/boost/mutation_fragment_test',
@@ -397,7 +378,6 @@ scylla_tests = set([
'test/boost/schema_change_test',
'test/boost/schema_registry_test',
'test/boost/secondary_index_test',
'test/boost/tracing',
'test/boost/index_with_paging_test',
'test/boost/serialization_test',
'test/boost/serialized_action_test',
@@ -412,7 +392,6 @@ scylla_tests = set([
'test/boost/sstable_directory_test',
'test/boost/sstable_test',
'test/boost/sstable_move_test',
'test/boost/statement_restrictions_test',
'test/boost/storage_proxy_test',
'test/boost/top_k_test',
'test/boost/transport_test',
@@ -428,13 +407,9 @@ scylla_tests = set([
'test/boost/vint_serialization_test',
'test/boost/virtual_reader_test',
'test/boost/bptree_test',
'test/boost/btree_test',
'test/boost/radix_tree_test',
'test/boost/double_decker_test',
'test/boost/stall_free_test',
'test/boost/raft_address_map_test',
'test/boost/raft_sys_table_storage_test',
'test/boost/sstable_set_test',
'test/boost/imr_test',
'test/manual/ec2_snitch_test',
'test/manual/enormous_table_scan_test',
'test/manual/gce_snitch_test',
@@ -453,7 +428,6 @@ scylla_tests = set([
'test/perf/perf_mutation',
'test/perf/perf_collection',
'test/perf/perf_row_cache_update',
'test/perf/perf_row_cache_reads',
'test/perf/perf_simple_query',
'test/perf/perf_sstable',
'test/unit/lsa_async_eviction_test',
@@ -461,11 +435,7 @@ scylla_tests = set([
'test/unit/row_cache_alloc_stress_test',
'test/unit/row_cache_stress_test',
'test/unit/bptree_stress_test',
'test/unit/btree_stress_test',
'test/unit/bptree_compaction_test',
'test/unit/btree_compaction_test',
'test/unit/radix_tree_stress_test',
'test/unit/radix_tree_compaction_test',
])
perf_tests = set([
@@ -479,8 +449,7 @@ perf_tests = set([
raft_tests = set([
'test/raft/replication_test',
'test/raft/fsm_test',
'test/raft/etcd_test',
'test/boost/raft_fsm_test',
])
apps = set([
@@ -527,8 +496,6 @@ arg_parser.add_argument('--dpdk-target', action='store', dest='dpdk_target', def
help='Path to DPDK SDK target location (e.g. <DPDK SDK dir>/x86_64-native-linuxapp-gcc)')
arg_parser.add_argument('--debuginfo', action='store', dest='debuginfo', type=int, default=1,
help='Enable(1)/disable(0)compiler debug information generation')
arg_parser.add_argument('--optimization-level', action='append', dest='mode_o_levels', metavar='MODE=LEVEL', default=[],
help=f'Override default compiler optimization level for mode (defaults: {" ".join([x+"="+modes[x]["optimization-level"] for x in modes])})')
arg_parser.add_argument('--static-stdc++', dest='staticcxx', action='store_true',
help='Link libgcc and libstdc++ statically')
arg_parser.add_argument('--static-thrift', dest='staticthrift', action='store_true',
@@ -551,34 +518,25 @@ arg_parser.add_argument('--with-antlr3', dest='antlr3_exec', action='store', def
help='path to antlr3 executable')
arg_parser.add_argument('--with-ragel', dest='ragel_exec', action='store', default='ragel',
help='path to ragel executable')
arg_parser.add_argument('--build-raft', dest='build_raft', action='store_true', default=False,
help='build raft code')
add_tristate(arg_parser, name='stack-guards', dest='stack_guards', help='Use stack guards')
arg_parser.add_argument('--verbose', dest='verbose', action='store_true',
help='Make configure.py output more verbose (useful for debugging the build process itself)')
arg_parser.add_argument('--test-repeat', dest='test_repeat', action='store', type=str, default='1',
help='Set number of times to repeat each unittest.')
arg_parser.add_argument('--test-timeout', dest='test_timeout', action='store', type=str, default='7200')
arg_parser.add_argument('--clang-inline-threshold', action='store', type=int, dest='clang_inline_threshold', default=-1,
help="LLVM-specific inline threshold compilation parameter")
args = arg_parser.parse_args()
if not args.build_raft:
all_artifacts.difference_update(raft_tests)
tests.difference_update(raft_tests)
defines = ['XXH_PRIVATE_API',
'SEASTAR_TESTING_MAIN',
]
extra_cxxflags = {
'debug': {},
'dev': {},
'release': {},
'sanitize': {}
}
scylla_raft_core = [
'raft/raft.cc',
'raft/server.cc',
'raft/fsm.cc',
'raft/tracker.cc',
'raft/log.cc',
]
extra_cxxflags = {}
scylla_core = (['database.cc',
'absl-flat_hash_map.cc',
@@ -621,16 +579,15 @@ scylla_core = (['database.cc',
'counters.cc',
'compress.cc',
'zstd.cc',
'sstables/mp_row_consumer.cc',
'sstables/sstables.cc',
'sstables/sstables_manager.cc',
'sstables/sstable_set.cc',
'sstables/mx/reader.cc',
'sstables/mx/writer.cc',
'sstables/kl/reader.cc',
'sstables/kl/writer.cc',
'sstables/sstable_version.cc',
'sstables/compress.cc',
'sstables/sstable_mutation_reader.cc',
'sstables/partition.cc',
'sstables/compaction.cc',
'sstables/compaction_strategy.cc',
'sstables/size_tiered_compaction_strategy.cc',
@@ -887,6 +844,7 @@ scylla_core = (['database.cc',
'vint-serialization.cc',
'utils/arch/powerpc/crc32-vpmsum/crc32_wrapper.cc',
'querier.cc',
'data/cell.cc',
'mutation_writer/multishard_writer.cc',
'multishard_mutation_query.cc',
'reader_concurrency_semaphore.cc',
@@ -899,14 +857,7 @@ scylla_core = (['database.cc',
'mutation_writer/shard_based_splitting_writer.cc',
'mutation_writer/feed_writers.cc',
'lua.cc',
'service/raft/schema_raft_state_machine.cc',
'service/raft/raft_sys_table_storage.cc',
'serializer.cc',
'service/raft/raft_rpc.cc',
'service/raft/raft_gossip_failure_detector.cc',
'service/raft/raft_services.cc',
] + [Antlr3Grammar('cql3/Cql.g')] + [Thrift('interface/cassandra.thrift', 'Cassandra')] \
+ scylla_raft_core
] + [Antlr3Grammar('cql3/Cql.g')] + [Thrift('interface/cassandra.thrift', 'Cassandra')]
)
api = ['api/api.cc',
@@ -1002,7 +953,6 @@ idls = ['idl/gossip_digest.idl.hh',
'idl/view.idl.hh',
'idl/messaging_service.idl.hh',
'idl/paxos.idl.hh',
'idl/raft.idl.hh',
]
headers = find_headers('.', excluded_dirs=['idl', 'build', 'seastar', '.git'])
@@ -1027,7 +977,14 @@ scylla_tests_dependencies = scylla_core + idls + scylla_tests_generic_dependenci
'test/lib/random_schema.cc',
]
scylla_raft_dependencies = scylla_raft_core + ['utils/uuid.cc']
scylla_raft_dependencies = [
'raft/raft.cc',
'raft/server.cc',
'raft/fsm.cc',
'raft/progress.cc',
'raft/log.cc',
'utils/uuid.cc'
]
deps = {
'scylla': idls + ['main.cc', 'release.cc', 'utils/build_id.cc'] + scylla_core + api + alternator + redis,
@@ -1061,6 +1018,7 @@ pure_boost_tests = set([
'test/boost/like_matcher_test',
'test/boost/linearizing_input_stream_test',
'test/boost/map_difference_test',
'test/boost/meta_test',
'test/boost/nonwrapping_range_test',
'test/boost/observable_test',
'test/boost/range_test',
@@ -1071,7 +1029,6 @@ pure_boost_tests = set([
'test/boost/vint_serialization_test',
'test/boost/bptree_test',
'test/boost/utf8_test',
'test/boost/btree_test',
'test/manual/streaming_histogram_test',
])
@@ -1091,11 +1048,7 @@ tests_not_using_seastar_test_framework = set([
'test/unit/lsa_sync_eviction_test',
'test/unit/row_cache_alloc_stress_test',
'test/unit/bptree_stress_test',
'test/unit/btree_stress_test',
'test/unit/bptree_compaction_test',
'test/unit/btree_compaction_test',
'test/unit/radix_tree_stress_test',
'test/unit/radix_tree_compaction_test',
'test/manual/sstable_scan_footprint_test',
]) | pure_boost_tests
@@ -1138,6 +1091,8 @@ deps['test/boost/estimated_histogram_test'] = ['test/boost/estimated_histogram_t
deps['test/boost/anchorless_list_test'] = ['test/boost/anchorless_list_test.cc']
deps['test/perf/perf_fast_forward'] += ['release.cc']
deps['test/perf/perf_simple_query'] += ['release.cc']
deps['test/boost/meta_test'] = ['test/boost/meta_test.cc']
deps['test/boost/imr_test'] = ['test/boost/imr_test.cc', 'utils/logalloc.cc', 'utils/dynamic_bitset.cc']
deps['test/boost/reusable_buffer_test'] = [
"test/boost/reusable_buffer_test.cc",
"test/lib/log.cc",
@@ -1155,8 +1110,7 @@ deps['test/boost/duration_test'] += ['test/lib/exception_utils.cc']
deps['test/boost/alternator_unit_test'] += ['alternator/base64.cc']
deps['test/raft/replication_test'] = ['test/raft/replication_test.cc'] + scylla_raft_dependencies
deps['test/raft/fsm_test'] = ['test/raft/fsm_test.cc', 'test/lib/log.cc'] + scylla_raft_dependencies
deps['test/raft/etcd_test'] = ['test/raft/etcd_test.cc', 'test/lib/log.cc'] + scylla_raft_dependencies
deps['test/boost/raft_fsm_test'] = ['test/boost/raft_fsm_test.cc', 'test/lib/log.cc'] + scylla_raft_dependencies
deps['utils/gz/gen_crc_combine_table'] = ['utils/gz/gen_crc_combine_table.cc']
@@ -1212,18 +1166,9 @@ warnings = [w
warnings = ' '.join(warnings + ['-Wno-error=deprecated-declarations'])
def clang_inline_threshold():
if args.clang_inline_threshold != -1:
return args.clang_inline_threshold
elif platform.machine() == 'aarch64':
# we see miscompiles with 1200 and above with format("{}", uuid)
return 600
else:
return 2500
optimization_flags = [
'--param inline-unit-growth=300', # gcc
f'-mllvm -inline-threshold={clang_inline_threshold()}', # clang
'-mllvm -inline-threshold=2500', # clang
]
optimization_flags = [o
for o in optimization_flags
@@ -1234,15 +1179,6 @@ if flag_supported(flag='-Wstack-usage=4096', compiler=args.cxx):
for mode in modes:
modes[mode]['cxxflags'] += f' -Wstack-usage={modes[mode]["stack-usage-threshold"]} -Wno-error=stack-usage='
for mode_level in args.mode_o_levels:
( mode, level ) = mode_level.split('=', 2)
if mode not in modes:
raise Exception(f'Mode {mode} is missing, cannot configure optimization level for it')
modes[mode]['optimization-level'] = level
for mode in modes:
modes[mode]['cxxflags'] += f' -O{modes[mode]["optimization-level"]}'
linker_flags = linker_flags(compiler=args.cxx)
dbgflag = '-g -gz' if args.debuginfo else ''
@@ -1308,8 +1244,7 @@ compiler_test_src = '''
int main() { return 0; }
'''
if not try_compile_and_link(compiler=args.cxx, source=compiler_test_src):
try_compile_and_link(compiler=args.cxx, source=compiler_test_src, verbose=True)
print('Wrong compiler version or incorrect flags. Scylla needs GCC >= 10.1.1 with coroutines (-fcoroutines) or clang >= 10.0.0 to compile.')
print('Wrong GCC version. Scylla needs GCC >= 10.1.1 to compile.')
sys.exit(1)
if not try_compile(compiler=args.cxx, source='#include <boost/version.hpp>'):
@@ -1360,9 +1295,7 @@ scylla_release = file.read().strip()
file = open(f'{outdir}/SCYLLA-PRODUCT-FILE', 'r')
scylla_product = file.read().strip()
for m in ['debug', 'release', 'sanitize', 'dev']:
cxxflags = "-DSCYLLA_VERSION=\"\\\"" + scylla_version + "\\\"\" -DSCYLLA_RELEASE=\"\\\"" + scylla_release + "\\\"\" -DSCYLLA_BUILD_MODE=\"\\\"" + m + "\\\"\""
extra_cxxflags[m]["release.cc"] = cxxflags
extra_cxxflags["release.cc"] = "-DSCYLLA_VERSION=\"\\\"" + scylla_version + "\\\"\" -DSCYLLA_RELEASE=\"\\\"" + scylla_release + "\\\"\""
for m in ['debug', 'release', 'sanitize']:
modes[m]['cxxflags'] += ' ' + dbgflag
@@ -1516,7 +1449,6 @@ abseil_libs = ['absl/' + lib for lib in [
'numeric/libabsl_int128.a',
'hash/libabsl_city.a',
'hash/libabsl_hash.a',
'hash/libabsl_wyhash.a',
'base/libabsl_malloc_internal.a',
'base/libabsl_spinlock_wait.a',
'base/libabsl_base.a',
@@ -1537,6 +1469,9 @@ libs = ' '.join([maybe_static(args.staticyamlcpp, '-lyaml-cpp'), '-latomic', '-l
if not args.staticboost:
args.user_cflags += ' -DBOOST_TEST_DYN_LINK'
if build_raft:
args.user_cflags += ' -DENABLE_SCYLLA_RAFT'
# thrift version detection, see #4538
proc_res = subprocess.run(["thrift", "-version"], stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
proc_res_output = proc_res.stdout.decode("utf-8")
@@ -1817,8 +1752,8 @@ with open(buildfile_tmp, 'w') as f:
for obj in compiles:
src = compiles[obj]
f.write('build {}: cxx.{} {} || {} {}\n'.format(obj, mode, src, seastar_dep, gen_headers_dep))
if src in extra_cxxflags[mode]:
f.write(' cxxflags = {seastar_cflags} $cxxflags $cxxflags_{mode} {extra_cxxflags}\n'.format(mode=mode, extra_cxxflags=extra_cxxflags[mode][src], **modeval))
if src in extra_cxxflags:
f.write(' cxxflags = {seastar_cflags} $cxxflags $cxxflags_{mode} {extra_cxxflags}\n'.format(mode=mode, extra_cxxflags=extra_cxxflags[src], **modeval))
for swagger in swaggers:
hh = swagger.headers(gen_dir)[0]
cc = swagger.sources(gen_dir)[0]
@@ -1883,7 +1818,7 @@ with open(buildfile_tmp, 'w') as f:
f.write(f'build dist-server-{mode}: phony $builddir/dist/{mode}/redhat $builddir/dist/{mode}/debian\n')
f.write(f'build dist-jmx-{mode}: phony $builddir/{mode}/dist/tar/{scylla_product}-jmx-package.tar.gz dist-jmx-rpm dist-jmx-deb\n')
f.write(f'build dist-tools-{mode}: phony $builddir/{mode}/dist/tar/{scylla_product}-tools-package.tar.gz dist-tools-rpm dist-tools-deb\n')
f.write(f'build dist-python3-{mode}: phony dist-python3-tar dist-python3-rpm dist-python3-deb\n')
f.write(f'build dist-python3-{mode}: phony dist-python3-tar dist-python3-rpm dist-python3-deb compat-python3-rpm compat-python3-deb\n')
f.write(f'build dist-unified-{mode}: phony $builddir/{mode}/dist/tar/{scylla_product}-unified-package-{scylla_version}.{scylla_release}.tar.gz\n')
f.write(f'build $builddir/{mode}/dist/tar/{scylla_product}-unified-package-{scylla_version}.{scylla_release}.tar.gz: unified $builddir/{mode}/dist/tar/{scylla_product}-package.tar.gz $builddir/{mode}/dist/tar/{scylla_product}-python3-package.tar.gz $builddir/{mode}/dist/tar/{scylla_product}-jmx-package.tar.gz $builddir/{mode}/dist/tar/{scylla_product}-tools-package.tar.gz | always\n')
f.write(f' mode = {mode}\n')
@@ -1949,6 +1884,22 @@ with open(buildfile_tmp, 'w') as f:
build dist-tools-tar: phony {' '.join(['$builddir/{mode}/dist/tar/{scylla_product}-tools-package.tar.gz'.format(mode=mode, scylla_product=scylla_product) for mode in build_modes])}
build dist-tools: phony dist-tools-tar dist-tools-rpm dist-tools-deb
rule compat-python3-reloc
command = mkdir -p $builddir/release && ln -f $dir/$artifact $builddir/release/
rule compat-python3-rpm
command = cd $dir && ./reloc/build_rpm.sh --reloc-pkg $artifact --builddir ../../build/redhat
rule compat-python3-deb
command = cd $dir && ./reloc/build_deb.sh --reloc-pkg $artifact --builddir ../../build/debian
build $builddir/release/{scylla_product}-python3-package.tar.gz: compat-python3-reloc tools/python3/build/{scylla_product}-python3-package.tar.gz
dir = tools/python3
artifact = $builddir/{scylla_product}-python3-package.tar.gz
build compat-python3-rpm: compat-python3-rpm tools/python3/build/{scylla_product}-python3-package.tar.gz
dir = tools/python3
artifact = $builddir/{scylla_product}-python3-package.tar.gz
build compat-python3-deb: compat-python3-deb tools/python3/build/{scylla_product}-python3-package.tar.gz
dir = tools/python3
artifact = $builddir/{scylla_product}-python3-package.tar.gz
build tools/python3/build/{scylla_product}-python3-package.tar.gz: build-submodule-reloc
reloc_dir = tools/python3
args = --packages "{python3_dependencies}"
@@ -1959,7 +1910,7 @@ with open(buildfile_tmp, 'w') as f:
dir = tools/python3
artifact = $builddir/{scylla_product}-python3-package.tar.gz
build dist-python3-tar: phony {' '.join(['$builddir/{mode}/dist/tar/{scylla_product}-python3-package.tar.gz'.format(mode=mode, scylla_product=scylla_product) for mode in build_modes])}
build dist-python3: phony dist-python3-tar dist-python3-rpm dist-python3-deb $builddir/release/{scylla_product}-python3-package.tar.gz
build dist-python3: phony dist-python3-tar dist-python3-rpm dist-python3-deb $builddir/release/{scylla_product}-python3-package.tar.gz compat-python3-rpm compat-python3-deb
build dist-deb: phony dist-server-deb dist-python3-deb dist-jmx-deb dist-tools-deb
build dist-rpm: phony dist-server-rpm dist-python3-rpm dist-jmx-rpm dist-tools-rpm
build dist-tar: phony dist-unified-tar dist-server-tar dist-python3-tar dist-jmx-tar dist-tools-tar

View File

@@ -36,9 +36,9 @@ converting_mutation_partition_applier::upgrade_cell(const abstract_type& new_typ
atomic_cell::collection_member cm) {
if (cell.is_live() && !old_type.is_counter()) {
if (cell.is_live_and_has_ttl()) {
return atomic_cell::make_live(new_type, cell.timestamp(), cell.value(), cell.expiry(), cell.ttl(), cm);
return atomic_cell::make_live(new_type, cell.timestamp(), cell.value().linearize(), cell.expiry(), cell.ttl(), cm);
}
return atomic_cell::make_live(new_type, cell.timestamp(), cell.value(), cm);
return atomic_cell::make_live(new_type, cell.timestamp(), cell.value().linearize(), cm);
} else {
return atomic_cell(new_type, cell);
}

View File

@@ -118,14 +118,16 @@ void counter_cell_view::apply(const column_definition& cdef, atomic_cell_or_coll
assert(!dst_ac.is_counter_update());
assert(!src_ac.is_counter_update());
with_linearized(dst_ac, [&] (counter_cell_view dst_ccv) {
with_linearized(src_ac, [&] (counter_cell_view src_ccv) {
auto src_ccv = counter_cell_view(src_ac);
auto dst_ccv = counter_cell_view(dst_ac);
if (dst_ccv.shard_count() >= src_ccv.shard_count()) {
auto dst_amc = dst.as_mutable_atomic_cell(cdef);
auto src_amc = src.as_mutable_atomic_cell(cdef);
if (apply_in_place(cdef, dst_amc, src_amc)) {
return;
if (!dst_amc.is_value_fragmented() && !src_amc.is_value_fragmented()) {
if (apply_in_place(cdef, dst_amc, src_amc)) {
return;
}
}
}
@@ -140,6 +142,8 @@ void counter_cell_view::apply(const column_definition& cdef, atomic_cell_or_coll
auto cell = result.build(std::max(dst_ac.timestamp(), src_ac.timestamp()));
src = std::exchange(dst, atomic_cell_or_collection(std::move(cell)));
});
});
}
std::optional<atomic_cell> counter_cell_view::difference(atomic_cell_view a, atomic_cell_view b)
@@ -154,8 +158,8 @@ std::optional<atomic_cell> counter_cell_view::difference(atomic_cell_view a, ato
return { };
}
auto a_ccv = counter_cell_view(a);
auto b_ccv = counter_cell_view(b);
return with_linearized(a, [&] (counter_cell_view a_ccv) {
return with_linearized(b, [&] (counter_cell_view b_ccv) {
auto a_shards = a_ccv.shards();
auto b_shards = b_ccv.shards();
@@ -182,6 +186,8 @@ std::optional<atomic_cell> counter_cell_view::difference(atomic_cell_view a, ato
diff = atomic_cell::make_live(*counter_type, a.timestamp(), bytes_view());
}
return diff;
});
});
}
@@ -219,13 +225,14 @@ void transform_counter_updates_to_shards(mutation& m, const mutation* current_st
if (!acv.is_live()) {
return; // continue -- we are in lambda
}
auto ccv = counter_cell_view(acv);
counter_cell_view::with_linearized(acv, [&] (counter_cell_view ccv) {
auto cs = ccv.get_shard(counter_id(local_id));
if (!cs) {
return; // continue
}
shards.emplace_back(std::make_pair(id, counter_shard(*cs)));
});
});
transformee.for_each_cell([&] (column_id id, atomic_cell_or_collection& ac_o_c) {
auto& cdef = s.column_at(kind, id);

View File

@@ -81,20 +81,21 @@ class basic_counter_shard_view {
total_size = unsigned(logical_clock) + sizeof(int64_t),
};
private:
managed_bytes_basic_view<is_mutable> _base;
using pointer_type = std::conditional_t<is_mutable == mutable_view::no, const signed char*, signed char*>;
pointer_type _base;
private:
template<typename T>
T read(offset off) const {
auto v = _base;
v.remove_prefix(size_t(off));
return read_simple_native<T>(v);
T value;
std::copy_n(_base + static_cast<unsigned>(off), sizeof(T), reinterpret_cast<signed char*>(&value));
return value;
}
public:
static constexpr auto size = size_t(offset::total_size);
public:
basic_counter_shard_view() = default;
explicit basic_counter_shard_view(managed_bytes_basic_view<is_mutable> v) noexcept
: _base(v) { }
explicit basic_counter_shard_view(pointer_type ptr) noexcept
: _base(ptr) { }
counter_id id() const { return read<counter_id>(offset::id); }
int64_t value() const { return read<int64_t>(offset::value); }
@@ -105,24 +106,15 @@ public:
static constexpr size_t size = size_t(offset::total_size) - off;
signed char tmp[size];
auto tmp_view = single_fragmented_mutable_view(bytes_mutable_view(std::data(tmp), std::size(tmp)));
managed_bytes_mutable_view this_view = _base.substr(off, size);
managed_bytes_mutable_view other_view = other._base.substr(off, size);
copy_fragmented_view(tmp_view, this_view);
copy_fragmented_view(this_view, other_view);
copy_fragmented_view(other_view, tmp_view);
std::copy_n(_base + off, size, tmp);
std::copy_n(other._base + off, size, _base + off);
std::copy_n(tmp, size, other._base + off);
}
void set_value_and_clock(const basic_counter_shard_view& other) noexcept {
static constexpr size_t off = size_t(offset::value);
static constexpr size_t size = size_t(offset::total_size) - off;
managed_bytes_mutable_view this_view = _base.substr(off, size);
managed_bytes_mutable_view other_view = other._base.substr(off, size);
copy_fragmented_view(this_view, other_view);
std::copy_n(other._base + off, size, _base + off);
}
bool operator==(const basic_counter_shard_view& other) const {
@@ -148,6 +140,11 @@ class counter_shard {
counter_id _id;
int64_t _value;
int64_t _logical_clock;
private:
template<typename T>
static void write(const T& value, bytes::iterator& out) {
out = std::copy_n(reinterpret_cast<const signed char*>(&value), sizeof(T), out);
}
private:
// Shared logic for applying counter_shards and counter_shard_views.
// T is either counter_shard or basic_counter_shard_view<U>.
@@ -198,10 +195,10 @@ public:
static constexpr size_t serialized_size() {
return counter_shard_view::size;
}
void serialize(atomic_cell_value_mutable_view& out) const {
write_native<counter_id>(out, _id);
write_native<int64_t>(out, _value);
write_native<int64_t>(out, _logical_clock);
void serialize(bytes::iterator& out) const {
write(_id, out);
write(_value, out);
write(_logical_clock, out);
}
};
@@ -238,7 +235,7 @@ public:
size_t serialized_size() const {
return _shards.size() * counter_shard::serialized_size();
}
void serialize(atomic_cell_value_mutable_view& out) const {
void serialize(bytes::iterator& out) const {
for (auto&& cs : _shards) {
cs.serialize(out);
}
@@ -249,18 +246,31 @@ public:
}
atomic_cell build(api::timestamp_type timestamp) const {
// If we can assume that the counter shards never cross fragment boundaries
// the serialisation code gets much simpler.
static_assert(data::cell::maximum_external_chunk_length % counter_shard::serialized_size() == 0);
auto ac = atomic_cell::make_live_uninitialized(*counter_type, timestamp, serialized_size());
auto dst = ac.value();
auto dst_it = ac.value().begin();
auto dst_current = *dst_it++;
for (auto&& cs : _shards) {
cs.serialize(dst);
if (dst_current.empty()) {
dst_current = *dst_it++;
}
assert(!dst_current.empty());
auto value_dst = dst_current.data();
cs.serialize(value_dst);
dst_current.remove_prefix(counter_shard::serialized_size());
}
return ac;
}
static atomic_cell from_single_shard(api::timestamp_type timestamp, const counter_shard& cs) {
// We don't really need to bother with fragmentation here.
static_assert(data::cell::maximum_external_chunk_length >= counter_shard::serialized_size());
auto ac = atomic_cell::make_live_uninitialized(*counter_type, timestamp, counter_shard::serialized_size());
auto dst = ac.value();
auto dst = ac.value().first_fragment().begin();
cs.serialize(dst);
return ac;
}
@@ -299,7 +309,12 @@ public:
template<mutable_view is_mutable>
class basic_counter_cell_view {
protected:
using linearized_value_view = std::conditional_t<is_mutable == mutable_view::no,
bytes_view, bytes_mutable_view>;
using pointer_type = std::conditional_t<is_mutable == mutable_view::no,
bytes_view::const_pointer, bytes_mutable_view::pointer>;
basic_atomic_cell_view<is_mutable> _cell;
linearized_value_view _value;
private:
class shard_iterator {
public:
@@ -309,12 +324,12 @@ private:
using pointer = basic_counter_shard_view<is_mutable>*;
using reference = basic_counter_shard_view<is_mutable>&;
private:
managed_bytes_basic_view<is_mutable> _current;
pointer_type _current;
basic_counter_shard_view<is_mutable> _current_view;
size_t _pos = 0;
public:
shard_iterator(managed_bytes_basic_view<is_mutable> v, size_t offset) noexcept
: _current(v), _current_view(_current), _pos(offset) { }
shard_iterator() = default;
shard_iterator(pointer_type ptr) noexcept
: _current(ptr), _current_view(ptr) { }
basic_counter_shard_view<is_mutable>& operator*() noexcept {
return _current_view;
@@ -323,8 +338,8 @@ private:
return &_current_view;
}
shard_iterator& operator++() noexcept {
_pos += counter_shard_view::size;
_current_view = basic_counter_shard_view<is_mutable>(_current.substr(_pos, counter_shard_view::size));
_current += counter_shard_view::size;
_current_view = basic_counter_shard_view<is_mutable>(_current);
return *this;
}
shard_iterator operator++(int) noexcept {
@@ -333,8 +348,8 @@ private:
return it;
}
shard_iterator& operator--() noexcept {
_pos -= counter_shard_view::size;
_current_view = basic_counter_shard_view<is_mutable>(_current.substr(_pos, counter_shard_view::size));
_current -= counter_shard_view::size;
_current_view = basic_counter_shard_view<is_mutable>(_current);
return *this;
}
shard_iterator operator--(int) noexcept {
@@ -343,29 +358,31 @@ private:
return it;
}
bool operator==(const shard_iterator& other) const noexcept {
return _pos == other._pos;
return _current == other._current;
}
bool operator!=(const shard_iterator& other) const noexcept {
return !(*this == other);
}
};
public:
boost::iterator_range<shard_iterator> shards() const {
auto value = _cell.value();
auto begin = shard_iterator(value, 0);
auto end = shard_iterator(value, value.size());
auto begin = shard_iterator(_value.data());
auto end = shard_iterator(_value.data() + _value.size());
return boost::make_iterator_range(begin, end);
}
size_t shard_count() const {
return _cell.value().size() / counter_shard_view::size;
return _cell.value().size_bytes() / counter_shard_view::size;
}
public:
protected:
// ac must be a live counter cell
explicit basic_counter_cell_view(basic_atomic_cell_view<is_mutable> ac) noexcept
: _cell(ac)
explicit basic_counter_cell_view(basic_atomic_cell_view<is_mutable> ac, linearized_value_view vv) noexcept
: _cell(ac), _value(vv)
{
assert(_cell.is_live());
assert(!_cell.is_counter_update());
}
public:
api::timestamp_type timestamp() const { return _cell.timestamp(); }
static data_type total_value_type() { return long_type; }
@@ -394,6 +411,14 @@ public:
struct counter_cell_view : basic_counter_cell_view<mutable_view::no> {
using basic_counter_cell_view::basic_counter_cell_view;
template<typename Function>
static decltype(auto) with_linearized(basic_atomic_cell_view<mutable_view::no> ac, Function&& fn) {
return ac.value().with_linearized([&] (bytes_view value_view) {
counter_cell_view ccv(ac, value_view);
return fn(ccv);
});
}
// Reversibly applies two counter cells, at least one of them must be live.
static void apply(const column_definition& cdef, atomic_cell_or_collection& dst, atomic_cell_or_collection& src);
@@ -408,8 +433,9 @@ struct counter_cell_mutable_view : basic_counter_cell_view<mutable_view::yes> {
using basic_counter_cell_view::basic_counter_cell_view;
explicit counter_cell_mutable_view(atomic_cell_mutable_view ac) noexcept
: basic_counter_cell_view<mutable_view::yes>(ac)
: basic_counter_cell_view<mutable_view::yes>(ac, ac.value().first_fragment())
{
assert(!ac.value().is_fragmented());
}
void set_timestamp(api::timestamp_type ts) { _cell.set_timestamp(ts); }

View File

@@ -931,7 +931,7 @@ alterKeyspaceStatement returns [std::unique_ptr<cql3::statements::alter_keyspace
alterTableStatement returns [std::unique_ptr<alter_table_statement> expr]
@init {
alter_table_statement::type type;
auto props = cql3::statements::cf_prop_defs();
auto props = make_shared<cql3::statements::cf_prop_defs>();
std::vector<alter_table_statement::column_change> column_changes;
std::vector<std::pair<shared_ptr<cql3::column_identifier::raw>, shared_ptr<cql3::column_identifier::raw>>> renames;
}
@@ -947,7 +947,7 @@ alterTableStatement returns [std::unique_ptr<alter_table_statement> expr]
| '(' id1=cident { column_changes.emplace_back(alter_table_statement::column_change{id1}); }
(',' idn=cident { column_changes.emplace_back(alter_table_statement::column_change{idn}); } )* ')'
)
| K_WITH properties[props] { type = alter_table_statement::type::opts; }
| K_WITH properties[*props] { type = alter_table_statement::type::opts; }
| K_RENAME { type = alter_table_statement::type::rename; }
id1=cident K_TO toId1=cident { renames.emplace_back(id1, toId1); }
( K_AND idn=cident K_TO toIdn=cident { renames.emplace_back(idn, toIdn); } )*
@@ -987,9 +987,9 @@ alterTypeStatement returns [std::unique_ptr<alter_type_statement> expr]
*/
alterViewStatement returns [std::unique_ptr<alter_view_statement> expr]
@init {
auto props = cql3::statements::cf_prop_defs();
auto props = make_shared<cql3::statements::cf_prop_defs>();
}
: K_ALTER K_MATERIALIZED K_VIEW cf=columnFamilyName K_WITH properties[props]
: K_ALTER K_MATERIALIZED K_VIEW cf=columnFamilyName K_WITH properties[*props]
{
$expr = std::make_unique<alter_view_statement>(std::move(cf), std::move(props));
}
@@ -1124,7 +1124,7 @@ dataResource returns [uninitialized<auth::resource> res]
: K_ALL K_KEYSPACES { $res = auth::resource(auth::resource_kind::data); }
| K_KEYSPACE ks = keyspaceName { $res = auth::make_data_resource($ks.id); }
| ( K_COLUMNFAMILY )? cf = columnFamilyName
{ $res = auth::make_data_resource($cf.name.has_keyspace() ? $cf.name.get_keyspace() : "", $cf.name.get_column_family()); }
{ $res = auth::make_data_resource($cf.name->get_keyspace(), $cf.name->get_column_family()); }
;
roleResource returns [uninitialized<auth::resource> res]
@@ -1261,8 +1261,8 @@ ident returns [shared_ptr<cql3::column_identifier> id]
// Keyspace & Column family names
keyspaceName returns [sstring id]
@init { auto name = cql3::cf_name(); }
: ksName[name] { $id = name.get_keyspace(); }
@init { auto name = make_shared<cql3::cf_name>(); }
: ksName[*name] { $id = name->get_keyspace(); }
;
indexName returns [::shared_ptr<cql3::index_name> name]
@@ -1270,9 +1270,9 @@ indexName returns [::shared_ptr<cql3::index_name> name]
: (ksName[*name] '.')? idxName[*name]
;
columnFamilyName returns [cql3::cf_name name]
@init { $name = cql3::cf_name(); }
: (ksName[name] '.')? cfName[name]
columnFamilyName returns [::shared_ptr<cql3::cf_name> name]
@init { $name = ::make_shared<cql3::cf_name>(); }
: (ksName[*name] '.')? cfName[*name]
;
userTypeName returns [uninitialized<cql3::ut_name> name]
@@ -1549,10 +1549,6 @@ relation[std::vector<cql3::relation_ptr>& clauses]
{
$clauses.emplace_back(cql3::multi_column_relation::create_non_in_relation(ids, type, literal));
}
| type=relationType K_SCYLLA_CLUSTERING_BOUND literal=tupleLiteral /* (a, b, c) > (1, 2, 3) or (a, b, c) > (?, ?, ?) */
{
$clauses.emplace_back(cql3::multi_column_relation::create_scylla_clustering_bound_non_in_relation(ids, type, literal));
}
| type=relationType tupleMarker=markerForTuple /* (a, b, c) >= ? */
{ $clauses.emplace_back(cql3::multi_column_relation::create_non_in_relation(ids, type, tupleMarker)); }
)
@@ -1919,8 +1915,6 @@ K_PARTITION: P A R T I T I O N;
K_SCYLLA_TIMEUUID_LIST_INDEX: S C Y L L A '_' T I M E U U I D '_' L I S T '_' I N D E X;
K_SCYLLA_COUNTER_SHARD_LIST: S C Y L L A '_' C O U N T E R '_' S H A R D '_' L I S T;
K_SCYLLA_CLUSTERING_BOUND: S C Y L L A '_' C L U S T E R I N G '_' B O U N D;
K_GROUP: G R O U P;

View File

@@ -35,28 +35,6 @@ struct authorized_prepared_statements_cache_size {
class authorized_prepared_statements_cache_key {
public:
using cache_key_type = std::pair<auth::authenticated_user, typename cql3::prepared_cache_key_type::cache_key_type>;
struct view {
const auth::authenticated_user& user_ref;
const cql3::prepared_cache_key_type& prep_cache_key_ref;
};
struct view_hasher {
size_t operator()(const view& kv) {
return cql3::authorized_prepared_statements_cache_key::hash(kv.user_ref, kv.prep_cache_key_ref.key());
}
};
struct view_equal {
bool operator()(const authorized_prepared_statements_cache_key& k1, const view& k2) {
return k1.key().first == k2.user_ref && k1.key().second == k2.prep_cache_key_ref.key();
}
bool operator()(const view& k2, const authorized_prepared_statements_cache_key& k1) {
return operator()(k1, k2);
}
};
private:
cache_key_type _key;
@@ -122,12 +100,10 @@ private:
public:
using key_type = cache_key_type;
using key_view_type = typename key_type::view;
using key_view_hasher = typename key_type::view_hasher;
using key_view_equal = typename key_type::view_equal;
using value_type = checked_weak_ptr;
using entry_is_too_big = typename cache_type::entry_is_too_big;
using value_ptr = typename cache_type::value_ptr;
using iterator = typename cache_type::iterator;
private:
cache_type _cache;
logging::logger& _logger;
@@ -148,12 +124,38 @@ public:
}).discard_result();
}
value_ptr find(const auth::authenticated_user& user, const cql3::prepared_cache_key_type& prep_cache_key) {
return _cache.find(key_view_type{user, prep_cache_key}, key_view_hasher(), key_view_equal());
iterator find(const auth::authenticated_user& user, const cql3::prepared_cache_key_type& prep_cache_key) {
struct key_view {
const auth::authenticated_user& user_ref;
const cql3::prepared_cache_key_type& prep_cache_key_ref;
};
struct hasher {
size_t operator()(const key_view& kv) {
return cql3::authorized_prepared_statements_cache_key::hash(kv.user_ref, kv.prep_cache_key_ref.key());
}
};
struct equal {
bool operator()(const key_type& k1, const key_view& k2) {
return k1.key().first == k2.user_ref && k1.key().second == k2.prep_cache_key_ref.key();
}
bool operator()(const key_view& k2, const key_type& k1) {
return operator()(k1, k2);
}
};
return _cache.find(key_view{user, prep_cache_key}, hasher(), equal());
}
iterator end() {
return _cache.end();
}
void remove(const auth::authenticated_user& user, const cql3::prepared_cache_key_type& prep_cache_key) {
_cache.remove(key_view_type{user, prep_cache_key}, key_view_hasher(), key_view_equal());
iterator it = find(user, prep_cache_key);
_cache.remove(it);
}
size_t size() const {

View File

@@ -59,8 +59,6 @@ class result_message;
namespace cql3 {
class query_processor;
class metadata;
shared_ptr<const metadata> make_empty_metadata();
@@ -101,9 +99,11 @@ public:
* @param options options for this query (consistency, variables, pageSize, ...)
*/
virtual future<::shared_ptr<cql_transport::messages::result_message>>
execute(query_processor& qp, service::query_state& state, const query_options& options) const = 0;
execute(service::storage_proxy& proxy, service::query_state& state, const query_options& options) const = 0;
virtual bool depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const = 0;
virtual bool depends_on_keyspace(const sstring& ks_name) const = 0;
virtual bool depends_on_column_family(const sstring& cf_name) const = 0;
virtual shared_ptr<const metadata> get_result_metadata() const = 0;

View File

@@ -64,7 +64,7 @@ bytes_opt do_get_value(const schema& schema,
}
assert(cdef.is_atomic());
auto c = cell->as_atomic_cell(cdef);
return c.is_dead(now) ? std::nullopt : bytes_opt(to_bytes(c.value()));
return c.is_dead(now) ? std::nullopt : bytes_opt(c.value().linearize());
}
}
@@ -611,6 +611,22 @@ value_list get_IN_values(const ::shared_ptr<term>& t, size_t k, const query_opti
static constexpr bool inclusive = true, exclusive = false;
/// A range of all X such that X op val.
nonwrapping_range<bytes> to_range(oper_t op, const bytes& val) {
switch (op) {
case oper_t::GT:
return nonwrapping_range<bytes>::make_starting_with(interval_bound(val, exclusive));
case oper_t::GTE:
return nonwrapping_range<bytes>::make_starting_with(interval_bound(val, inclusive));
case oper_t::LT:
return nonwrapping_range<bytes>::make_ending_with(interval_bound(val, exclusive));
case oper_t::LTE:
return nonwrapping_range<bytes>::make_ending_with(interval_bound(val, inclusive));
default:
throw std::logic_error(format("to_range: unknown comparison operator {}", op));
}
}
} // anonymous namespace
expression make_conjunction(expression a, expression b) {
@@ -639,7 +655,7 @@ bool is_satisfied_by(
std::vector<bytes_opt> first_multicolumn_bound(
const expression& restr, const query_options& options, statements::bound bnd) {
auto found = find_atom(restr, [bnd] (const binary_operator& oper) {
return matches(oper.op, bnd) && is_multi_column(oper);
return matches(oper.op, bnd) && std::holds_alternative<std::vector<column_value>>(oper.lhs);
});
if (found) {
return static_pointer_cast<tuples::value>(found->rhs->bind(options))->get_elements();
@@ -648,27 +664,6 @@ std::vector<bytes_opt> first_multicolumn_bound(
}
}
template<typename T>
nonwrapping_range<T> to_range(oper_t op, const T& val) {
static constexpr bool inclusive = true, exclusive = false;
switch (op) {
case oper_t::EQ:
return nonwrapping_range<T>::make_singular(val);
case oper_t::GT:
return nonwrapping_range<T>::make_starting_with(interval_bound(val, exclusive));
case oper_t::GTE:
return nonwrapping_range<T>::make_starting_with(interval_bound(val, inclusive));
case oper_t::LT:
return nonwrapping_range<T>::make_ending_with(interval_bound(val, exclusive));
case oper_t::LTE:
return nonwrapping_range<T>::make_ending_with(interval_bound(val, inclusive));
default:
throw std::logic_error(format("to_range: unknown comparison operator {}", op));
}
}
template nonwrapping_range<clustering_key_prefix> to_range(oper_t, const clustering_key_prefix&);
value_set possible_lhs_values(const column_definition* cdef, const expression& expr, const query_options& options) {
const auto type = cdef ? get_value_comparator(cdef) : long_type.get();
return std::visit(overloaded_functor{
@@ -812,7 +807,7 @@ bool has_supporting_index(
}
std::ostream& operator<<(std::ostream& os, const column_value& cv) {
os << cv.col->name_as_text();
os << *cv.col;
if (cv.sub) {
os << '[' << *cv.sub << ']';
}
@@ -827,10 +822,10 @@ std::ostream& operator<<(std::ostream& os, const expression& expr) {
std::visit(overloaded_functor{
[&] (const token& t) { os << "TOKEN"; },
[&] (const column_value& col) {
fmt::print(os, "{}", col);
fmt::print(os, "({})", col);
},
[&] (const std::vector<column_value>& cvs) {
fmt::print(os, "({})", fmt::join(cvs, ","));
fmt::print(os, "(({}))", fmt::join(cvs, ","));
},
}, opr.lhs);
os << ' ' << opr.op << ' ' << *opr.rhs;

View File

@@ -73,18 +73,11 @@ struct token {};
enum class oper_t { EQ, NEQ, LT, LTE, GTE, GT, IN, CONTAINS, CONTAINS_KEY, IS_NOT, LIKE };
/// Describes the nature of clustering-key comparisons. Useful for implementing SCYLLA_CLUSTERING_BOUND.
enum class comparison_order : char {
cql, ///< CQL order. (a,b)>(1,1) is equivalent to a>1 OR (a=1 AND b>1).
clustering, ///< Table's clustering order. (a,b)>(1,1) means any row past (1,1) in storage.
};
/// Operator restriction: LHS op RHS.
struct binary_operator {
std::variant<column_value, std::vector<column_value>, token> lhs;
oper_t op;
::shared_ptr<term> rhs;
comparison_order order = comparison_order::cql;
};
/// A conjunction of restrictions.
@@ -138,10 +131,6 @@ extern value_set possible_lhs_values(const column_definition*, const expression&
/// Turns value_set into a range, unless it's a multi-valued list (in which case this throws).
extern nonwrapping_range<bytes> to_range(const value_set&);
/// A range of all X such that X op val.
template<typename T>
nonwrapping_range<T> to_range(oper_t op, const T& val);
/// True iff the index can support the entire expression.
extern bool is_supported_by(const expression&, const secondary_index::index&);
@@ -193,8 +182,7 @@ inline const binary_operator* find(const expression& e, oper_t op) {
}
inline bool needs_filtering(oper_t op) {
return (op == oper_t::CONTAINS) || (op == oper_t::CONTAINS_KEY) || (op == oper_t::LIKE) ||
(op == oper_t::IS_NOT) || (op == oper_t::NEQ) ;
return (op == oper_t::CONTAINS) || (op == oper_t::CONTAINS_KEY) || (op == oper_t::LIKE);
}
inline auto find_needs_filtering(const expression& e) {
@@ -223,10 +211,6 @@ inline bool is_compare(oper_t op) {
}
}
inline bool is_multi_column(const binary_operator& op) {
return holds_alternative<std::vector<column_value>>(op.lhs);
}
inline bool has_token(const expression& e) {
return find_atom(e, [] (const binary_operator& o) { return std::holds_alternative<token>(o.lhs); });
}
@@ -235,14 +219,6 @@ inline bool has_slice_or_needs_filtering(const expression& e) {
return find_atom(e, [] (const binary_operator& o) { return is_slice(o.op) || needs_filtering(o.op); });
}
inline bool is_clustering_order(const binary_operator& op) {
return op.order == comparison_order::clustering;
}
inline auto find_clustering_order(const expression& e) {
return find_atom(e, is_clustering_order);
}
/// True iff binary_operator involves a collection.
extern bool is_on_collection(const binary_operator&);

View File

@@ -24,7 +24,6 @@
#include "error_injection_fcts.hh"
#include "utils/error_injection.hh"
#include "types/list.hh"
#include <seastar/core/map_reduce.hh>
namespace cql3
{

View File

@@ -54,7 +54,7 @@ std::ostream& operator<<(std::ostream& os, const std::vector<data_type>& arg_typ
namespace cql3 {
namespace functions {
logging::logger log("cql3_fuctions");
static logging::logger log("cql3_fuctions");
bool abstract_function::requires_thread() const { return false; }

View File

@@ -21,13 +21,9 @@
#include "user_function.hh"
#include "lua.hh"
#include "log.hh"
namespace cql3 {
namespace functions {
extern logging::logger log;
user_function::user_function(function_name name, std::vector<data_type> arg_types, std::vector<sstring> arg_names,
sstring body, sstring language, data_type return_type, bool called_on_null_input, sstring bitcode,
lua::runtime_config cfg)
@@ -60,9 +56,7 @@ bytes_opt user_function::execute(cql_serialization_format sf, const std::vector<
}
values.push_back(bytes ? type->deserialize(*bytes) : data_value::make_null(type));
}
if (!seastar::thread::running_in_thread()) {
on_internal_error(log, "User function cannot be executed in this context");
}
return lua::run_script(lua::bitcode_view{_bitcode}, values, return_type(), _cfg).get0();
}
}

View File

@@ -53,11 +53,11 @@ const sstring& index_name::get_idx() const
return _idx_name;
}
cf_name index_name::get_cf_name() const
::shared_ptr<cf_name> index_name::get_cf_name() const
{
cf_name cf;
auto cf = ::make_shared<cf_name>();
if (has_keyspace()) {
cf.set_keyspace(get_keyspace(), true);
cf->set_keyspace(get_keyspace(), true);
}
return cf;
}

View File

@@ -55,7 +55,7 @@ public:
const sstring& get_idx() const;
cf_name get_cf_name() const;
::shared_ptr<cf_name> get_cf_name() const;
virtual sstring to_string() const override;
};

View File

@@ -55,7 +55,6 @@ bool keyspace_element_name::has_keyspace() const
const sstring& keyspace_element_name::get_keyspace() const
{
assert(_ks_name);
return *_ks_name;
}

View File

@@ -25,8 +25,8 @@
#include "cql3_type.hh"
#include "constants.hh"
#include <boost/iterator/transform_iterator.hpp>
#include <boost/range/adaptor/reversed.hpp>
#include "types/list.hh"
#include "utils/UUID_gen.hh"
namespace cql3 {
@@ -236,6 +236,20 @@ lists::marker::bind(const query_options& options) {
}
}
constexpr db_clock::time_point lists::precision_time::REFERENCE_TIME;
thread_local lists::precision_time lists::precision_time::_last = {db_clock::time_point::max(), 0};
lists::precision_time
lists::precision_time::get_next(db_clock::time_point millis) {
// FIXME: and if time goes backwards?
assert(millis <= _last.millis);
auto next = millis < _last.millis
? precision_time{millis, 9999}
: precision_time{millis, std::max(0, _last.nanos - 1)};
_last = next;
return next;
}
void
lists::setter::execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) {
auto value = _t->bind(params._options);
@@ -373,18 +387,10 @@ lists::do_append(shared_ptr<term> value,
collection_mutation_description appended;
appended.cells.reserve(to_add.size());
for (auto&& e : to_add) {
try {
auto uuid1 = utils::UUID_gen::get_time_UUID_bytes_from_micros_and_submicros(
params.timestamp(),
params._options.next_list_append_seq());
auto uuid = bytes(reinterpret_cast<const int8_t*>(uuid1.data()), uuid1.size());
// FIXME: can e be empty?
appended.cells.emplace_back(
std::move(uuid),
params.make_cell(*ltype->value_comparator(), *e, atomic_cell::collection_member::yes));
} catch (utils::timeuuid_submicro_out_of_range) {
throw exceptions::invalid_request_exception("Too many list values per single CQL statement or batch");
}
auto uuid1 = utils::UUID_gen::get_time_UUID_bytes();
auto uuid = bytes(reinterpret_cast<const int8_t*>(uuid1.data()), uuid1.size());
// FIXME: can e be empty?
appended.cells.emplace_back(std::move(uuid), params.make_cell(*ltype->value_comparator(), *e, atomic_cell::collection_member::yes));
}
m.set_cell(prefix, column, appended.serialize(*ltype));
} else {
@@ -408,42 +414,20 @@ lists::prepender::execute(mutation& m, const clustering_key_prefix& prefix, cons
auto&& lvalue = dynamic_pointer_cast<lists::value>(std::move(value));
assert(lvalue);
// For prepend we need to be able to generate a unique but decreasing
// timeuuid. We achieve that by by using a time in the past which
// is 2x the distance between the original timestamp (it
// would be the current timestamp, user supplied timestamp, or
// unique monotonic LWT timestsamp, whatever is in query
// options) and a reference time of Jan 1 2010 00:00:00.
// E.g. if query timestamp is Jan 1 2020 00:00:00, the prepend
// timestamp will be Jan 1, 2000, 00:00:00.
// 2010-01-01T00:00:00+00:00 in api::timestamp_time format (microseconds)
static constexpr int64_t REFERENCE_TIME_MICROS = 1262304000L * 1000 * 1000;
int64_t micros = params.timestamp();
if (micros > REFERENCE_TIME_MICROS) {
micros = REFERENCE_TIME_MICROS - (micros - REFERENCE_TIME_MICROS);
} else {
// Scylla, unlike Cassandra, respects user-supplied timestamps
// in prepend, but there is nothing useful it can do with
// a timestamp less than Jan 1, 2010, 00:00:00.
throw exceptions::invalid_request_exception("List prepend custom timestamp must be greater than Jan 1 2010 00:00:00");
}
auto time = precision_time::REFERENCE_TIME - (db_clock::now() - precision_time::REFERENCE_TIME);
collection_mutation_description mut;
mut.cells.reserve(lvalue->get_elements().size());
// We reverse the order of insertion, so that the last element gets the lastest time
// (lists are sorted by time)
auto ltype = static_cast<const list_type_impl*>(column.type.get());
int clockseq = params._options.next_list_prepend_seq(lvalue->_elements.size(), utils::UUID_gen::SUBMICRO_LIMIT);
for (auto&& v : lvalue->_elements) {
try {
auto uuid = utils::UUID_gen::get_time_UUID_bytes_from_micros_and_submicros(micros, clockseq++);
mut.cells.emplace_back(bytes(uuid.data(), uuid.size()), params.make_cell(*ltype->value_comparator(), *v, atomic_cell::collection_member::yes));
} catch (utils::timeuuid_submicro_out_of_range) {
throw exceptions::invalid_request_exception("Too many list values per single CQL statement or batch");
}
for (auto&& v : lvalue->_elements | boost::adaptors::reversed) {
auto&& pt = precision_time::get_next(time);
auto uuid = utils::UUID_gen::get_time_UUID_bytes(pt.millis.time_since_epoch().count(), pt.nanos);
mut.cells.emplace_back(bytes(uuid.data(), uuid.size()), params.make_cell(*ltype->value_comparator(), *v, atomic_cell::collection_member::yes));
}
// now reverse again, to get the original order back
std::reverse(mut.cells.begin(), mut.cells.end());
m.set_cell(prefix, column, mut.serialize(*ltype));
}

View File

@@ -43,6 +43,7 @@
#include "cql3/abstract_marker.hh"
#include "to_string.hh"
#include "utils/UUID_gen.hh"
#include "operation.hh"
namespace cql3 {
@@ -120,6 +121,28 @@ public:
virtual ::shared_ptr<terminal> bind(const query_options& options) override;
};
/*
* For prepend, we need to be able to generate unique but decreasing time
* UUID, which is a bit challenging. To do that, given a time in milliseconds,
* we adds a number representing the 100-nanoseconds precision and make sure
* that within the same millisecond, that number is always decreasing. We
* do rely on the fact that the user will only provide decreasing
* milliseconds timestamp for that purpose.
*/
private:
class precision_time {
public:
// Our reference time (1 jan 2010, 00:00:00) in milliseconds.
static constexpr db_clock::time_point REFERENCE_TIME{std::chrono::milliseconds(1262304000000)};
private:
static thread_local precision_time _last;
public:
db_clock::time_point millis;
int32_t nanos;
static precision_time get_next(db_clock::time_point millis);
};
public:
class setter : public operation {
public:

View File

@@ -310,7 +310,7 @@ maps::setter_by_key::execute(mutation& m, const clustering_key_prefix& prefix, c
if (value.is_unset_value()) {
return;
}
if (key.is_unset_value()) {
if (key.is_unset_value() || value.is_unset_value()) {
throw invalid_request_exception("Invalid unset map key");
}
if (!key) {

View File

@@ -59,32 +59,30 @@ namespace cql3 {
* - SELECT ... WHERE (a, b) IN ?
*/
class multi_column_relation final : public relation {
public:
using mode = expr::comparison_order;
private:
std::vector<shared_ptr<column_identifier::raw>> _entities;
shared_ptr<term::multi_column_raw> _values_or_marker;
std::vector<shared_ptr<term::multi_column_raw>> _in_values;
shared_ptr<tuples::in_raw> _in_marker;
mode _mode;
public:
multi_column_relation(std::vector<shared_ptr<column_identifier::raw>> entities,
expr::oper_t relation_type, shared_ptr<term::multi_column_raw> values_or_marker,
std::vector<shared_ptr<term::multi_column_raw>> in_values, shared_ptr<tuples::in_raw> in_marker, mode m = mode::cql)
std::vector<shared_ptr<term::multi_column_raw>> in_values, shared_ptr<tuples::in_raw> in_marker)
: relation(relation_type)
, _entities(std::move(entities))
, _values_or_marker(std::move(values_or_marker))
, _in_values(std::move(in_values))
, _in_marker(std::move(in_marker))
, _mode(m)
{ }
static shared_ptr<multi_column_relation> create_multi_column_relation(
std::vector<shared_ptr<column_identifier::raw>> entities, expr::oper_t relation_type,
shared_ptr<term::multi_column_raw> values_or_marker, std::vector<shared_ptr<term::multi_column_raw>> in_values,
shared_ptr<tuples::in_raw> in_marker, mode m = mode::cql) {
shared_ptr<tuples::in_raw> in_marker) {
return ::make_shared<multi_column_relation>(std::move(entities), relation_type, std::move(values_or_marker),
std::move(in_values), std::move(in_marker), m);
std::move(in_values), std::move(in_marker));
}
/**
@@ -101,15 +99,6 @@ public:
return create_multi_column_relation(std::move(entities), relation_type, std::move(values_or_marker), {}, {});
}
/**
* Same as above, but sets the magic mode that causes us to treat the restrictions as "raw" clustering bounds
*/
static shared_ptr<multi_column_relation> create_scylla_clustering_bound_non_in_relation(std::vector<shared_ptr<column_identifier::raw>> entities,
expr::oper_t relation_type, shared_ptr<term::multi_column_raw> values_or_marker) {
assert(relation_type != expr::oper_t::IN);
return create_multi_column_relation(std::move(entities), relation_type, std::move(values_or_marker), {}, {}, mode::clustering);
}
/**
* Creates a multi-column IN relation with a list of IN values or markers.
* For example: "SELECT ... WHERE (a, b) IN ((0, 1), (2, 3))"
@@ -202,7 +191,7 @@ protected:
return cs->column_specification;
});
auto t = to_term(col_specs, *get_value(), db, schema->ks_name(), bound_names);
return ::make_shared<restrictions::multi_column_restriction::slice>(schema, rs, bound, inclusive, t, _mode);
return ::make_shared<restrictions::multi_column_restriction::slice>(schema, rs, bound, inclusive, t);
}
virtual shared_ptr<restrictions::restriction> new_contains_restriction(database& db, schema_ptr schema,

View File

@@ -102,7 +102,13 @@ private:
using cache_key_type = typename prepared_cache_key_type::cache_key_type;
using cache_type = utils::loading_cache<cache_key_type, prepared_cache_entry, utils::loading_cache_reload_enabled::no, prepared_cache_entry_size, utils::tuple_hash, std::equal_to<cache_key_type>, prepared_cache_stats_updater>;
using cache_value_ptr = typename cache_type::value_ptr;
using cache_iterator = typename cache_type::iterator;
using checked_weak_ptr = typename statements::prepared_statement::checked_weak_ptr;
struct value_extractor_fn {
checked_weak_ptr operator()(prepared_cache_entry& e) const {
return e->checked_weak_from_this();
}
};
public:
static const std::chrono::minutes entry_expiry;
@@ -110,9 +116,12 @@ public:
using key_type = prepared_cache_key_type;
using value_type = checked_weak_ptr;
using statement_is_too_big = typename cache_type::entry_is_too_big;
/// \note both iterator::reference and iterator::value_type are checked_weak_ptr
using iterator = boost::transform_iterator<value_extractor_fn, cache_iterator>;
private:
cache_type _cache;
value_extractor_fn _value_extractor_fn;
public:
prepared_statements_cache(logging::logger& logger, size_t size)
@@ -126,12 +135,16 @@ public:
});
}
value_type find(const key_type& key) {
cache_value_ptr vp = _cache.find(key.key());
if (vp) {
return (*vp)->checked_weak_from_this();
}
return value_type();
iterator find(const key_type& key) {
return boost::make_transform_iterator(_cache.find(key.key()), _value_extractor_fn);
}
iterator end() {
return boost::make_transform_iterator(_cache.end(), _value_extractor_fn);
}
iterator begin() {
return boost::make_transform_iterator(_cache.begin(), _value_extractor_fn);
}
template <typename Pred>

View File

@@ -52,11 +52,12 @@ thread_local const query_options::specific_options query_options::specific_optio
-1, {}, db::consistency_level::SERIAL, api::missing_timestamp};
thread_local query_options query_options::DEFAULT{default_cql_config,
db::consistency_level::ONE, std::nullopt,
db::consistency_level::ONE, infinite_timeout_config, std::nullopt,
std::vector<cql3::raw_value_view>(), false, query_options::specific_options::DEFAULT, cql_serialization_format::latest()};
query_options::query_options(const cql_config& cfg,
db::consistency_level consistency,
const ::timeout_config& timeout_config,
std::optional<std::vector<sstring_view>> names,
std::vector<cql3::raw_value> values,
std::vector<cql3::raw_value_view> value_views,
@@ -65,6 +66,7 @@ query_options::query_options(const cql_config& cfg,
cql_serialization_format sf)
: _cql_config(cfg)
, _consistency(consistency)
, _timeout_config(timeout_config)
, _names(std::move(names))
, _values(std::move(values))
, _value_views(value_views)
@@ -76,6 +78,7 @@ query_options::query_options(const cql_config& cfg,
query_options::query_options(const cql_config& cfg,
db::consistency_level consistency,
const ::timeout_config& timeout_config,
std::optional<std::vector<sstring_view>> names,
std::vector<cql3::raw_value> values,
bool skip_metadata,
@@ -83,6 +86,7 @@ query_options::query_options(const cql_config& cfg,
cql_serialization_format sf)
: _cql_config(cfg)
, _consistency(consistency)
, _timeout_config(timeout_config)
, _names(std::move(names))
, _values(std::move(values))
, _value_views()
@@ -95,6 +99,7 @@ query_options::query_options(const cql_config& cfg,
query_options::query_options(const cql_config& cfg,
db::consistency_level consistency,
const ::timeout_config& timeout_config,
std::optional<std::vector<sstring_view>> names,
std::vector<cql3::raw_value_view> value_views,
bool skip_metadata,
@@ -102,6 +107,7 @@ query_options::query_options(const cql_config& cfg,
cql_serialization_format sf)
: _cql_config(cfg)
, _consistency(consistency)
, _timeout_config(timeout_config)
, _names(std::move(names))
, _values()
, _value_views(std::move(value_views))
@@ -111,11 +117,12 @@ query_options::query_options(const cql_config& cfg,
{
}
query_options::query_options(db::consistency_level cl, std::vector<cql3::raw_value> values,
query_options::query_options(db::consistency_level cl, const ::timeout_config& timeout_config, std::vector<cql3::raw_value> values,
specific_options options)
: query_options(
default_cql_config,
cl,
timeout_config,
{},
std::move(values),
false,
@@ -128,6 +135,7 @@ query_options::query_options(db::consistency_level cl, std::vector<cql3::raw_val
query_options::query_options(std::unique_ptr<query_options> qo, lw_shared_ptr<service::pager::paging_state> paging_state)
: query_options(qo->_cql_config,
qo->_consistency,
qo->get_timeout_config(),
std::move(qo->_names),
std::move(qo->_values),
std::move(qo->_value_views),
@@ -140,6 +148,7 @@ query_options::query_options(std::unique_ptr<query_options> qo, lw_shared_ptr<se
query_options::query_options(std::unique_ptr<query_options> qo, lw_shared_ptr<service::pager::paging_state> paging_state, int32_t page_size)
: query_options(qo->_cql_config,
qo->_consistency,
qo->get_timeout_config(),
std::move(qo->_names),
std::move(qo->_values),
std::move(qo->_value_views),
@@ -151,7 +160,7 @@ query_options::query_options(std::unique_ptr<query_options> qo, lw_shared_ptr<se
query_options::query_options(std::vector<cql3::raw_value> values)
: query_options(
db::consistency_level::ONE, std::move(values))
db::consistency_level::ONE, infinite_timeout_config, std::move(values))
{}
void query_options::prepare(const std::vector<lw_shared_ptr<column_specification>>& specs)

View File

@@ -51,6 +51,7 @@
#include "cql3/column_identifier.hh"
#include "cql3/values.hh"
#include "cql_serialization_format.hh"
#include "timeout_config.hh"
namespace cql3 {
@@ -74,6 +75,7 @@ public:
private:
const cql_config& _cql_config;
const db::consistency_level _consistency;
const timeout_config& _timeout_config;
const std::optional<std::vector<sstring_view>> _names;
std::vector<cql3::raw_value> _values;
std::vector<cql3::raw_value_view> _value_views;
@@ -81,16 +83,7 @@ private:
const specific_options _options;
cql_serialization_format _cql_serialization_format;
std::optional<std::vector<query_options>> _batch_options;
// We must use the same microsecond-precision timestamp for
// all cells created by an LWT statement or when a statement
// has a user-provided timestamp. In case the statement or
// a BATCH appends many values to a list, each value should
// get a unique and monotonic timeuuid. This sequence is
// used to make all time-based UUIDs:
// 1) share the same microsecond,
// 2) monotonic
// 3) unique.
mutable int _list_append_seq = 0;
private:
/**
* @brief Batch query_options constructor.
@@ -116,6 +109,7 @@ public:
explicit query_options(const cql_config& cfg,
db::consistency_level consistency,
const timeout_config& timeouts,
std::optional<std::vector<sstring_view>> names,
std::vector<cql3::raw_value> values,
bool skip_metadata,
@@ -123,6 +117,7 @@ public:
cql_serialization_format sf);
explicit query_options(const cql_config& cfg,
db::consistency_level consistency,
const timeout_config& timeouts,
std::optional<std::vector<sstring_view>> names,
std::vector<cql3::raw_value> values,
std::vector<cql3::raw_value_view> value_views,
@@ -131,6 +126,7 @@ public:
cql_serialization_format sf);
explicit query_options(const cql_config& cfg,
db::consistency_level consistency,
const timeout_config& timeouts,
std::optional<std::vector<sstring_view>> names,
std::vector<cql3::raw_value_view> value_views,
bool skip_metadata,
@@ -162,10 +158,13 @@ public:
// forInternalUse
explicit query_options(std::vector<cql3::raw_value> values);
explicit query_options(db::consistency_level, std::vector<cql3::raw_value> values, specific_options options = specific_options::DEFAULT);
explicit query_options(db::consistency_level, const timeout_config& timeouts,
std::vector<cql3::raw_value> values, specific_options options = specific_options::DEFAULT);
explicit query_options(std::unique_ptr<query_options>, lw_shared_ptr<service::pager::paging_state> paging_state);
explicit query_options(std::unique_ptr<query_options>, lw_shared_ptr<service::pager::paging_state> paging_state, int32_t page_size);
const timeout_config& get_timeout_config() const { return _timeout_config; }
db::consistency_level get_consistency() const {
return _consistency;
}
@@ -242,39 +241,6 @@ public:
return _cql_config;
}
// Generate a next unique list sequence for list append, e.g.
// a = a + [val1, val2, ...]
int next_list_append_seq() const {
return _list_append_seq++;
}
// To preserve prepend monotonicity within a batch, each next
// value must get a timestamp that's smaller than the previous one:
// BEGIN BATCH
// UPDATE t SET l = [1, 2] + l WHERE pk = 0;
// UPDATE t SET l = [3] + l WHERE pk = 0;
// UPDATE t SET l = [4] + l WHERE pk = 0;
// APPLY BATCH
// SELECT l FROM t WHERE pk = 0;
// l
// ------------
// [4, 3, 1, 2]
//
// This function reserves the given number of prepend entries
// and returns an id for the first prepended entry (it
// got to be the smallest one, to preserve the order of
// a multi-value append).
//
// @retval sequence number of the first entry of a multi-value
// append. To get the next value, add 1.
int next_list_prepend_seq(int num_entries, int max_entries) const {
if (_list_append_seq + num_entries < max_entries) {
_list_append_seq += num_entries;
return max_entries - _list_append_seq;
}
return max_entries;
}
void prepare(const std::vector<lw_shared_ptr<column_specification>>& specs);
private:
void fill_value_views();
@@ -292,7 +258,7 @@ query_options::query_options(query_options&& o, std::vector<OneMutationDataRange
std::vector<query_options> tmp;
tmp.reserve(values_ranges.size());
std::transform(values_ranges.begin(), values_ranges.end(), std::back_inserter(tmp), [this](auto& values_range) {
return query_options(_cql_config, _consistency, {}, std::move(values_range), _skip_metadata, _options, _cql_serialization_format);
return query_options(_cql_config, _consistency, _timeout_config, {}, std::move(values_range), _skip_metadata, _options, _cql_serialization_format);
});
_batch_options = std::move(tmp);
}

View File

@@ -84,12 +84,11 @@ public:
}
};
query_processor::query_processor(service::storage_proxy& proxy, database& db, service::migration_notifier& mn, service::migration_manager& mm, query_processor::memory_config mcfg, cql_config& cql_cfg)
query_processor::query_processor(service::storage_proxy& proxy, database& db, service::migration_notifier& mn, query_processor::memory_config mcfg, cql_config& cql_cfg)
: _migration_subscriber{std::make_unique<migration_subscriber>(this)}
, _proxy(proxy)
, _db(db)
, _mnotifier(mn)
, _mm(mm)
, _cql_config(cql_cfg)
, _internal_state(new internal_state())
, _prepared_cache(prep_cache_log, mcfg.prepared_statment_cache_size)
@@ -528,7 +527,7 @@ query_processor::process_authorized_statement(const ::shared_ptr<cql_statement>
statement->validate(_proxy, client_state);
auto fut = statement->execute(*this, query_state, options);
auto fut = statement->execute(_proxy, query_state, options);
return fut.then([statement] (auto msg) {
if (msg) {
@@ -620,6 +619,7 @@ query_options query_processor::make_internal_options(
const statements::prepared_statement::checked_weak_ptr& p,
const std::initializer_list<data_value>& values,
db::consistency_level cl,
const timeout_config& timeout_config,
int32_t page_size) const {
if (p->bound_names.size() != values.size()) {
throw std::invalid_argument(
@@ -643,10 +643,11 @@ query_options query_processor::make_internal_options(
api::timestamp_type ts = api::missing_timestamp;
return query_options(
cl,
timeout_config,
bound_values,
cql3::query_options::specific_options{page_size, std::move(paging_state), serial_consistency, ts});
}
return query_options(cl, bound_values);
return query_options(cl, timeout_config, bound_values);
}
statements::prepared_statement::checked_weak_ptr query_processor::prepare_internal(const sstring& query_string) {
@@ -670,10 +671,11 @@ struct internal_query_state {
::shared_ptr<internal_query_state> query_processor::create_paged_state(
const sstring& query_string,
db::consistency_level cl,
const timeout_config& timeout_config,
const std::initializer_list<data_value>& values,
int32_t page_size) {
auto p = prepare_internal(query_string);
auto opts = make_internal_options(p, values, cl, page_size);
auto opts = make_internal_options(p, values, cl, timeout_config, page_size);
::shared_ptr<internal_query_state> res = ::make_shared<internal_query_state>(
internal_query_state{
query_string,
@@ -752,7 +754,7 @@ future<> query_processor::for_each_cql_result(
future<::shared_ptr<untyped_result_set>>
query_processor::execute_paged_internal(::shared_ptr<internal_query_state> state) {
return state->p->statement->execute(*this, *_internal_state, *state->opts).then(
return state->p->statement->execute(_proxy, *_internal_state, *state->opts).then(
[state, this](::shared_ptr<cql_transport::messages::result_message> msg) mutable {
class visitor : public result_message::visitor_base {
::shared_ptr<internal_query_state> _state;
@@ -791,16 +793,7 @@ future<::shared_ptr<untyped_result_set>>
query_processor::execute_internal(
const sstring& query_string,
db::consistency_level cl,
const std::initializer_list<data_value>& values,
bool cache) {
return execute_internal(query_string, cl, *_internal_state, values, cache);
}
future<::shared_ptr<untyped_result_set>>
query_processor::execute_internal(
const sstring& query_string,
db::consistency_level cl,
service::query_state& query_state,
const timeout_config& timeout_config,
const std::initializer_list<data_value>& values,
bool cache) {
@@ -808,13 +801,13 @@ query_processor::execute_internal(
log.trace("execute_internal: {}\"{}\" ({})", cache ? "(cached) " : "", query_string, ::join(", ", values));
}
if (cache) {
return execute_with_params(prepare_internal(query_string), cl, query_state, values);
return execute_with_params(prepare_internal(query_string), cl, timeout_config, values);
} else {
auto p = parse_statement(query_string)->prepare(_db, _cql_stats);
p->statement->raw_cql_statement = query_string;
p->statement->validate(_proxy, *_internal_state);
auto checked_weak_ptr = p->checked_weak_from_this();
return execute_with_params(std::move(checked_weak_ptr), cl, query_state, values).finally([p = std::move(p)] {});
return execute_with_params(std::move(checked_weak_ptr), cl, timeout_config, values).finally([p = std::move(p)] {});
}
}
@@ -822,11 +815,11 @@ future<::shared_ptr<untyped_result_set>>
query_processor::execute_with_params(
statements::prepared_statement::checked_weak_ptr p,
db::consistency_level cl,
service::query_state& query_state,
const timeout_config& timeout_config,
const std::initializer_list<data_value>& values) {
auto opts = make_internal_options(p, values, cl);
return do_with(std::move(opts), [this, &query_state, p = std::move(p)](auto & opts) {
return p->statement->execute(*this, query_state, opts).then([](auto msg) {
auto opts = make_internal_options(p, values, cl, timeout_config);
return do_with(std::move(opts), [this, p = std::move(p)](auto & opts) {
return p->statement->execute(_proxy, *_internal_state, opts).then([](auto msg) {
return make_ready_future<::shared_ptr<untyped_result_set>>(::make_shared<untyped_result_set>(msg));
});
});
@@ -854,7 +847,7 @@ query_processor::execute_batch(
}
log.trace("execute_batch({}): {}", batch->get_statements().size(), oss.str());
}
return batch->execute(*this, query_state, options);
return batch->execute(_proxy, query_state, options);
});
});
}
@@ -943,22 +936,23 @@ bool query_processor::migration_subscriber::should_invalidate(
sstring ks_name,
std::optional<sstring> cf_name,
::shared_ptr<cql_statement> statement) {
return statement->depends_on(ks_name, cf_name);
return statement->depends_on_keyspace(ks_name) && (!cf_name || statement->depends_on_column_family(*cf_name));
}
future<> query_processor::query_internal(
const sstring& query_string,
db::consistency_level cl,
const timeout_config& timeout_config,
const std::initializer_list<data_value>& values,
int32_t page_size,
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)>&& f) {
return for_each_cql_result(create_paged_state(query_string, cl, values, page_size), std::move(f));
return for_each_cql_result(create_paged_state(query_string, cl, timeout_config, values, page_size), std::move(f));
}
future<> query_processor::query_internal(
const sstring& query_string,
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)>&& f) {
return query_internal(query_string, db::consistency_level::ONE, {}, 1000, std::move(f));
return query_internal(query_string, db::consistency_level::ONE, infinite_timeout_config, {}, 1000, std::move(f));
}
}

View File

@@ -58,10 +58,6 @@
#include "service/query_state.hh"
#include "transport/messages/result_message.hh"
namespace service {
class migration_manager;
}
namespace cql3 {
namespace statements {
@@ -119,7 +115,6 @@ private:
service::storage_proxy& _proxy;
database& _db;
service::migration_notifier& _mnotifier;
service::migration_manager& _mm;
const cql_config& _cql_config;
struct stats {
@@ -154,7 +149,7 @@ public:
static std::unique_ptr<statements::raw::parsed_statement> parse_statement(const std::string_view& query);
query_processor(service::storage_proxy& proxy, database& db, service::migration_notifier& mn, service::migration_manager& mm, memory_config mcfg, cql_config& cql_cfg);
query_processor(service::storage_proxy& proxy, database& db, service::migration_notifier& mn, memory_config mcfg, cql_config& cql_cfg);
~query_processor();
@@ -170,19 +165,16 @@ public:
return _proxy;
}
const service::migration_manager& get_migration_manager() const noexcept { return _mm; }
service::migration_manager& get_migration_manager() noexcept { return _mm; }
cql_stats& get_cql_stats() {
return _cql_stats;
}
statements::prepared_statement::checked_weak_ptr get_prepared(const std::optional<auth::authenticated_user>& user, const prepared_cache_key_type& key) {
if (user) {
auto vp = _authorized_prepared_cache.find(*user, key);
if (vp) {
auto it = _authorized_prepared_cache.find(*user, key);
if (it != _authorized_prepared_cache.end()) {
try {
return vp->get()->checked_weak_from_this();
return it->get()->checked_weak_from_this();
} catch (seastar::checked_ptr_is_null_exception&) {
// If the prepared statement got invalidated - remove the corresponding authorized_prepared_statements_cache entry as well.
_authorized_prepared_cache.remove(*user, key);
@@ -193,7 +185,11 @@ public:
}
statements::prepared_statement::checked_weak_ptr get_prepared(const prepared_cache_key_type& key) {
return _prepared_cache.find(key);
auto it = _prepared_cache.find(key);
if (it == _prepared_cache.end()) {
return statements::prepared_statement::checked_weak_ptr();
}
return *it;
}
future<::shared_ptr<cql_transport::messages::result_message>>
@@ -219,7 +215,8 @@ public:
// creating namespaces, etc) is explicitly forbidden via this interface.
future<::shared_ptr<untyped_result_set>>
execute_internal(const sstring& query_string, const std::initializer_list<data_value>& values = { }) {
return execute_internal(query_string, db::consistency_level::ONE, values, true);
return execute_internal(query_string, db::consistency_level::ONE,
infinite_timeout_config, values, true);
}
statements::prepared_statement::checked_weak_ptr prepare_internal(const sstring& query);
@@ -237,6 +234,7 @@ public:
return query_internal(
"SELECT * from system.compaction_history",
db::consistency_level::ONE,
infinite_timeout_config,
{},
[&history] (const cql3::untyped_result_set::row& row) mutable {
....
@@ -248,6 +246,7 @@ public:
*
* query_string - the cql string, can contain placeholders
* cl - consistency level of the query
* timeout_config - timeout configuration
* values - values to be substituted for the placeholders in the query
* page_size - maximum page size
* f - a function to be run on each row of the query result,
@@ -256,6 +255,7 @@ public:
future<> query_internal(
const sstring& query_string,
db::consistency_level cl,
const timeout_config& timeout_config,
const std::initializer_list<data_value>& values,
int32_t page_size,
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)>&& f);
@@ -282,19 +282,14 @@ public:
future<::shared_ptr<untyped_result_set>> execute_internal(
const sstring& query_string,
db::consistency_level,
const std::initializer_list<data_value>& = { },
bool cache = false);
future<::shared_ptr<untyped_result_set>> execute_internal(
const sstring& query_string,
db::consistency_level,
service::query_state& query_state,
const timeout_config& timeout_config,
const std::initializer_list<data_value>& = { },
bool cache = false);
future<::shared_ptr<untyped_result_set>> execute_with_params(
statements::prepared_statement::checked_weak_ptr p,
db::consistency_level,
service::query_state& query_state,
const timeout_config& timeout_config,
const std::initializer_list<data_value>& = { });
future<::shared_ptr<cql_transport::messages::result_message::prepared>>
@@ -323,6 +318,7 @@ private:
const statements::prepared_statement::checked_weak_ptr& p,
const std::initializer_list<data_value>&,
db::consistency_level,
const timeout_config& timeout_config,
int32_t page_size = -1) const;
future<::shared_ptr<cql_transport::messages::result_message>>
@@ -336,6 +332,7 @@ private:
::shared_ptr<internal_query_state> create_paged_state(
const sstring& query_string,
db::consistency_level,
const timeout_config&,
const std::initializer_list<data_value>&,
int32_t page_size);

View File

@@ -377,24 +377,21 @@ protected:
class multi_column_restriction::slice final : public multi_column_restriction {
using restriction_shared_ptr = ::shared_ptr<clustering_key_restrictions>;
using mode = expr::comparison_order;
private:
term_slice _slice;
mode _mode;
slice(schema_ptr schema, std::vector<const column_definition*> defs, term_slice slice, mode m)
slice(schema_ptr schema, std::vector<const column_definition*> defs, term_slice slice)
: multi_column_restriction(schema, std::move(defs))
, _slice(slice)
, _mode(m)
{ }
public:
slice(schema_ptr schema, std::vector<const column_definition*> defs, statements::bound bound, bool inclusive, shared_ptr<term> term, mode m = mode::cql)
: slice(schema, defs, term_slice::new_instance(bound, inclusive, term), m)
slice(schema_ptr schema, std::vector<const column_definition*> defs, statements::bound bound, bool inclusive, shared_ptr<term> term)
: slice(schema, defs, term_slice::new_instance(bound, inclusive, term))
{
expression = expr::binary_operator{
std::vector<expr::column_value>(defs.cbegin(), defs.cend()),
expr::pick_operator(bound, inclusive),
std::move(term),
m};
std::move(term)};
}
virtual bool is_supported_by(const secondary_index::index& index) const override {
@@ -407,7 +404,7 @@ public:
}
virtual std::vector<bounds_range_type> bounds_ranges(const query_options& options) const override {
if (_mode == mode::clustering || !is_mixed_order()) {
if (!is_mixed_order()) {
return bounds_ranges_unified_order(options);
} else {
return bounds_ranges_mixed_order(options);
@@ -443,11 +440,6 @@ public:
get_columns_in_commons(other));
auto other_slice = static_pointer_cast<slice>(other);
static auto mode2str = [](auto m) { return m == mode::cql ? "plain" : "SCYLLA_CLUSTERING_BOUND"; };
check_true(other_slice->_mode == this->_mode,
"Invalid combination of restrictions (%s / %s)",
mode2str(this->_mode), mode2str(other_slice->_mode)
);
check_false(_slice.has_bound(statements::bound::START) && other_slice->_slice.has_bound(statements::bound::START),
"More than one restriction was found for the start bound on %s",
get_columns_in_commons(other));
@@ -492,7 +484,7 @@ private:
auto end_prefix = clustering_key_prefix::from_optional_exploded(*_schema, end_components);
end_bound = bounds_range_type::bound(std::move(end_prefix), _slice.is_inclusive(statements::bound::END));
}
if (_mode == mode::cql && !is_asc_order()) {
if (!is_asc_order()) {
std::swap(start_bound, end_bound);
}
auto range = bounds_range_type(start_bound, end_bound);

View File

@@ -34,7 +34,6 @@
#include "multi_column_restriction.hh"
#include "token_restriction.hh"
#include "database.hh"
#include "cartesian_product.hh"
#include "cql3/constants.hh"
#include "cql3/lists.hh"
@@ -133,7 +132,6 @@ statement_restrictions::statement_restrictions(schema_ptr schema, bool allow_fil
, _partition_key_restrictions(get_initial_partition_key_restrictions(allow_filtering))
, _clustering_columns_restrictions(get_initial_clustering_key_restrictions(allow_filtering))
, _nonprimary_key_restrictions(::make_shared<single_column_restrictions>(schema))
, _partition_range_is_simple(true)
{ }
#if 0
static const column_definition*
@@ -143,79 +141,6 @@ to_column_definition(const schema_ptr& schema, const ::shared_ptr<column_identif
}
#endif
/// Extracts where_clause atoms with clustering-column LHS and copies them to a vector such that:
/// 1. all elements must be simultaneously satisfied (as restrictions) for where_clause to be satisfied
/// 2. each element is an atom or a conjunction of atoms
/// 3. either all atoms (across all elements) are multi-column or they are all single-column
/// 4. if single-column, then:
/// 4.1 all atoms from an element have the same LHS, which we call the element's LHS
/// 4.2 each element's LHS is different from any other element's LHS
/// 4.3 the list of each element's LHS, in order, forms a clustering-key prefix
/// 4.4 elements other than the last have only EQ or IN atoms
/// 4.5 the last element has only EQ, IN, or is_slice() atoms
///
/// These elements define the boundaries of any clustering slice that can possibly meet where_clause.
///
/// This vector can be calculated before binding expression markers, since LHS and operator are always known.
static std::vector<expr::expression> extract_clustering_prefix_restrictions(
const expr::expression& where_clause, schema_ptr schema) {
using namespace expr;
/// Collects all clustering-column restrictions from an expression. Presumes the expression only uses
/// conjunction to combine subexpressions.
struct visitor {
std::vector<expression> multi; ///< All multi-column restrictions.
/// All single-clustering-column restrictions, grouped by column. Each value is either an atom or a
/// conjunction of atoms.
std::unordered_map<const column_definition*, expression> single;
void operator()(const conjunction& c) {
std::ranges::for_each(c.children, [this] (const expression& child) { std::visit(*this, child); });
}
void operator()(const binary_operator& b) {
if (std::holds_alternative<std::vector<column_value>>(b.lhs)) {
multi.push_back(b);
}
if (auto s = std::get_if<column_value>(&b.lhs)) {
if (s->col->is_clustering_key()) {
const auto found = single.find(s->col);
if (found == single.end()) {
single[s->col] = b;
} else {
found->second = make_conjunction(std::move(found->second), b);
}
}
}
}
void operator()(bool) {}
} v;
std::visit(v, where_clause);
if (!v.multi.empty()) {
return move(v.multi);
}
std::vector<expression> prefix;
for (const auto& col : schema->clustering_key_columns()) {
const auto found = v.single.find(&col);
if (found == v.single.end()) { // Any further restrictions are skipping the CK order.
break;
}
if (find_needs_filtering(found->second)) { // This column's restriction doesn't define a clear bound.
// TODO: if this is a conjunction of filtering and non-filtering atoms, we could split them and add the
// latter to the prefix.
break;
}
prefix.push_back(found->second);
if (has_slice(found->second)) {
break;
}
}
return prefix;
}
statement_restrictions::statement_restrictions(database& db,
schema_ptr schema,
statements::statement_type type,
@@ -226,6 +151,13 @@ statement_restrictions::statement_restrictions(database& db,
bool allow_filtering)
: statement_restrictions(schema, allow_filtering)
{
/*
* WHERE clause. For a given entity, rules are: - EQ relation conflicts with anything else (including a 2nd EQ)
* - Can't have more than one LT(E) relation (resp. GT(E) relation) - IN relation are restricted to row keys
* (for now) and conflicts with anything else (we could allow two IN for the same entity but that doesn't seem
* very useful) - The value_alias cannot be restricted in any way (we don't support wide rows with indexed value
* in CQL so far)
*/
if (!where_clause.empty()) {
for (auto&& relation : where_clause) {
if (relation->get_operator() == cql3::expr::oper_t::IS_NOT) {
@@ -255,9 +187,6 @@ statement_restrictions::statement_restrictions(database& db,
}
}
}
if (_where.has_value()) {
_clustering_prefix_restrictions = extract_clustering_prefix_restrictions(*_where, _schema);
}
auto& cf = db.find_column_family(schema);
auto& sim = cf.get_index_manager();
const expr::allow_local_index allow_local(
@@ -336,7 +265,7 @@ statement_restrictions::statement_restrictions(database& db,
}
if (!_nonprimary_key_restrictions->empty()) {
if (_has_queriable_regular_index && _partition_range_is_simple) {
if (_has_queriable_regular_index) {
_uses_secondary_indexing = true;
} else if (!allow_filtering) {
throw exceptions::invalid_request_exception("Cannot execute this query as it might involve data filtering and "
@@ -359,7 +288,6 @@ void statement_restrictions::add_restriction(::shared_ptr<restriction> restricti
} else {
add_single_column_restriction(::static_pointer_cast<single_column_restriction>(restriction), for_view, allow_filtering);
}
_where = _where.has_value() ? make_conjunction(std::move(*_where), restriction->expression) : restriction->expression;
}
void statement_restrictions::add_single_column_restriction(::shared_ptr<single_column_restriction> restriction, bool for_view, bool allow_filtering) {
@@ -378,7 +306,6 @@ void statement_restrictions::add_single_column_restriction(::shared_ptr<single_c
"Only EQ and IN relation are supported on the partition key (unless you use the token() function or allow filtering)");
}
_partition_key_restrictions = _partition_key_restrictions->merge_to(_schema, restriction);
_partition_range_is_simple &= !find(restriction->expression, expr::oper_t::IN);
} else if (def.is_clustering_key()) {
_clustering_columns_restrictions = _clustering_columns_restrictions->merge_to(_schema, restriction);
} else {
@@ -565,498 +492,17 @@ dht::partition_range_vector statement_restrictions::get_partition_key_ranges(con
return _partition_key_restrictions->bounds_ranges(options);
}
namespace {
using namespace expr;
clustering_key_prefix::prefix_equal_tri_compare get_unreversed_tri_compare(const schema& schema) {
clustering_key_prefix::prefix_equal_tri_compare cmp(schema);
std::vector<data_type> types = cmp.prefix_type->types();
for (auto& t : types) {
if (t->is_reversed()) {
t = t->underlying_type();
}
}
cmp.prefix_type = make_lw_shared<compound_type<allow_prefixes::yes>>(types);
return cmp;
}
/// True iff r1 start is strictly before r2 start.
bool starts_before_start(
const query::clustering_range& r1,
const query::clustering_range& r2,
const clustering_key_prefix::prefix_equal_tri_compare& cmp) {
if (!r2.start()) {
return false; // r2 start is -inf, nothing is before that.
}
if (!r1.start()) {
return true; // r1 start is -inf, while r2 start is finite.
}
const auto diff = cmp(r1.start()->value(), r2.start()->value());
if (diff < 0) { // r1 start is strictly before r2 start.
return true;
}
if (diff > 0) { // r1 start is strictly after r2 start.
return false;
}
const auto len1 = r1.start()->value().representation().size();
const auto len2 = r2.start()->value().representation().size();
if (len1 == len2) { // The values truly are equal.
return r1.start()->is_inclusive() && !r2.start()->is_inclusive();
} else if (len1 < len2) { // r1 start is a prefix of r2 start.
// (a)>=(1) starts before (a,b)>=(1,1), but (a)>(1) doesn't.
return r1.start()->is_inclusive();
} else { // r2 start is a prefix of r1 start.
// (a,b)>=(1,1) starts before (a)>(1) but after (a)>=(1).
return r2.start()->is_inclusive();
}
}
/// True iff r1 start is before (or identical as) r2 end.
bool starts_before_or_at_end(
const query::clustering_range& r1,
const query::clustering_range& r2,
const clustering_key_prefix::prefix_equal_tri_compare& cmp) {
if (!r1.start()) {
return true; // r1 start is -inf, must be before r2 end.
}
if (!r2.end()) {
return true; // r2 end is +inf, everything is before it.
}
const auto diff = cmp(r1.start()->value(), r2.end()->value());
if (diff < 0) { // r1 start is strictly before r2 end.
return true;
}
if (diff > 0) { // r1 start is strictly after r2 end.
return false;
}
const auto len1 = r1.start()->value().representation().size();
const auto len2 = r2.end()->value().representation().size();
if (len1 == len2) { // The values truly are equal.
return r1.start()->is_inclusive() && r2.end()->is_inclusive();
} else if (len1 < len2) { // r1 start is a prefix of r2 end.
// a>=(1) starts before (a,b)<=(1,1) ends, but (a)>(1) doesn't.
return r1.start()->is_inclusive();
} else { // r2 end is a prefix of r1 start.
// (a,b)>=(1,1) starts before (a)<=(1) ends but after (a)<(1) ends.
return r2.end()->is_inclusive();
}
}
/// True if r1 end is strictly before r2 end.
bool ends_before_end(
const query::clustering_range& r1,
const query::clustering_range& r2,
const clustering_key_prefix::prefix_equal_tri_compare& cmp) {
if (!r1.end()) {
return false; // r1 end is +inf, which is after everything.
}
if (!r2.end()) {
return true; // r2 end is +inf, while r1 end is finite.
}
const auto diff = cmp(r1.end()->value(), r2.end()->value());
if (diff < 0) { // r1 end is strictly before r2 end.
return true;
}
if (diff > 0) { // r1 end is strictly after r2 end.
return false;
}
const auto len1 = r1.end()->value().representation().size();
const auto len2 = r2.end()->value().representation().size();
if (len1 == len2) { // The values truly are equal.
return !r1.end()->is_inclusive() && r2.end()->is_inclusive();
} else if (len1 < len2) { // r1 end is a prefix of r2 end.
// (a)<(1) ends before (a,b)<=(1,1), but (a)<=(1) doesn't.
return !r1.end()->is_inclusive();
} else { // r2 end is a prefix of r1 end.
// (a,b)<=(1,1) ends before (a)<=(1) but after (a)<(1).
return r2.end()->is_inclusive();
}
}
/// Correct clustering_range intersection. See #8157.
std::optional<query::clustering_range> intersection(
const query::clustering_range& r1,
const query::clustering_range& r2,
const clustering_key_prefix::prefix_equal_tri_compare& cmp) {
// Assume r1's start is to the left of r2's start.
if (starts_before_start(r2, r1, cmp)) {
return intersection(r2, r1, cmp);
}
if (!starts_before_or_at_end(r2, r1, cmp)) {
return {};
}
const auto& intersection_start = r2.start();
const auto& intersection_end = ends_before_end(r1, r2, cmp) ? r1.end() : r2.end();
if (intersection_start == intersection_end && intersection_end.has_value()) {
return query::clustering_range::make_singular(intersection_end->value());
}
return query::clustering_range(intersection_start, intersection_end);
}
struct range_less {
const class schema& s;
clustering_key_prefix::less_compare cmp = clustering_key_prefix::less_compare(s);
bool operator()(const query::clustering_range& x, const query::clustering_range& y) const {
if (!x.start() && !y.start()) {
return false;
}
if (!x.start()) {
return true;
}
if (!y.start()) {
return false;
}
return cmp(x.start()->value(), y.start()->value());
}
};
/// An expression visitor that translates multi-column atoms into clustering ranges.
struct multi_column_range_accumulator {
const query_options& options;
const schema_ptr schema;
std::vector<query::clustering_range> ranges{query::clustering_range::make_open_ended_both_sides()};
const clustering_key_prefix::prefix_equal_tri_compare prefix3cmp = get_unreversed_tri_compare(*schema);
void operator()(const binary_operator& binop) {
if (is_compare(binop.op)) {
auto opt_values = dynamic_pointer_cast<tuples::value>(binop.rhs->bind(options))->get_elements();
auto& lhs = std::get<std::vector<column_value>>(binop.lhs);
std::vector<bytes> values(lhs.size());
for (size_t i = 0; i < lhs.size(); ++i) {
values[i] = *statements::request_validations::check_not_null(
opt_values[i],
"Invalid null value in condition for column %s", lhs.at(i).col->name_as_text());
}
intersect_all(to_range(binop.op, clustering_key_prefix(values)));
} else if (binop.op == oper_t::IN) {
if (auto dv = dynamic_pointer_cast<lists::delayed_value>(binop.rhs)) {
process_in_values(
dv->get_elements() | transformed(
[&] (const ::shared_ptr<term>& t) {
return static_pointer_cast<tuples::value>(t->bind(options))->get_elements();
}));
} else if (auto mkr = dynamic_pointer_cast<tuples::in_marker>(binop.rhs)) {
// This is `(a,b) IN ?`. RHS elements are themselves tuples, represented as vector<bytes_opt>.
const auto tup = static_pointer_cast<tuples::in_value>(mkr->bind(options));
statements::request_validations::check_not_null(tup, "Invalid null value for IN restriction");
process_in_values(tup->get_split_values());
}
else {
on_internal_error(rlogger, format("multi_column_range_accumulator: unexpected atom {}", binop));
}
} else {
on_internal_error(rlogger, format("multi_column_range_accumulator: unexpected atom {}", binop));
}
}
void operator()(const conjunction& c) {
std::ranges::for_each(c.children, [this] (const expression& child) { std::visit(*this, child); });
}
void operator()(bool b) {
if (!b) {
ranges.clear();
}
}
/// Intersects each range with v. If any intersection is empty, clears ranges.
void intersect_all(const query::clustering_range& v) {
for (auto& r : ranges) {
auto intrs = intersection(r, v, prefix3cmp);
if (!intrs) {
ranges.clear();
break;
}
r = *intrs;
}
}
template<std::ranges::range Range>
requires std::convertible_to<typename Range::value_type::value_type, bytes_opt>
void process_in_values(Range in_values) {
if (ranges.empty()) {
return; // Shortcircuit an easy case.
}
std::set<query::clustering_range, range_less> new_ranges(range_less{*schema});
for (const auto& current_tuple : in_values) {
// Each IN value is like a separate EQ restriction ANDed to the existing state.
auto current_range = to_range(
oper_t::EQ, clustering_key_prefix::from_optional_exploded(*schema, current_tuple));
for (const auto& r : ranges) {
auto intrs = intersection(r, current_range, prefix3cmp);
if (intrs) {
new_ranges.insert(*intrs);
}
}
}
ranges.assign(new_ranges.cbegin(), new_ranges.cend());
}
};
/// Calculates clustering bounds for the multi-column case.
std::vector<query::clustering_range> get_multi_column_clustering_bounds(
const query_options& options,
schema_ptr schema,
const std::vector<expression>& multi_column_restrictions) {
multi_column_range_accumulator acc{options, schema};
for (const auto& restr : multi_column_restrictions) {
std::visit(acc, restr);
}
return acc.ranges;
}
/// Reverses the range if the type is reversed. Why don't we have nonwrapping_interval::reverse()??
query::clustering_range reverse_if_reqd(query::clustering_range r, const abstract_type& t) {
return t.is_reversed() ? query::clustering_range(r.end(), r.start()) : std::move(r);
}
void error_if_exceeds(size_t size, size_t limit) {
if (size > limit) {
throw std::runtime_error(
fmt::format("clustering-key cartesian product size {} is greater than maximum {}", size, limit));
}
}
constexpr bool inclusive = true;
/// Calculates clustering bounds for the single-column case.
std::vector<query::clustering_range> get_single_column_clustering_bounds(
const query_options& options,
schema_ptr schema,
const std::vector<expression>& single_column_restrictions) {
const size_t size_limit =
options.get_cql_config().restrictions.clustering_key_restrictions_max_cartesian_product_size;
size_t product_size = 1;
std::vector<std::vector<bytes>> prior_column_values; // Equality values of columns seen so far.
for (size_t i = 0; i < single_column_restrictions.size(); ++i) {
auto values = possible_lhs_values(
&schema->clustering_column_at(i), // This should be the LHS of restrictions[i].
single_column_restrictions[i],
options);
if (auto list = std::get_if<value_list>(&values)) {
if (list->empty()) { // Impossible condition -- no rows can possibly match.
return {};
}
prior_column_values.push_back(*list);
product_size *= list->size();
error_if_exceeds(product_size, size_limit);
} else if (auto last_range = std::get_if<nonwrapping_interval<bytes>>(&values)) {
// Must be the last column in the prefix, since it's neither EQ nor IN.
std::vector<query::clustering_range> ck_ranges;
if (prior_column_values.empty()) {
// This is the first and last range; just turn it into a clustering_key_prefix.
ck_ranges.push_back(
reverse_if_reqd(
last_range->transform([] (const bytes& val) { return clustering_key_prefix({val}); }),
*schema->clustering_column_at(i).type));
} else {
// Prior clustering columns are equality-restricted (either via = or IN), producing one or more
// prior_column_values elements. Now we will turn each such element into a CK range dictated by those
// equalities and this inequality represented by last_range. Each CK range's upper/lower bound is
// formed by extending the Cartesian-product element with the corresponding last_range bound, if it
// exists; if it doesn't, the CK range bound is just the Cartesian-product element, inclusive.
//
// For example, the expression `c1=1 AND c2=2 AND c3>3` makes lower CK bound (1,2,3) exclusive and
// upper CK bound (1,2) inclusive.
ck_ranges.reserve(product_size);
const auto extra_lb = last_range->start(), extra_ub = last_range->end();
for (auto& b : cartesian_product(prior_column_values)) {
auto new_lb = b, new_ub = b;
if (extra_lb) {
new_lb.push_back(extra_lb->value());
}
if (extra_ub) {
new_ub.push_back(extra_ub->value());
}
query::clustering_range::bound new_start(new_lb, extra_lb ? extra_lb->is_inclusive() : inclusive);
query::clustering_range::bound new_end (new_ub, extra_ub ? extra_ub->is_inclusive() : inclusive);
ck_ranges.push_back(reverse_if_reqd({new_start, new_end}, *schema->clustering_column_at(i).type));
}
}
sort(ck_ranges.begin(), ck_ranges.end(), range_less{*schema});
return ck_ranges;
}
}
// All prefix columns are restricted by EQ or IN. The resulting CK ranges are just singular ranges of corresponding
// prior_column_values.
std::vector<query::clustering_range> ck_ranges(product_size);
cartesian_product cp(prior_column_values);
std::transform(cp.begin(), cp.end(), ck_ranges.begin(), std::bind_front(query::clustering_range::make_singular));
sort(ck_ranges.begin(), ck_ranges.end(), range_less{*schema});
return ck_ranges;
}
using opt_bound = std::optional<query::clustering_range::bound>;
/// Makes a partial bound out of whole_bound's prefix. If the partial bound is strictly shorter than the whole, it is
/// exclusive. Otherwise, it matches the whole_bound's inclusivity.
opt_bound make_prefix_bound(
size_t prefix_len, const std::vector<bytes>& whole_bound, bool whole_bound_is_inclusive) {
if (whole_bound.empty()) {
return {};
}
// Couldn't get std::ranges::subrange(whole_bound, prefix_len) to compile :(
std::vector<bytes> partial_bound(
whole_bound.cbegin(), whole_bound.cbegin() + std::min(prefix_len, whole_bound.size()));
return query::clustering_range::bound(
clustering_key_prefix(move(partial_bound)),
prefix_len >= whole_bound.size() && whole_bound_is_inclusive);
}
/// Given a multi-column range in CQL order, breaks it into an equivalent union of clustering-order ranges. Returns
/// those ranges as vector elements.
///
/// A difference between CQL order and clustering order means that the right-hand side of a clustering-key comparison is
/// not necessarily a single (lower or upper) bound on the clustering key in storage. Eg, `WITH CLUSTERING ORDER BY (a
/// ASC, b DESC)` indicates that "a less than 5" means "a comes before 5 in storage", but "b less than 5" means "b comes
/// AFTER 5 in storage". Therefore the CQL expression (a,b)<(5,5) cannot be executed by fetching a single range from
/// the storage layer -- the right-hand side is not a single upper bound from the storage layer's perspective.
///
/// When translating the WHERE clause into clustering ranges to fetch, it's natural to first calculate the CQL-order
/// ranges: comparisons define ranges, the AND operator intersects them, the IN operator makes a Cartesian product. The
/// result of this is a union of ranges in CQL order that define the clustering slice to fetch. And if the clustering
/// order is the same as the CQL order, these ranges can be sent to the storage proxy directly to fetch the correct
/// result. But if the two orders differ, there is some work to be done first. This is simple enough for ranges that
/// only vary a single column -- see reverse_if_reqd(). Multi-column ranges are more complicated; they are translated
/// into an equivalent union of clustering-order ranges by get_equivalent_ranges().
///
/// Continuing the above example, we can translate the CQL expression (a,b)<(5,5) into a union of several ranges that
/// are continuous in storage. We begin by observing that (a,b)<(5,5) is the same as a<5 OR (a=5 AND b<5). This is a
/// union of two ranges: the range corresponding to a<5, plus the range corresponding to (a=5 AND b<5). Note that both
/// of these ranges are continuous in storage because they only vary a single column:
///
/// * a<5 is a range from -inf to clustering_key_prefix(5) exclusive
///
/// * (a=5 AND b<5) is a range from clustering_key_prefix(5,5) exclusive to clustering_key_prefix(5) inclusive; note
/// the clustering order between those start/end bounds
///
/// Here is an illustration of those two ranges in storage, with rows represented vertically and clustering-ordered left
/// to right:
///
/// a: 4 4 4 4 4 4 5 5 5 5 5 5 5 5 6 6 6 6 6
/// b: 5 4 3 2 1 0 7 6 5 4 3 2 1 0 5 4 3 2 1
/// 1st range ^^^^^^^^^^^ ^^^^^^^^^ 2nd range
///
/// For more examples of this range translation, please see the statement_restrictions unit tests.
std::vector<query::clustering_range> get_equivalent_ranges(
const query::clustering_range& cql_order_range, const schema& schema) {
const auto& cql_lb = cql_order_range.start();
const auto& cql_ub = cql_order_range.end();
if (cql_lb == cql_ub && (!cql_lb || cql_lb->is_inclusive())) {
return {cql_order_range};
}
const auto cql_lb_bytes = cql_lb ? cql_lb->value().explode(schema) : std::vector<bytes>{};
const auto cql_ub_bytes = cql_ub ? cql_ub->value().explode(schema) : std::vector<bytes>{};
const bool cql_lb_is_inclusive = cql_lb ? cql_lb->is_inclusive() : false;
const bool cql_ub_is_inclusive = cql_ub ? cql_ub->is_inclusive() : false;
size_t common_prefix_len = 0;
// Skip equal values; they don't contribute to equivalent-range generation.
while (common_prefix_len < cql_lb_bytes.size() && common_prefix_len < cql_ub_bytes.size() &&
cql_lb_bytes[common_prefix_len] == cql_ub_bytes[common_prefix_len]) {
++common_prefix_len;
}
std::vector<query::clustering_range> ranges;
// First range is special: it has both bounds.
opt_bound lb1 = make_prefix_bound(
common_prefix_len + 1, cql_lb_bytes, cql_lb_is_inclusive);
opt_bound ub1 = make_prefix_bound(
common_prefix_len + 1, cql_ub_bytes, cql_ub_is_inclusive);
auto range1 = schema.clustering_column_at(common_prefix_len).type->is_reversed() ?
query::clustering_range(ub1, lb1) : query::clustering_range(lb1, ub1);
ranges.push_back(std::move(range1));
for (size_t p = common_prefix_len + 2; p <= cql_lb_bytes.size(); ++p) {
opt_bound lb = make_prefix_bound(p, cql_lb_bytes, cql_lb_is_inclusive);
opt_bound ub = make_prefix_bound(p - 1, cql_lb_bytes, /*irrelevant:*/true);
if (ub) {
ub = query::clustering_range::bound(ub->value(), inclusive);
}
auto range = schema.clustering_column_at(p - 1).type->is_reversed() ?
query::clustering_range(ub, lb) : query::clustering_range(lb, ub);
ranges.push_back(std::move(range));
}
for (size_t p = common_prefix_len + 2; p <= cql_ub_bytes.size(); ++p) {
// Note the difference from the cql_lb_bytes case above!
opt_bound ub = make_prefix_bound(p, cql_ub_bytes, cql_ub_is_inclusive);
opt_bound lb = make_prefix_bound(p - 1, cql_ub_bytes, /*irrelevant:*/true);
if (lb) {
lb = query::clustering_range::bound(lb->value(), inclusive);
}
auto range = schema.clustering_column_at(p - 1).type->is_reversed() ?
query::clustering_range(ub, lb) : query::clustering_range(lb, ub);
ranges.push_back(std::move(range));
}
return ranges;
}
/// Extracts raw multi-column bounds from exprs; last one wins.
query::clustering_range range_from_raw_bounds(
const std::vector<expression>& exprs, const query_options& options, const schema& schema) {
opt_bound lb, ub;
for (const auto& e : exprs) {
if (auto b = find_clustering_order(e)) {
const auto tup = dynamic_pointer_cast<tuples::value>(b->rhs->bind(options));
if (!tup) {
on_internal_error(rlogger, format("range_from_raw_bounds: unexpected atom {}", *b));
}
const auto r = to_range(
b->op, clustering_key_prefix::from_optional_exploded(schema, tup->get_elements()));
if (r.start()) {
lb = r.start();
}
if (r.end()) {
ub = r.end();
}
}
}
return {lb, ub};
}
} // anonymous namespace
std::vector<query::clustering_range> statement_restrictions::get_clustering_bounds(const query_options& options) const {
if (_clustering_prefix_restrictions.empty()) {
if (_clustering_columns_restrictions->empty()) {
return {query::clustering_range::make_open_ended_both_sides()};
}
if (count_if(_clustering_prefix_restrictions[0], expr::is_multi_column)) {
bool all_natural = true, all_reverse = true; ///< Whether column types are reversed or natural.
for (auto& r : _clustering_prefix_restrictions) { // TODO: move to constructor, do only once.
using namespace expr;
const auto& binop = std::get<binary_operator>(r);
if (is_clustering_order(binop)) {
return {range_from_raw_bounds(_clustering_prefix_restrictions, options, *_schema)};
}
for (auto& cv : std::get<std::vector<column_value>>(binop.lhs)) {
if (cv.col->type->is_reversed()) {
all_natural = false;
} else {
all_reverse = false;
}
}
if (_clustering_columns_restrictions->needs_filtering(*_schema)) {
if (auto single_ck_restrictions = dynamic_pointer_cast<single_column_clustering_key_restrictions>(_clustering_columns_restrictions)) {
return single_ck_restrictions->get_longest_prefix_restrictions()->bounds_ranges(options);
}
auto bounds = get_multi_column_clustering_bounds(options, _schema, _clustering_prefix_restrictions);
if (!all_natural && !all_reverse) {
std::vector<query::clustering_range> bounds_in_clustering_order;
for (const auto& b : bounds) {
const auto eqv = get_equivalent_ranges(b, *_schema);
bounds_in_clustering_order.insert(bounds_in_clustering_order.end(), eqv.cbegin(), eqv.cend());
}
return bounds_in_clustering_order;
}
if (all_reverse) {
for (auto& crange : bounds) {
crange = query::clustering_range(crange.end(), crange.start());
}
}
return bounds;
} else {
return get_single_column_clustering_bounds(options, _schema, _clustering_prefix_restrictions);
return {query::clustering_range::make_open_ended_both_sides()};
}
return _clustering_columns_restrictions->bounds_ranges(options);
}
namespace {

View File

@@ -104,11 +104,6 @@ private:
bool _has_queriable_regular_index = false, _has_queriable_pk_index = false, _has_queriable_ck_index = false;
std::optional<expr::expression> _where; ///< The entire WHERE clause.
std::vector<expr::expression> _clustering_prefix_restrictions; ///< Parts of _where defining the clustering slice.
bool _partition_range_is_simple; ///< False iff _partition_range_restrictions imply a Cartesian product.
public:
/**
* Creates a new empty <code>StatementRestrictions</code>.

View File

@@ -25,7 +25,6 @@
#include "stats.hh"
namespace cql3 {
class untyped_result_set;
class result_generator {
schema_ptr _schema;
@@ -34,7 +33,6 @@ class result_generator {
shared_ptr<const selection::selection> _selection;
cql_stats* _stats;
private:
friend class untyped_result_set;
template<typename Visitor>
class query_result_visitor {
const schema& _schema;
@@ -84,7 +82,7 @@ private:
if (_clustering_key.size() > def->component_index()) {
_visitor.accept_value(query::result_bytes_view(bytes_view(_clustering_key[def->component_index()])));
} else {
_visitor.accept_value(std::nullopt);
_visitor.accept_value({});
}
break;
case column_kind::regular_column:
@@ -109,7 +107,7 @@ private:
} else if (def->is_static()) {
accept_cell_value(*def, static_row_iterator);
} else {
_visitor.accept_value(std::nullopt);
_visitor.accept_value({});
}
}
_visitor.end_row();

View File

@@ -97,7 +97,7 @@ public:
if (!value) {
return std::nullopt;
}
auto&& buffers = _type->split(single_fragmented_view(*value));
auto&& buffers = _type->split(*value);
bytes_opt ret;
if (_field < buffers.size() && buffers[_field]) {
ret = to_bytes(*buffers[_field]);

View File

@@ -44,7 +44,6 @@
#include "service/migration_manager.hh"
#include "db/system_keyspace.hh"
#include "database.hh"
#include "cql3/query_processor.hh"
bool is_system_keyspace(std::string_view keyspace);
@@ -92,11 +91,10 @@ void cql3::statements::alter_keyspace_statement::validate(service::storage_proxy
}
}
future<shared_ptr<cql_transport::event::schema_change>> cql3::statements::alter_keyspace_statement::announce_migration(query_processor& qp) const {
service::storage_proxy& proxy = qp.proxy();
future<shared_ptr<cql_transport::event::schema_change>> cql3::statements::alter_keyspace_statement::announce_migration(service::storage_proxy& proxy) const {
auto old_ksm = proxy.get_db().local().find_keyspace(_name).metadata();
const auto& tm = *proxy.get_token_metadata_ptr();
return qp.get_migration_manager().announce_keyspace_update(_attrs->as_ks_metadata_update(old_ksm, tm)).then([this] {
return service::get_local_migration_manager().announce_keyspace_update(_attrs->as_ks_metadata_update(old_ksm, tm)).then([this] {
using namespace cql_transport;
return ::make_shared<event::schema_change>(
event::schema_change::change_type::UPDATED,

View File

@@ -48,8 +48,6 @@
namespace cql3 {
class query_processor;
namespace statements {
class alter_keyspace_statement : public schema_altering_statement {
@@ -63,7 +61,7 @@ public:
future<> check_access(service::storage_proxy& proxy, const service::client_state& state) const override;
void validate(service::storage_proxy& proxy, const service::client_state& state) const override;
future<shared_ptr<cql_transport::event::schema_change>> announce_migration(query_processor& qp) const override;
future<shared_ptr<cql_transport::event::schema_change>> announce_migration(service::storage_proxy& proxy) const override;
virtual std::unique_ptr<prepared_statement> prepare(database& db, cql_stats& stats) override;
};

View File

@@ -49,8 +49,6 @@
namespace cql3 {
class query_processor;
namespace statements {
class alter_role_statement final : public authentication_statement {
@@ -71,7 +69,7 @@ public:
virtual future<> check_access(service::storage_proxy& proxy, const service::client_state&) const override;
virtual future<::shared_ptr<cql_transport::messages::result_message>>
execute(query_processor&, service::query_state&, const query_options&) const override;
execute(service::storage_proxy&, service::query_state&, const query_options&) const override;
};
}

View File

@@ -51,16 +51,15 @@
#include "view_info.hh"
#include "database.hh"
#include "db/view/view.hh"
#include "cql3/query_processor.hh"
namespace cql3 {
namespace statements {
alter_table_statement::alter_table_statement(cf_name name,
alter_table_statement::alter_table_statement(shared_ptr<cf_name> name,
type t,
std::vector<column_change> column_changes,
std::optional<cf_prop_defs> properties,
shared_ptr<cf_prop_defs> properties,
renames_type renames)
: schema_altering_statement(std::move(name))
, _type(t)
@@ -289,9 +288,9 @@ void alter_table_statement::drop_column(const schema& schema, const table& cf, s
}
}
future<shared_ptr<cql_transport::event::schema_change>> alter_table_statement::announce_migration(query_processor& qp) const
future<shared_ptr<cql_transport::event::schema_change>> alter_table_statement::announce_migration(service::storage_proxy& proxy) const
{
database& db = qp.db();
auto& db = proxy.get_db().local();
auto s = validation::validate_column_family(db, keyspace(), column_family());
if (s->is_view()) {
throw exceptions::invalid_request_exception("Cannot use ALTER TABLE on Materialized View");
@@ -397,7 +396,7 @@ future<shared_ptr<cql_transport::event::schema_change>> alter_table_statement::a
break;
}
return qp.get_migration_manager().announce_column_family_update(cfm.build(), false, std::move(view_updates))
return service::get_local_migration_manager().announce_column_family_update(cfm.build(), false, std::move(view_updates))
.then([this] {
using namespace cql_transport;
return ::make_shared<event::schema_change>(

View File

@@ -48,8 +48,6 @@
namespace cql3 {
class query_processor;
namespace statements {
class alter_table_statement : public schema_altering_statement {
@@ -71,18 +69,18 @@ public:
private:
const type _type;
const std::vector<column_change> _column_changes;
const std::optional<cf_prop_defs> _properties;
const shared_ptr<cf_prop_defs> _properties;
const renames_type _renames;
public:
alter_table_statement(cf_name name,
alter_table_statement(shared_ptr<cf_name> name,
type t,
std::vector<column_change> column_changes,
std::optional<cf_prop_defs> properties,
shared_ptr<cf_prop_defs> properties,
renames_type renames);
virtual future<> check_access(service::storage_proxy& proxy, const service::client_state& state) const override;
virtual void validate(service::storage_proxy& proxy, const service::client_state& state) const override;
virtual future<shared_ptr<cql_transport::event::schema_change>> announce_migration(query_processor& qp) const override;
virtual future<shared_ptr<cql_transport::event::schema_change>> announce_migration(service::storage_proxy& proxy) const override;
virtual std::unique_ptr<prepared_statement> prepare(database& db, cql_stats& stats) override;
private:
void add_column(const schema& schema, const table& cf, schema_builder& cfm, std::vector<view_ptr>& view_updates, const column_identifier& column_name, const cql3_type validator, const column_definition* def, bool is_static) const;

View File

@@ -39,7 +39,6 @@
#include "cql3/statements/alter_type_statement.hh"
#include "cql3/statements/create_type_statement.hh"
#include "cql3/query_processor.hh"
#include "prepared_statement.hh"
#include "schema_builder.hh"
#include "service/migration_manager.hh"
@@ -79,7 +78,7 @@ const sstring& alter_type_statement::keyspace() const
return _name.get_keyspace();
}
void alter_type_statement::do_announce_migration(database& db, service::migration_manager& mm, ::keyspace& ks) const
void alter_type_statement::do_announce_migration(database& db, ::keyspace& ks) const
{
auto&& all_types = ks.metadata()->user_types().get_all_types();
auto to_update = all_types.find(_name.get_user_type_name());
@@ -101,7 +100,7 @@ void alter_type_statement::do_announce_migration(database& db, service::migratio
// Now, we need to announce the type update to basically change it for new tables using this type,
// but we also need to find all existing user types and CF using it and change them.
mm.announce_type_update(updated).get();
service::get_local_migration_manager().announce_type_update(updated).get();
for (auto&& schema : ks.metadata()->cf_meta_data() | boost::adaptors::map_values) {
auto cfm = schema_builder(schema);
@@ -116,21 +115,21 @@ void alter_type_statement::do_announce_migration(database& db, service::migratio
}
if (modified) {
if (schema->is_view()) {
mm.announce_view_update(view_ptr(cfm.build())).get();
service::get_local_migration_manager().announce_view_update(view_ptr(cfm.build())).get();
} else {
mm.announce_column_family_update(cfm.build(), false, {}).get();
service::get_local_migration_manager().announce_column_family_update(cfm.build(), false, {}).get();
}
}
}
}
future<shared_ptr<cql_transport::event::schema_change>> alter_type_statement::announce_migration(query_processor& qp) const
future<shared_ptr<cql_transport::event::schema_change>> alter_type_statement::announce_migration(service::storage_proxy& proxy) const
{
return seastar::async([this, &qp] {
database& db = qp.db();
return seastar::async([this, &proxy] {
auto&& db = proxy.get_db().local();
try {
auto&& ks = db.find_keyspace(keyspace());
do_announce_migration(db, qp.get_migration_manager(), ks);
do_announce_migration(db, ks);
using namespace cql_transport;
return ::make_shared<event::schema_change>(
event::schema_change::change_type::UPDATED,

View File

@@ -45,14 +45,8 @@
#include "cql3/cql3_type.hh"
#include "cql3/ut_name.hh"
namespace service {
class migration_manager;
}
namespace cql3 {
class query_processor;
namespace statements {
class alter_type_statement : public schema_altering_statement {
@@ -69,14 +63,14 @@ public:
virtual const sstring& keyspace() const override;
virtual future<shared_ptr<cql_transport::event::schema_change>> announce_migration(query_processor& qp) const override;
virtual future<shared_ptr<cql_transport::event::schema_change>> announce_migration(service::storage_proxy& proxy) const override;
class add_or_alter;
class renames;
protected:
virtual user_type make_updated_type(database& db, user_type to_update) const = 0;
private:
void do_announce_migration(database& db, service::migration_manager& mm, ::keyspace& ks) const;
void do_announce_migration(database& db, ::keyspace& ks) const;
};
class alter_type_statement::add_or_alter : public alter_type_statement {

View File

@@ -46,13 +46,12 @@
#include "view_info.hh"
#include "db/extensions.hh"
#include "database.hh"
#include "cql3/query_processor.hh"
namespace cql3 {
namespace statements {
alter_view_statement::alter_view_statement(cf_name view_name, std::optional<cf_prop_defs> properties)
alter_view_statement::alter_view_statement(::shared_ptr<cf_name> view_name, ::shared_ptr<cf_prop_defs> properties)
: schema_altering_statement{std::move(view_name)}
, _properties{std::move(properties)}
{
@@ -77,9 +76,9 @@ void alter_view_statement::validate(service::storage_proxy&, const service::clie
// validated in announce_migration()
}
future<shared_ptr<cql_transport::event::schema_change>> alter_view_statement::announce_migration(query_processor& qp) const
future<shared_ptr<cql_transport::event::schema_change>> alter_view_statement::announce_migration(service::storage_proxy& proxy) const
{
database& db = qp.db();
auto&& db = proxy.get_db().local();
schema_ptr schema = validation::validate_column_family(db, keyspace(), column_family());
if (!schema->is_view()) {
throw exceptions::invalid_request_exception("Cannot use ALTER MATERIALIZED VIEW on Table");
@@ -90,7 +89,7 @@ future<shared_ptr<cql_transport::event::schema_change>> alter_view_statement::an
}
auto schema_extensions = _properties->make_schema_extensions(db.extensions());
_properties->validate(db, schema_extensions);
_properties->validate(proxy.get_db().local(), schema_extensions);
auto builder = schema_builder(schema);
_properties->apply_to_builder(builder, std::move(schema_extensions));
@@ -109,7 +108,7 @@ future<shared_ptr<cql_transport::event::schema_change>> alter_view_statement::an
"the corresponding data in the parent table.");
}
return qp.get_migration_manager().announce_view_update(view_ptr(builder.build())).then([this] {
return service::get_local_migration_manager().announce_view_update(view_ptr(builder.build())).then([this] {
using namespace cql_transport;
return ::make_shared<event::schema_change>(

View File

@@ -50,22 +50,20 @@
namespace cql3 {
class query_processor;
namespace statements {
/** An <code>ALTER MATERIALIZED VIEW</code> parsed from a CQL query statement. */
class alter_view_statement : public schema_altering_statement {
private:
std::optional<cf_prop_defs> _properties;
::shared_ptr<cf_prop_defs> _properties;
public:
alter_view_statement(cf_name view_name, std::optional<cf_prop_defs> properties);
alter_view_statement(::shared_ptr<cf_name> view_name, ::shared_ptr<cf_prop_defs> properties);
virtual future<> check_access(service::storage_proxy& proxy, const service::client_state& state) const override;
virtual void validate(service::storage_proxy&, const service::client_state& state) const override;
virtual future<shared_ptr<cql_transport::event::schema_change>> announce_migration(query_processor& qp) const override;
virtual future<shared_ptr<cql_transport::event::schema_change>> announce_migration(service::storage_proxy& proxy) const override;
virtual std::unique_ptr<prepared_statement> prepare(database& db, cql_stats& stats) override;
};

View File

@@ -46,7 +46,13 @@ uint32_t cql3::statements::authentication_statement::get_bound_terms() const {
return 0;
}
bool cql3::statements::authentication_statement::depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const {
bool cql3::statements::authentication_statement::depends_on_keyspace(
const sstring& ks_name) const {
return false;
}
bool cql3::statements::authentication_statement::depends_on_column_family(
const sstring& cf_name) const {
return false;
}

View File

@@ -56,7 +56,9 @@ public:
uint32_t get_bound_terms() const override;
bool depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const override;
bool depends_on_keyspace(const sstring& ks_name) const override;
bool depends_on_column_family(const sstring& cf_name) const override;
future<> check_access(service::storage_proxy& proxy, const service::client_state& state) const override;

View File

@@ -46,7 +46,13 @@ uint32_t cql3::statements::authorization_statement::get_bound_terms() const {
return 0;
}
bool cql3::statements::authorization_statement::depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const {
bool cql3::statements::authorization_statement::depends_on_keyspace(
const sstring& ks_name) const {
return false;
}
bool cql3::statements::authorization_statement::depends_on_column_family(
const sstring& cf_name) const {
return false;
}

View File

@@ -60,7 +60,9 @@ public:
uint32_t get_bound_terms() const override;
bool depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const override;
bool depends_on_keyspace(const sstring& ks_name) const override;
bool depends_on_column_family(const sstring& cf_name) const override;
future<> check_access(service::storage_proxy& proxy, const service::client_state& state) const override;

View File

@@ -45,7 +45,6 @@
#include "database.hh"
#include <seastar/core/execution_stage.hh>
#include "cas_request.hh"
#include "cql3/query_processor.hh"
namespace cql3 {
@@ -60,8 +59,8 @@ timeout_for_type(batch_statement::type t) {
: &timeout_config::write_timeout;
}
db::timeout_clock::duration batch_statement::get_timeout(const service::client_state& state, const query_options& options) const {
return _attrs->is_timeout_set() ? _attrs->get_timeout(options) : state.get_timeout_config().*get_timeout_config_selector();
db::timeout_clock::duration batch_statement::get_timeout(const query_options& options) const {
return _attrs->is_timeout_set() ? _attrs->get_timeout(options) : options.get_timeout_config().*get_timeout_config_selector();
}
batch_statement::batch_statement(int bound_terms, type type_,
@@ -93,9 +92,14 @@ batch_statement::batch_statement(type type_,
{
}
bool batch_statement::depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const
bool batch_statement::depends_on_keyspace(const sstring& ks_name) const
{
return boost::algorithm::any_of(_statements, [&ks_name, &cf_name] (auto&& s) { return s.statement->depends_on(ks_name, cf_name); });
return false;
}
bool batch_statement::depends_on_column_family(const sstring& cf_name) const
{
return false;
}
uint32_t batch_statement::get_bound_terms() const
@@ -259,8 +263,7 @@ static thread_local inheriting_concrete_execution_stage<
api::timestamp_type> batch_stage{"cql3_batch", batch_statement_executor::get()};
future<shared_ptr<cql_transport::messages::result_message>> batch_statement::execute(
query_processor& qp, service::query_state& state, const query_options& options) const {
service::storage_proxy& storage = qp.proxy();
service::storage_proxy& storage, service::query_state& state, const query_options& options) const {
cql3::util::validate_timestamp(options, _attrs);
return batch_stage(this, seastar::ref(storage), seastar::ref(state),
seastar::cref(options), false, options.get_timestamp(state));
@@ -287,7 +290,7 @@ future<shared_ptr<cql_transport::messages::result_message>> batch_statement::do_
++_stats.batches;
_stats.statements_in_batches += _statements.size();
auto timeout = db::timeout_clock::now() + get_timeout(query_state.get_client_state(), options);
auto timeout = db::timeout_clock::now() + get_timeout(options);
return get_mutations(storage, options, timeout, local, now, query_state).then([this, &storage, &options, timeout, tr_state = query_state.get_trace_state(),
permit = query_state.get_permit()] (std::vector<mutation> ms) mutable {
return execute_without_conditions(storage, std::move(ms), options.get_consistency(), timeout, std::move(tr_state), std::move(permit));
@@ -344,7 +347,7 @@ future<shared_ptr<cql_transport::messages::result_message>> batch_statement::exe
schema_ptr schema;
db::timeout_clock::time_point now = db::timeout_clock::now();
const timeout_config& cfg = qs.get_client_state().get_timeout_config();
const timeout_config& cfg = options.get_timeout_config();
auto batch_timeout = now + cfg.write_timeout; // Statement timeout.
auto cas_timeout = now + cfg.cas_timeout; // Ballot contention timeout.
auto read_timeout = now + cfg.read_timeout; // Query timeout.

View File

@@ -56,8 +56,6 @@
namespace cql3 {
class query_processor;
namespace statements {
/**
@@ -120,7 +118,9 @@ public:
std::unique_ptr<attributes> attrs,
cql_stats& stats);
virtual bool depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const override;
virtual bool depends_on_keyspace(const sstring& ks_name) const override;
virtual bool depends_on_column_family(const sstring& cf_name) const override;
virtual uint32_t get_bound_terms() const override;
@@ -150,7 +150,7 @@ public:
static void verify_batch_size(service::storage_proxy& proxy, const std::vector<mutation>& mutations);
virtual future<shared_ptr<cql_transport::messages::result_message>> execute(
query_processor& qp, service::query_state& state, const query_options& options) const override;
service::storage_proxy& storage, service::query_state& state, const query_options& options) const override;
private:
friend class batch_statement_executor;
future<shared_ptr<cql_transport::messages::result_message>> do_execute(
@@ -171,7 +171,7 @@ private:
const query_options& options,
service::query_state& state) const;
db::timeout_clock::duration get_timeout(const service::client_state& state, const query_options& options) const;
db::timeout_clock::duration get_timeout(const query_options& options) const;
public:
// FIXME: no cql_statement::to_string() yet
#if 0

View File

@@ -78,7 +78,7 @@ const sstring cf_prop_defs::COMPACTION_STRATEGY_CLASS_KEY = "class";
const sstring cf_prop_defs::COMPACTION_ENABLED_KEY = "enabled";
schema::extensions_map cf_prop_defs::make_schema_extensions(const db::extensions& exts) const {
schema::extensions_map cf_prop_defs::make_schema_extensions(const db::extensions& exts) {
schema::extensions_map er;
for (auto& p : exts.schema_extensions()) {
auto i = _properties.find(p.first);
@@ -235,7 +235,7 @@ const cdc::options* cf_prop_defs::get_cdc_options(const schema::extensions_map&
return &cdc_ext->get_options();
}
void cf_prop_defs::apply_to_builder(schema_builder& builder, schema::extensions_map schema_extensions) const {
void cf_prop_defs::apply_to_builder(schema_builder& builder, schema::extensions_map schema_extensions) {
if (has_property(KW_COMMENT)) {
builder.set_comment(get_string(KW_COMMENT, ""));
}

View File

@@ -93,7 +93,7 @@ public:
private:
mutable std::optional<sstables::compaction_strategy_type> _compaction_strategy_class;
public:
schema::extensions_map make_schema_extensions(const db::extensions& exts) const;
schema::extensions_map make_schema_extensions(const db::extensions& exts);
void validate(const database& db, const schema::extensions_map& schema_extensions) const;
std::map<sstring, sstring> get_compaction_options() const;
std::optional<std::map<sstring, sstring>> get_compression_options() const;
@@ -121,7 +121,7 @@ public:
int32_t get_paxos_grace_seconds() const;
std::optional<utils::UUID> get_id() const;
void apply_to_builder(schema_builder& builder, schema::extensions_map schema_extensions) const;
void apply_to_builder(schema_builder& builder, schema::extensions_map schema_extensions);
void validate_minimum_int(const sstring& field, int32_t minimum_value, int32_t default_value) const;
};

View File

@@ -48,7 +48,7 @@ namespace statements {
namespace raw {
cf_statement::cf_statement(std::optional<cf_name> cf_name)
cf_statement::cf_statement(::shared_ptr<cf_name> cf_name)
: _cf_name(std::move(cf_name))
{
}

View File

@@ -25,7 +25,6 @@
#include "service/migration_manager.hh"
#include "lua.hh"
#include "database.hh"
#include "cql3/query_processor.hh"
namespace cql3 {
@@ -60,11 +59,11 @@ std::unique_ptr<prepared_statement> create_function_statement::prepare(database&
}
future<shared_ptr<cql_transport::event::schema_change>> create_function_statement::announce_migration(
query_processor& qp) const {
service::storage_proxy& proxy) const {
if (!_func) {
return make_ready_future<::shared_ptr<cql_transport::event::schema_change>>();
}
return qp.get_migration_manager().announce_new_function(_func).then([this] {
return service::get_local_migration_manager().announce_new_function(_func).then([this] {
return create_schema_change(*_func, true);
});
}

Some files were not shown because too many files have changed in this diff Show More