Compare commits

...

50 Commits

Author SHA1 Message Date
Avi Kivity
dffbcabbb1 Merge "mutation_writer: explicitly close writers" from Benny
"
_consumer_fut is expected to return an exception
on the abort path.  Wait for it and drop any exception
so it won't be abandoned as seen in #7904.

A future<> close() method was added to return
_consumer_fut.  It is called both after abort()
in the error path, and after consume_end_of_stream,
on the success path.

With that, consume_end_of_stream was made void
as it doesn't return a future<> anymore.
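A minimal Python/asyncio sketch of the close() pattern described above (the real code is Seastar C++; all names here are illustrative): the writer keeps a background consumer task, and close() awaits it, dropping the expected exception on the abort path so the task is never abandoned.

```python
import asyncio

class BucketWriter:
    def __init__(self):
        self._aborted = False
        self._consumer_task = asyncio.ensure_future(self._consume())

    async def _consume(self):
        # Stand-in for the consumer loop; on the abort path it is
        # *expected* to raise.
        if self._aborted:
            raise RuntimeError("writer aborted")

    def abort(self):
        self._aborted = True

    async def close(self):
        # Wait for the consumer and drop any exception, instead of
        # leaving it to be reported as an abandoned task/future.
        try:
            await self._consumer_task
        except Exception:
            pass

async def main():
    w = BucketWriter()
    w.abort()          # error path
    await w.close()    # the consumer's exception is consumed here

asyncio.run(main())
```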

Fixes #7904
Test: unit(release)
"

* tag 'close-bucket-writer-v5' of github.com:bhalevy/scylla:
  mutation_writer: bucket_writer: add close
  mutation_writer/feed_writers: refactor bucket/shard writers
  mutation_writer: update bucket/shard writers consume_end_of_stream

(cherry picked from commit f11a0700a8)
2021-03-21 18:09:45 +02:00
Avi Kivity
a715c27a7f Merge 'cdc: Limit size of topology description' from Piotr Jastrzębski
Currently, the whole topology description for CDC is stored in a single row.
This means that for a large cluster of powerful machines (say, 100 nodes with
64 cpus each), the size of the topology description can reach 32MB.

This causes multiple problems. First of all, there's a hard limit on
mutation size that can be written to Scylla. It's related to commit log
block size which is 16MB by default. Mutations bigger than that can't be
saved. Moreover, such big partitions/rows cause reactor stalls and
negatively influence latency of other requests.

This patch limits the size of topology description to about 4MB. This is
done by reducing the number of CDC streams per vnode and can lead to CDC
data not being fully colocated with Base Table data on shards. It can
impact performance and consistency of data.

This is just a quick fix to make it easily backportable. A full solution
to the problem is under development.

For more details see #7961, #7993 and #7985.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>

Closes #8048

* github.com:scylladb/scylla:
  cdc: Limit size of topology description
  cdc: Extract create_stream_ids from topology_description_generator

(cherry picked from commit c63e26e26f)
2021-03-21 14:05:36 +02:00
Benny Halevy
6804332291 dist: scylla_util: prevent IndexError when no ephemeral_disks were found
Currently we call firstNvmeSize before checking that we have enough
(at least 1) ephemeral disks.  When none are found, we hit the following
error (see #7971):
```
File "/opt/scylladb/scripts/libexec/scylla_io_setup", line 239, in
if idata.is_recommended_instance():
File "/opt/scylladb/scripts/scylla_util.py", line 311, in is_recommended_instance
diskSize = self.firstNvmeSize
File "/opt/scylladb/scripts/scylla_util.py", line 291, in firstNvmeSize
firstDisk = ephemeral_disks[0]
IndexError: list index out of range
```

This change reverses the order and first checks that we found
enough disks before getting the first disk size.
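A minimal sketch of the reordering (simplified from scylla_util.py; names and the error message are illustrative): check the disk count before indexing into the list.

```python
def first_nvme_size(ephemeral_disks):
    # Old order: ephemeral_disks[0] first -> IndexError on an empty list.
    # New order: check the count first and fail with a clear message.
    if len(ephemeral_disks) < 1:
        raise RuntimeError("no ephemeral disks found")
    return ephemeral_disks[0]["size"]

disks = [{"name": "nvme0n1", "size": 950 * 1024**3}]
assert first_nvme_size(disks) == 950 * 1024**3

try:
    first_nvme_size([])
except RuntimeError as e:
    print(e)
```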

Fixes #7971

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #8027

(cherry picked from commit 55e3df8a72)
2021-03-21 12:19:24 +02:00
Nadav Har'El
a20991ad62 storage_service: correct missing exception in logging rebuild failure
When failing to rebuild a node, we would print the error with the useless
explanation "<no exception>". The problem was a typo in the logging command
which used std::current_exception() - which wasn't relevant at that point -
instead of "ep".

Refs #8089

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210314113118.1690132-1-nyh@scylladb.com>
(cherry picked from commit d73934372d)
2021-03-21 10:51:04 +02:00
Hagit Segev
ff585e0834 release: prepare for 4.4.0 2021-03-19 16:42:30 +02:00
Nadav Har'El
9a58deaaa2 alternator-test: increase read timeout and avoid retries
By default the boto3 library waits up to 60 seconds for a response,
and if it got no response, it sends the same request again, multiple
times. We already noticed in the past that it retries too many times,
thus slowing down failures, so in our test configuration we lowered the
number of retries to 3, but the setting of a 60-second timeout plus
3 retries still causes two problems:

  1. When the test machine and the build are extremely slow, and the
     operation is long (usually, CreateTable or DeleteTable involving
     multiple views), the 60 second timeout might not be enough.

  2. If the timeout is reached, boto3 silently retries the same operation.
     This retry may fail because the previous one really succeeded at
     least partially! The symptom is tests which report an error when
     creating a table which already exists, or deleting a table which
     doesn't exist.

The solution in this patch is first of all to never do retries - if
a query fails on internal server error, or times out, just report this
failure immediately. We don't expect to see transient errors during
local tests, so this is exactly the right behavior.
The second thing we do is to increase the default timeout. If 1 minute
was not enough, let's raise it to 5 minutes. 5 minutes should be enough
for every operation (famous last words...).
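The failure mode from problem 2 is easy to reproduce without boto3; a pure-Python sketch (FakeServer and its methods are invented for illustration): the first CreateTable succeeds server-side but the response is lost to a client timeout, so a silent retry fails with "table already exists".

```python
class FakeServer:
    def __init__(self):
        self.tables = set()

    def create_table(self, name, lose_response=False):
        if name in self.tables:
            raise RuntimeError("ResourceInUseException: table already exists")
        self.tables.add(name)  # the work is done either way
        if lose_response:
            # Client never sees the success and times out.
            raise TimeoutError("read timed out")

server = FakeServer()

# With silent retries: the first attempt times out (but succeeded),
# so the retry blows up with a confusing "already exists" error.
try:
    server.create_table("t", lose_response=True)
except TimeoutError:
    try:
        server.create_table("t")  # the silent boto3-style retry
    except RuntimeError as e:
        print(e)

# With retries disabled and a longer timeout, the timeout itself is
# reported - the honest failure this patch prefers.
```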

Even if 5 minutes is not enough for something, at least we'll now see
the timeout errors instead of some weird errors caused by retrying an
operation which was already almost done.

Fixes #8135

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210222125630.1325011-1-nyh@scylladb.com>
(cherry picked from commit 0b2cf21932)
2021-03-19 00:08:27 +02:00
Raphael S. Carvalho
ee48ed2864 LCS: reshape: tolerate more sstables in level 0 with relaxed mode
Relaxed reshape mode, used during initialization, only tolerates min_threshold
(default: 4) L0 sstables. However, relaxed mode should tolerate more sstables in
level 0, otherwise boot will have to reshape level 0 every time it crosses the
min threshold. So let's make LCS reshape tolerate the maximum of max_threshold
and 32. This change is beneficial because once the table is populated, LCS
regular compaction can decide to merge those sstables in level 0 into level 1
instead, therefore reducing WA.
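Reading "a max of max_threshold and 32" as max(max_threshold, 32), the relaxed-mode check can be sketched as follows (function and parameter names are hypothetical; the real code is C++):

```python
def l0_needs_reshape(l0_sstable_count, min_threshold=4, max_threshold=32,
                     relaxed=True):
    # Relaxed mode (boot): tolerate up to max(max_threshold, 32) in L0.
    # Strict mode: keep the old min_threshold tolerance.
    tolerance = max(max_threshold, 32) if relaxed else min_threshold
    return l0_sstable_count > tolerance

assert not l0_needs_reshape(20)            # relaxed boot leaves L0 alone
assert l0_needs_reshape(40)                # well past the relaxed tolerance
assert l0_needs_reshape(5, relaxed=False)  # strict mode keeps min_threshold
```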

Refs #8297.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210318131442.17935-1-raphaelsc@scylladb.com>
(cherry picked from commit e53cedabb1)
2021-03-18 19:19:46 +02:00
Raphael S. Carvalho
03f2eb529f compaction_manager: Fix performance of cleanup compaction due to unlimited parallelism
Prior to 463d0ab, only one table could be cleaned up at a time on a given shard.
Since then, all tables belonging to a given keyspace are cleaned up in parallel.
Cleanup serialization on each shard was enforced with a semaphore, which was
incorrectly removed by the aforementioned patch.

So the space requirement for cleanup to succeed can be up to the size of the
keyspace, increasing the chances of the node running out of space.

Node could also run out of memory if there are tons of tables in the keyspace.
Memory requirement is at least #_of_tables * 128k (not taking into account write
behind, etc). With 5k tables, it's ~0.64G per shard.

Also, all tables being cleaned up in parallel compete for the same
disk and cpu bandwidth, making them all much slower and, consequently,
the overall operation time significantly higher.

This problem was detected with cleanup, but scrub and upgrade go through the
same rewrite procedure, so they're affected by exactly the same problem.
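The restored per-shard serialization can be sketched with an asyncio analogue (the real code is C++; names here are invented): a one-unit semaphore means at most one table is rewritten at a time.

```python
import asyncio

async def rewrite_tables(tables, log):
    sem = asyncio.Semaphore(1)  # one rewrite (cleanup/scrub/upgrade) at a time
    running = 0

    async def rewrite(table):
        nonlocal running
        async with sem:
            running += 1
            assert running == 1      # never two rewrites in flight
            await asyncio.sleep(0)   # stand-in for the actual compaction
            log.append(table)
            running -= 1

    await asyncio.gather(*(rewrite(t) for t in tables))

log = []
asyncio.run(rewrite_tables(["t1", "t2", "t3"], log))
print(log)
```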

Fixes #8247.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210312162223.149993-1-raphaelsc@scylladb.com>
(cherry picked from commit 7171244844)
2021-03-18 14:28:57 +02:00
Dejan Mircevski
c270014121 cql3/expr: Handle IN ? bound to null
Previously, we crashed when the IN marker was bound to null.  Throw
invalid_request_exception instead.

This is a 4.4 backport of the #8265 fix.

Tests: unit (dev)

(cherry picked from commit 8db24fc03b)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #8307

Fixes #8265.
2021-03-18 10:35:55 +02:00
Pavel Emelyanov
35804855f9 test: Fix exit condition of row_cache_test::test_eviction_from_invalidated
The test populates the cache, then invalidates it, then tries to push
huge (10x times the segment size) chunks into seastar memory hoping that
the invalid entries will be evicted. The exit condition on the last
stage is -- total memory of the region (sum of both -- used and free)
becomes less than the size of one chunk.

However, the condition is wrong, because the cache usually contains a dummy
entry that's not necessarily on the LRU, and on some test iteration it may
happen that

  evictable size < chunk size < evictable size + dummy size

In this case test fails with bad_alloc being unable to evict the memory
from under the dummy.

fixes: #7959
tests: unit(row_cache_test), unit(the failing case with the triggering
       seed from the issue + 200 times more with random seeds)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210309134138.28099-1-xemul@scylladb.com>
(cherry picked from commit 096e452db9)
2021-03-16 23:42:11 +01:00
Piotr Sarna
a810e57684 Merge 'Alternator: support nested attribute paths...
in all expressions' from Nadav Har'El.

This series fixes #5024 - which is about adding support for nested attribute
paths (e.g., a.b.c[2]) to Alternator.  The series adds complete support for this
feature in ProjectionExpression, ConditionExpression, FilterExpression and
UpdateExpression - and also its combination with ReturnValues. Many relevant
tests - and also some new tests added in this series - now pass.

The first patch in the series fixes #8043, a bug in some error cases in
conditions, which was discovered while working on this series and is
conceptually separate from the rest of the series.

Closes #8066

* github.com:scylladb/scylla:
  alternator: correct implemention of UpdateItem with nested attributes and ReturnValues
  alternator: fix bug in ReturnValues=UPDATED_NEW
  alternator: implemented nested attribute paths in UpdateExpression
  alternator: limit the depth of nested paths
  alternator: prepare for UpdateItem nested attribute paths
  alternator: overhaul ProjectionExpression hierarchy implementation
  alternator: make parsed::path object printable
  alternator-test: a few more ProjectionExpression conflict test cases
  alternator-test: improve tests for nested attributes in UpdateExpression
  alternator: support attribute paths in ConditionExpression, FilterExpression
  alternator-test: improve tests for nested attributes in ConditionExpression
  alternator: support attribute paths in ProjectionExpression
  alternator: overhaul attrs_to_get handling
  alternator-test: additional tests for attribute paths in ProjectionExpression
  alternator-test: harden attribute-path tests for ProjectionExpression
  alternator: fix ValidationException in FilterExpression - and more

(cherry picked from commit cbbb7f08a0)
2021-03-15 18:40:12 +02:00
Nadav Har'El
7b19cc17d6 updated tools/java submodule
* tools/java 8080009794...56470fda09 (1):
  > sstableloader: Only escape column names once

Backporting fix to Refs #8229.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-03-15 16:47:47 +02:00
Benny Halevy
101e0e611b storage_service: use atomic_vector for lifecycle_subscribers
So it can be modified while it is being walked to dispatch
subscribed event notifications.

In #8143, there is a race between scylla shutdown and
notify_down(), causing use-after-free of cql_server.

Using an atomic vector instead and futurizing
unregister_subscriber allows deleting from _lifecycle_subscribers
while it is walked using atomic_vector::for_each.
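The atomic_vector idea, sketched in Python (the real class is C++; this just models for_each iterating over a snapshot, so a subscriber can unregister itself while a notification is being dispatched):

```python
class AtomicVector:
    def __init__(self):
        self._items = []

    def add(self, item):
        self._items.append(item)

    def remove(self, item):
        self._items.remove(item)

    def for_each(self, fn):
        # Iterate over a snapshot, so fn may remove items from the
        # underlying list without invalidating the iteration.
        for item in list(self._items):
            fn(item)

subs = AtomicVector()
seen = []

def on_down(sub):
    subs.remove(sub)   # unregister during dispatch: safe with a snapshot
    seen.append(sub)

subs.add("cql_server")
subs.add("view_updater")
subs.for_each(on_down)
assert seen == ["cql_server", "view_updater"]
```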

Fixes #8143

Test: unit(release)
DTest:
  update_cluster_layout_tests:TestUpdateClusterLayout.add_node_with_large_partition4_test(release)
  materialized_views_test.py:TestMaterializedViews.double_node_failure_during_mv_insert_4_nodes_test(release)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210224164647.561493-2-bhalevy@scylladb.com>
(cherry picked from commit baf5d05631)
2021-03-15 15:25:18 +02:00
Benny Halevy
ba23eb733d cql_server: event_notifier: unregister_subscriber in stop
Move unregister_subscriber from the destructor to stop
as preparation for moving storage_service lifecycle_subscribers
to atomic_vector and futurizing unregister_subscriber.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210224164647.561493-1-bhalevy@scylladb.com>
(cherry picked from commit 1ed04affab)

Ref #8143.
2021-03-15 15:25:10 +02:00
Hagit Segev
ec20ff0988 release: prepare for 4.4.rc4 2021-03-11 23:57:55 +02:00
Raphael S. Carvalho
3613b082bc compaction: Prevent cleanup and regular from compacting the same sstable
Due to a regression introduced by 463d0ab, regular compaction can compact, in
parallel, an sstable being compacted by cleanup, scrub or upgrade.

This redundancy wastes resources, increases write amplification,
prolongs the operation time, etc.

That's a potential source of data resurrection because the no-longer-owned data
from an sstable being compacted by both cleanup and regular compaction will
still exist in the node afterwards, so resurrection can happen if the node
regains ownership.

Fixes #8155.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210225172641.787022-1-raphaelsc@scylladb.com>
(cherry picked from commit 2cf0c4bbf1)

Includes fixup patch:

compaction_manager: Fix use-after-free in rewrite_sstables()

Use-after-free introduced by 2cf0c4bbf1.
That's because compacting is moved into then_wrapped() lambda, so it's
potentially freed on the next iteration of repeat().

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210309232940.433490-1-raphaelsc@scylladb.com>
(cherry picked from commit f7cc431477)
2021-03-11 08:24:01 +02:00
Asias He
b94208009f gossip: Handle timeout error in gossiper::do_shadow_round
Currently, the rpc timeout error for the GOSSIP_GET_ENDPOINT_STATES verb
is not handled in gossiper::do_shadow_round. If the
GOSSIP_GET_ENDPOINT_STATES rpc call to any of the remote nodes times
out, gossiper::do_shadow_round will throw an exception and fail the
whole boot-up process.

It is fine for some of the remote nodes to time out in the shadow round. It is
not a must to talk to all nodes.
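The tolerant shadow round can be sketched as a Python analogue (all names invented; the real code is Seastar C++): a per-node timeout is logged and skipped, and boot only fails if no node responded at all.

```python
import asyncio

async def get_endpoint_states(node):
    # Stand-in for the GOSSIP_GET_ENDPOINT_STATES rpc.
    if node == "slow-node":
        raise asyncio.TimeoutError(f"rpc call to {node} timed out")
    return {node: "NORMAL"}

async def do_shadow_round(nodes):
    states = {}
    for node in nodes:
        try:
            states.update(await get_endpoint_states(node))
        except asyncio.TimeoutError:
            # Not a must to talk to all nodes; log and move on.
            print(f"shadow round: ignoring timeout from {node}")
    if not states:
        raise RuntimeError("shadow round failed: no node responded")
    return states

states = asyncio.run(do_shadow_round(["node1", "slow-node", "node3"]))
print(states)
```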

This patch fixes an issue we saw recently in our sct tests:

```
INFO    | scylla[1579]: [shard 0] init - Shutting down gossiping
INFO    | scylla[1579]: [shard 0] gossip - gossip is already stopped
INFO    | scylla[1579]: [shard 0] init - Shutting down gossiping was successful
...

ERR     | scylla[1579]: [shard 0] init - Startup failed: seastar::rpc::timeout_error (rpc call timed out)
```

Fixes #8187

Closes #8213

(cherry picked from commit dc40184faa)
2021-03-10 16:27:47 +02:00
Nadav Har'El
05c266c02a Merge 'Fix alternator streams management regression' from Calle Wilund
Refs: #8012
Fixes: #8210

With the update to CDC generation management, the way we retrieve and process these changed.
One very bad bug slipped through though; the code for getting versioned streams did not take into
account the late-in-pr change to make clustering of CDC gen timestamps reversed. So our alternator
shard info became quite truncated, leading to more or less no data depending on when generations
changed w.r.t. the data.

Also, the way we track the above timestamps changed, so we should utilize this for our end-of-iterator check.

Closes #8209

* github.com:scylladb/scylla:
  alternator::streams: Use better method for generation timestamp
  system_distributed_keyspace: Add better routine to get latest cdc gen. timestamp
  system_distributed_keyspace: Fix cdc_get_versioned_streams timestamp range

(cherry picked from commit e12e57c915)
2021-03-10 16:27:43 +02:00
Avi Kivity
4b7319a870 Merge 'Split CDC streams table partitions into clustered rows ' from Kamil Braun
Until now, the lists of streams in the `cdc_streams_descriptions` table
for a given generation were stored in a single collection. This solution
has multiple problems when dealing with large clusters (which produce
large lists of streams):
1. large allocations
2. reactor stalls
3. mutations too large to even fit in commitlog segments

This commit changes the schema of the table as described in issue #7993.
The streams are grouped according to token ranges, each token range
being represented by a separate clustering row. Rows are inserted in
reasonably large batches for efficiency.

The table is renamed to enable easy upgrade. On upgrade, the latest CDC
generation's list of streams will be (re-)inserted into the new table.

Yet another table is added: one that contains only the generation
timestamps clustered in a single partition. This makes it easy for CDC
clients to learn about new generations. It also enables an elegant
two-phase insertion procedure of the generation description: first we
insert the streams; only after ensuring that a quorum of replicas
contains them, we insert the timestamp. Thus, if any client observes a
timestamp in the timestamps table (even using a ONE query),
it means that a quorum of replicas must contain the list of streams.

---

Nodes automatically ensure that the latest CDC generation's list of
streams is present in the streams description table. When a new
generation appears, we only need to update the table for this
generation; old generations are already inserted.

However, we've changed the description table (from
`cdc_streams_descriptions` to `cdc_streams_descriptions_v2`). The
existing mechanism only ensures that the latest generation appears in
the new description table. We add an additional procedure that
rewrites the older generations as well, if we find that it is necessary
to do so (i.e. when some CDC log tables may contain data in these
generations).

Closes #8116

* github.com:scylladb/scylla:
  tests: add a simple CDC cql pytest
  cdc: add config option to disable streams rewriting
  cdc: rewrite streams to the new description table
  cql3: query_processor: improve internal paged query API
  cdc: introduce no_generation_data_exception exception type
  docs: cdc: mention system.cdc_local table
  cdc: coroutinize do_update_streams_description
  sys_dist_ks: split CDC streams table partitions into clustered rows
  cdc: use chunked_vector for streams in streams_version
  cdc: remove `streams_version::expired` field
  system_distributed_keyspace: use mutation API to insert CDC streams
  storage_service: don't use `sys_dist_ks` before it is started

(cherry picked from commit f0950e023d)
2021-03-09 14:08:44 +02:00
Takuya ASADA
5ce71f3a29 scylla_raid_setup: don't abort using raiddev when array_state is 'clear'
On the Ubuntu 20.04 AMI, scylla_raid_setup --raiddev /dev/md0 causes a
'/dev/md0 is already using' error (issue #7627),
so we merged a patch to find a free mdX (587b909).

However, looking into /proc/mdstat on the AMI, it actually says no active md device is available:

ubuntu@ip-10-0-0-43:~$ cat /proc/mdstat
Personalities :
unused devices: <none>

We currently decide mdX is in use when os.path.exists('/sys/block/mdX/md/array_state') == True,
but according to the kernel doc, the file may be available even when the array is STOPPED:

    clear

        No devices, no size, no level
        Writing is equivalent to STOP_ARRAY ioctl
https://www.kernel.org/doc/html/v4.15/admin-guide/md.html

So we should also check array_state != 'clear', not just array_state
existence.
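The corrected check can be sketched as follows (simplified; the real logic lives in scylla_raid_setup): a device counts as in use only if array_state exists *and* reads something other than 'clear'.

```python
import os
import tempfile
from pathlib import Path

def md_in_use(array_state_path):
    # In use only if array_state exists AND is not 'clear'.
    if not os.path.exists(array_state_path):
        return False
    return Path(array_state_path).read_text().strip() != "clear"

with tempfile.TemporaryDirectory() as d:
    state = os.path.join(d, "array_state")
    results = [md_in_use(state)]        # no file: device is free
    Path(state).write_text("clear\n")
    results.append(md_in_use(state))    # present but 'clear': still free
    Path(state).write_text("active\n")
    results.append(md_in_use(state))    # genuinely in use

print(results)  # [False, False, True]
```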

Fixes #8219

Closes #8220

(cherry picked from commit 2d9feaacea)
2021-03-08 14:28:58 +02:00
Pekka Enberg
f06f4f6ee1 Update tools/jmx submodule
* tools/jmx 2c95650...c510a56 (1):
  > APIBuilder: Unlock RW-lock in remove()
2021-03-04 14:36:45 +02:00
Hagit Segev
c2d9247574 release: prepare for 4.4.rc3 2021-03-04 13:38:43 +02:00
Raphael S. Carvalho
b4e393d215 compaction: Fix leak of expired sstable in the backlog tracker
Expired sstables are skipped in the compaction setup phase, because they don't
need to be actually compacted, but rather only deleted at the end.
That causes such sstables to not be removed from the backlog tracker,
meaning that the backlog caused by expired sstables will not be removed even
after their deletion, which means shares will be higher than needed, making
compaction potentially more aggressive than it has to be.

To fix this bug, let's manually register these sstables into the monitor,
such that they'll be removed from the tracker once compaction completes.

Fixes #6054.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210216203700.189362-1-raphaelsc@scylladb.com>
(cherry picked from commit 5206a97915)
2021-03-01 14:14:33 +02:00
Avi Kivity
1a2b7037cd Update seastar submodule
* seastar 74ae29bc17...2c884a7449 (1):
  > io_queue: Fix "delay" metrics

Fixes #8166.
2021-03-01 13:57:04 +02:00
Raphael S. Carvalho
048f5efe1c sstables: Fix TWCS reshape for windows with at least min_threshold sstables
TWCS reshape was silently ignoring windows which contain at least
min_threshold sstables (which can happen with data segregation).
When resizing candidates, the size of multi_window was incorrectly used;
it was always empty in this path, which means candidates were always
cleared.

Fixes #8147.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210224125322.637128-1-raphaelsc@scylladb.com>
(cherry picked from commit 21608bd677)
2021-02-28 17:20:26 +02:00
Takuya ASADA
056293b95f dist/debian: don't run dh_installinit for scylla-node-exporter when service name == package name
dh_installinit --name <service> is for forcing the install of debian/*.service
and debian/*.default files that do not match the package name.
And if we have subpackages, the packager is responsible for renaming
debian/*.service to debian/<subpackage>.*.service.

However, we currently mistakenly run
dh_installinit --name scylla-node-exporter for
debian/scylla-node-exporter.service;
the packaging system tries to find the destination package for the .service,
does not find a subpackage name in it, and so picks the first
subpackage ordered by name, scylla-conf.

To solve the issue, we just need to run dh_installinit without --name
when $product == 'scylla'.

Fixes #8163

Closes #8164

(cherry picked from commit aabc67e386)
2021-02-28 17:20:26 +02:00
Takuya ASADA
f96ea8e011 scylla_setup: allow running scylla_setup with strict umask setting
We currently deny running scylla_setup when umask != 0022.
To remove this limitation, run os.chmod(0o644) on every file creation
to allow reading by the scylla user.

Note that perftune.yaml does not really need to be 0644 since perftune.py
runs as root, but we set it anyway to align its permissions with the other files.
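A minimal sketch of the chmod-after-create pattern under a strict umask (file name illustrative): the explicit os.chmod overrides whatever mode the umask produced at creation time.

```python
import os
import stat
import tempfile

def write_world_readable(path, data):
    with open(path, "w") as f:
        f.write(data)
    os.chmod(path, 0o644)  # override whatever the umask produced

old = os.umask(0o077)      # simulate the strict umask scylla_setup now allows
try:
    with tempfile.TemporaryDirectory() as d:
        p = os.path.join(d, "io_properties.yaml")
        write_world_readable(p, "disks: []\n")
        mode = stat.S_IMODE(os.stat(p).st_mode)
finally:
    os.umask(old)          # restore the original umask

print(oct(mode))  # 0o644 despite the 0077 umask
```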

Fixes #8049

Closes #8119

(cherry picked from commit f3a82f4685)
2021-02-26 08:49:59 +02:00
Hagit Segev
49cd0b87f0 release: prepare for 4.4.rc2 2021-02-24 19:15:29 +02:00
Asias He
0977a73ab2 messaging_service: Move gossip ack message verb to gossip group
Fix a scheduling group leak:

INFO [shard 0] gossip - gossiper::run sg=gossip
INFO [shard 0] gossip - gossiper::handle_ack_msg sg=statement
INFO [shard 0] gossip - gossiper::handle_syn_msg sg=gossip
INFO [shard 0] gossip - gossiper::handle_ack2_msg sg=gossip

After the fix:

INFO [shard 0] gossip - gossiper::run sg=gossip
INFO [shard 0] gossip - gossiper::handle_ack_msg sg=gossip
INFO [shard 0] gossip - gossiper::handle_syn_msg sg=gossip
INFO [shard 0] gossip - gossiper::handle_ack2_msg sg=gossip

Fixes #7986

Closes #8129

(cherry picked from commit 7018377bd7)
2021-02-24 14:11:16 +02:00
Pekka Enberg
9fc582ee83 Update seastar submodule
* seastar 572536ef...74ae29bc (3):
  > perftune.py: fix assignment after extend and add asserts
  > scripts/perftune.py: convert nic option in old perftune.yaml to list for compatibility
  > scripts/perftune.py: remove repeated items after merging options from file
Fixes #7968.
2021-02-23 15:18:00 +02:00
Avi Kivity
4be14c2249 Revert "repair: Make removenode safe by default"
This reverts commit 829b4c1438. It ended
up causing repair failures.

Fixes #7965.
2021-02-23 14:14:07 +02:00
Tomasz Grabiec
3160dd4b59 table: Fix schema mismatch between memtable reader and sstable writer
The schema used to create the sstable writer has to be the same as the
schema used by the reader, as the former is used to interpret mutation
fragments produced by the reader.

Commit 9124a70 introduced a deferring point between reader creation
and writer creation, which can result in a schema mismatch if there was a
concurrent alter.

This could lead the sstable writer to crash, or generate a corrupted
sstable.

Fixes #7994

Message-Id: <20210222153149.289308-1-tgrabiec@scylladb.com>
2021-02-23 13:48:33 +02:00
Avi Kivity
50a8eab1a2 Update seastar submodule
* seastar a287bb1a3...572536ef4 (1):
  > rpc: streaming sink: order outgoing messages

Fixes #7552.
2021-02-23 10:19:45 +02:00
Avi Kivity
04615436a0 Point seastar submodule at scylla-seastar.git
This allows us to backport Seastar fixes to this branch.
2021-02-23 10:17:22 +02:00
Takuya ASADA
d1ab37654e scylla_util.py: resolve /dev/root to get actual device on aws
When psutil.disk_partitions() reports that / is /dev/root, aws_instance
mistakenly reports the root partition as part of the ephemeral disks, and RAID
construction will fail.
This prevents the error and reports correct free disks.

Fixes #8055

Closes #8040

(cherry picked from commit 32d4ec6b8a)
2021-02-21 16:22:51 +02:00
Nadav Har'El
b47bdb053d alternator: fix ValidationException in FilterExpression - and more
The first condition expressions we implemented in Alternator were the old
"Expected" syntax of conditional updates. That implementation had some
specific assumptions on how it handles errors: For example, in the "LT"
operator in "Expected", the second operand is always part of the query, so
an error in it (e.g., an unsupported type) resulted it a ValidationException
error.

When we implemented ConditionExpression and FilterExpression, we wrongly
used the same functions check_compare(), check_BETWEEN(), etc., to implement
them. This results in some inaccurate error handling. The worst example is
what happens when you use a FilterExpression with an expression such as
"x < y" - this filter is supposed to silently skip items whose "x" and "y"
attributes have unsupported or different types, but in our implementation
a bad type (e.g., a list) for y resulted in a ValidationException which
aborted the entire scan! Interestingly, in one case (that of BEGINS_WITH)
we actually noticed the slightly different behavior needed and implemented
the same operator twice - with ugly code duplication. But in other operators
we missed this problem completely.

This patch first adds extensive tests of how the different expressions
(Expected, QueryFilter, FilterExpression, ConditionExpression) and the
different operators handle various input errors - unsupported types,
missing items, incompatible types, etc. Importantly, the tests demonstrate
that there is often different behavior depending on whether the bad
input comes from the query, or from the item. Some of the new tests
fail before this patch, but others pass and were useful to verify that
the patch doesn't break anything that already worked correctly previously.
As usual, all the tests pass on Cassandra.

Finally, this patch *fixes* all these problems. The comparison functions
like check_compare() and check_BETWEEN() now not only take the operands,
they also take booleans saying if each of the operands came from the
query or from an item. The old-syntax caller (Expected or QueryFilter)
always say that the first operand is from the item and the second is
from the query - but in the new-syntax caller (ConditionExpression or
FilterExpression) any or all of the operands can come from the query
and need verification.

The old duplicated code for check_BEGINS_WITH() - which had a TODO to remove
it - is finally removed. Instead, we use the same idea of passing booleans
saying if each of its operands came from an item or from the query.
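The boolean-passing scheme can be sketched in Python (all names hypothetical; the real helpers are C++): a bad type from the query is a request error, while a bad type from the item just makes the comparison false, silently skipping that item.

```python
class ValidationException(Exception):
    pass

COMPARABLE = (int, float, str, bytes)

def check_lt(lhs, rhs, lhs_from_query, rhs_from_query):
    for value, from_query in ((lhs, lhs_from_query), (rhs, rhs_from_query)):
        if not isinstance(value, COMPARABLE):
            if from_query:
                # Bad type in the *query*: reject the request.
                raise ValidationException(
                    f"unsupported type: {type(value).__name__}")
            # Bad type in the *item*: the filter just doesn't match.
            return False
    if type(lhs) is not type(rhs):
        return False      # incompatible types never match
    return lhs < rhs

# FilterExpression "x < y": both operands come from the item, so a
# list-typed y skips the item instead of aborting the whole scan.
assert check_lt(3, [1, 2], lhs_from_query=False, rhs_from_query=False) is False

# Expected/QueryFilter "LT": the second operand is part of the query,
# so the same bad type is a request error.
try:
    check_lt(3, [1, 2], lhs_from_query=False, rhs_from_query=True)
except ValidationException as e:
    print(e)
```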

Fixes #8043

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 653610f4bc)
2021-02-21 09:25:01 +02:00
Piotr Sarna
e11ae8c58f test: fix a flaky timeout test depending on TTL
One of the USING TIMEOUT tests relied on a specific TTL value,
but that's fragile if the test runs on the boundary of 2 seconds.
Instead, the test case simply checks if the TTL value is present
and is greater than 0, which makes the test robust unless its execution
lasts for more than 1 million seconds, which is highly unlikely.
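The robust assertion in miniature (function name invented; the original TTL here is assumed to be 1,000,000 seconds, per the text): accept any remaining TTL in (0, original] instead of an exact value.

```python
def ttl_is_sane(remaining_ttl, original_ttl=1_000_000):
    # Robust check: present and positive, never exceeding the original TTL.
    return 0 < remaining_ttl <= original_ttl

assert ttl_is_sane(999_998)  # a couple of seconds elapsed: still passes
assert not ttl_is_sane(0)    # expired (or TTL never set)
```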

Fixes #8062

Closes #8063

(cherry picked from commit 2aa4631148)
2021-02-14 13:08:39 +02:00
Benny Halevy
e4132edef3 stream_session: prepare: fix missing string format argument
As seen in
mv_populating_from_existing_data_during_node_decommission_test dtest:
```
ERROR 2021-02-11 06:01:32,804 [shard 0] stream_session - failed to log message: fmt::v7::format_error (argument not found)
```

Fixes #8067

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210211100158.543952-1-bhalevy@scylladb.com>
(cherry picked from commit d01e7e7b58)
2021-02-14 13:08:20 +02:00
Shlomi Livne
492f0802fb scylla_io_setup did not configure pre tuned gce instances correctly
The scylla_io_setup condition for nr_disks was using the bitwise operator
(&) instead of the logical and operator (and), causing the io_properties
files to have incorrect values.
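The bug in miniature: with non-boolean operands such as a disk count, `&` and `and` disagree.

```python
nr_disks = 2
threshold = 1

# Bitwise: 2 & 1 == 0 (no shared bits), so the condition is falsy...
assert (nr_disks & threshold) == 0

# ...while the intended logical test is truthy.
assert nr_disks and threshold

# On booleans the two happen to agree, which is why such bugs hide
# until the operands are plain integers.
assert (True & True) == (True and True)
```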

Fixes #7341

Reviewed-by: Lubos Kosco <lubos@scylladb.com>
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>

Closes #8019

(cherry picked from commit 718976e794)
2021-02-14 13:08:00 +02:00
Takuya ASADA
34f22e1df1 dist/debian: install scylla-node-exporter.service correctly
node-exporter systemd unit name is "scylla-node-exporter.service", not
"node-exporter.service".

Fixes #8054

Closes #8053

(cherry picked from commit 856fe12e13)
2021-02-14 13:07:29 +02:00
Nadav Har'El
acb921845f cql-pytest: fix flaky timeuuid_test.py
The test timeuuid_test.py::testTimeuuid sporadically failed, and it turns out
the reason was a bug in the test - which this patch fixes.

The buggy test created a timeuuid and then compared the time stored in it
to the result of the dateOf() CQL function. The problem is that dateOf()
returns a CQL "timestamp", which has millisecond resolution, while the
timeuuid *may* have finer than millisecond resolution. The reason why this
test rarely failed is that in our implementation, the timeuuid almost
always gets a millisecond-resolution timestamp. Only if now() gets called
more than once in one millisecond, does it pick a higher time incremented
by less than a millisecond.

What this patch does is to truncate the time read from the timeuuid to
millisecond resolution, and only then compare it to the result of dateOf().
We cannot hope for more.
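The truncation in miniature (values invented for illustration): drop the sub-millisecond part of the timeuuid-derived time before comparing it with dateOf()'s millisecond-resolution timestamp.

```python
def truncate_to_ms(microseconds):
    # Drop the sub-millisecond remainder.
    return microseconds - (microseconds % 1000)

timeuuid_time_us = 1_613_300_000_123_457  # sub-millisecond part: 457 us
date_of_ms_in_us = 1_613_300_000_123_000  # dateOf() has ms resolution only

assert timeuuid_time_us != date_of_ms_in_us            # raw compare is flaky
assert truncate_to_ms(timeuuid_time_us) == date_of_ms_in_us
```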

Fixes #8060

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210211165046.878371-1-nyh@scylladb.com>
(cherry picked from commit a03a8a89a9)
2021-02-14 13:06:59 +02:00
Botond Dénes
5b6c284281 query: use local limit for non-limited queries in mixed cluster
Since fea5067df we enforce a limit on the memory consumption of
otherwise non-limited queries like reverse and non-paged queries. This
limit is sent down to the replicas by the coordinator, ensuring that
each replica is working with the same limit. This however doesn't work
in a mixed cluster, when upgrading from a version which doesn't have
this series. This has been worked around by falling back to the old
max_result_size constant of 1MB in mixed clusters. This however resulted
in a regression when upgrading from a pre fea5067df to a post fea5067df
one. Pre fea5067df already had a limit for reverse queries, which was
generalized to also cover non-paged ones too by fea5067df.
The regression manifested in previously working reverse queries being
aborted. This happened because even though the user has set a generous
limit for them before the upgrade, in the mix cluster replicas fall back
to the much stricter 1MB limit temporarily ignoring the configured limit
if the coordinator is an old node. This patch solves this problem by
using the locally configured limit instead of the max_result_size
constant. This means that the user has to take extra care to configure
the same limit on all replicas, but at least they will have working
reverse queries during the upgrade.
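The fallback change can be sketched like this (illustrative names only, not the actual Scylla implementation): when the coordinator is too old to send a limit, the replica now falls back to its locally configured limit rather than the legacy 1MB constant.

```cpp
#include <cstdint>
#include <optional>

// Legacy hard-coded fallback that caused the regression.
constexpr uint64_t legacy_max_result_size = 1 * 1024 * 1024;

uint64_t effective_limit(std::optional<uint64_t> limit_from_coordinator,
                         uint64_t locally_configured_limit) {
    // New coordinators send the limit down to replicas; use it as-is.
    if (limit_from_coordinator) {
        return *limit_from_coordinator;
    }
    // Old coordinator (mixed cluster): honor the local configuration
    // instead of the strict legacy constant, so a generous user-set
    // limit keeps working during the upgrade.
    return locally_configured_limit;
}
```

Note the trade-off stated in the message: the user must configure the same limit on every replica, since each replica now consults its own configuration.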

Fixes: #8022

Tests: unit(release), manual test by user who reported the issue
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210209075947.1004164-1-bdenes@scylladb.com>
(cherry picked from commit 3d001b5587)
2021-02-09 18:06:43 +02:00
Yaron Kaikov
7d15319a8a release: prepare for 4.4
Update Docker parameters for the 4.4 release.

Closes #7932
2021-02-09 09:42:53 +02:00
Amnon Heiman
a06412fd24 API: Fix aggregation in column_family
A few methods in the column_family API were doing the aggregation wrong,
most notably bloom filter disk size.

The issue is not always visible; it happens when there are multiple
filter files per shard.
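The kind of fix involved can be sketched as (hypothetical structures, not the actual API code): a metric such as bloom filter disk size must be summed over every filter file on every shard, not taken from a single file per shard.

```cpp
#include <cstdint>
#include <vector>

struct sstable_info {
    uint64_t bloom_filter_size;  // disk size of this sstable's filter file
};

// Aggregate across all shards and all sstables within each shard.
uint64_t total_bloom_filter_size(const std::vector<std::vector<sstable_info>>& per_shard) {
    uint64_t total = 0;
    for (const auto& shard : per_shard) {
        for (const auto& sst : shard) {  // sum every file, not just one per shard
            total += sst.bloom_filter_size;
        }
    }
    return total;
}
```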

Fixes #4513

Signed-off-by: Amnon Heiman <amnon@scylladb.com>

Closes #8007

(cherry picked from commit 4498bb0a48)
2021-02-08 17:03:45 +02:00
Avi Kivity
2500dd1dc4 Merge 'dist/offline_installer/redhat: fix umask error' from Takuya ASADA
Since the makeself script changes the current umask, scylla_setup fails with a
"scylla does not work with current umask setting (0077)" error.
To fix that, we need to use the latest version of makeself and specify the
--keep-umask option.

Fixes #6243

Closes #6244

* github.com:scylladb/scylla:
  dist/offline_redhat: fix umask error
  dist/offline_installer/redhat: support cross build

(cherry picked from commit bb202db1ff)
2021-02-01 13:03:06 +02:00
Hagit Segev
fd868722dd release: prepare for 4.4.rc1 2021-01-31 14:09:44 +02:00
Pekka Enberg
f470c5d4de Update tools/python3 submodule
* tools/python3 c579207...199ac90 (1):
  > dist: debian: adjust .orig tarball name for .rc releases
2021-01-25 09:26:33 +02:00
Pekka Enberg
3677a72a21 Update tools/python3 submodule
* tools/python3 1763a1a...c579207 (1):
  > dist/debian: handle rc version correctly
2021-01-22 09:36:54 +02:00
Hagit Segev
46e6273821 release: prepare for 4.4.rc0 2021-01-18 20:29:53 +02:00
Jenkins Promoter
ce7e31013c release: prepare for 4.4 2021-01-18 15:49:55 +02:00
91 changed files with 2905 additions and 1210 deletions

.gitmodules vendored
View File

@@ -1,6 +1,6 @@
[submodule "seastar"]
path = seastar
url = ../seastar
url = ../scylla-seastar
ignore = dirty
[submodule "swagger-ui"]
path = swagger-ui

View File

@@ -498,6 +498,7 @@ set(scylla_sources
mutation_writer/multishard_writer.cc
mutation_writer/shard_based_splitting_writer.cc
mutation_writer/timestamp_based_splitting_writer.cc
mutation_writer/feed_writers.cc
partition_slice_builder.cc
partition_version.cc
querier.cc

View File

@@ -1,7 +1,7 @@
#!/bin/sh
PRODUCT=scylla
VERSION=4.4.dev
VERSION=4.4.0
if test -f version
then

View File

@@ -159,23 +159,40 @@ static bool check_NE(const rjson::value* v1, const rjson::value& v2) {
}
// Check if two JSON-encoded values match with the BEGINS_WITH relation
static bool check_BEGINS_WITH(const rjson::value* v1, const rjson::value& v2) {
// BEGINS_WITH requires that its single operand (v2) be a string or
// binary - otherwise it's a validation error. However, problems with
// the stored attribute (v1) will just return false (no match).
if (!v2.IsObject() || v2.MemberCount() != 1) {
throw api_error::validation(format("BEGINS_WITH operator encountered malformed AttributeValue: {}", v2));
}
auto it2 = v2.MemberBegin();
if (it2->name != "S" && it2->name != "B") {
throw api_error::validation(format("BEGINS_WITH operator requires String or Binary type in AttributeValue, got {}", it2->name));
}
bool check_BEGINS_WITH(const rjson::value* v1, const rjson::value& v2,
bool v1_from_query, bool v2_from_query) {
bool bad = false;
if (!v1 || !v1->IsObject() || v1->MemberCount() != 1) {
if (v1_from_query) {
throw api_error::validation("begins_with() encountered malformed argument");
} else {
bad = true;
}
} else if (v1->MemberBegin()->name != "S" && v1->MemberBegin()->name != "B") {
if (v1_from_query) {
throw api_error::validation(format("begins_with supports only string or binary type, got: {}", *v1));
} else {
bad = true;
}
}
if (!v2.IsObject() || v2.MemberCount() != 1) {
if (v2_from_query) {
throw api_error::validation("begins_with() encountered malformed argument");
} else {
bad = true;
}
} else if (v2.MemberBegin()->name != "S" && v2.MemberBegin()->name != "B") {
if (v2_from_query) {
throw api_error::validation(format("begins_with() supports only string or binary type, got: {}", v2));
} else {
bad = true;
}
}
if (bad) {
return false;
}
auto it1 = v1->MemberBegin();
auto it2 = v2.MemberBegin();
if (it1->name != it2->name) {
return false;
}
@@ -279,24 +296,38 @@ static bool check_NOT_NULL(const rjson::value* val) {
return val != nullptr;
}
// Only types S, N or B (string, number or bytes) may be compared by the
// various comparison operators - lt, le, gt, ge, and between.
static bool check_comparable_type(const rjson::value& v) {
if (!v.IsObject() || v.MemberCount() != 1) {
return false;
}
const rjson::value& type = v.MemberBegin()->name;
return type == "S" || type == "N" || type == "B";
}
// Check if two JSON-encoded values match with cmp.
template <typename Comparator>
bool check_compare(const rjson::value* v1, const rjson::value& v2, const Comparator& cmp) {
if (!v2.IsObject() || v2.MemberCount() != 1) {
throw api_error::validation(
format("{} requires a single AttributeValue of type String, Number, or Binary",
cmp.diagnostic));
bool check_compare(const rjson::value* v1, const rjson::value& v2, const Comparator& cmp,
bool v1_from_query, bool v2_from_query) {
bool bad = false;
if (!v1 || !check_comparable_type(*v1)) {
if (v1_from_query) {
throw api_error::validation(format("{} allow only the types String, Number, or Binary", cmp.diagnostic));
}
bad = true;
}
const auto& kv2 = *v2.MemberBegin();
if (kv2.name != "S" && kv2.name != "N" && kv2.name != "B") {
throw api_error::validation(
format("{} requires a single AttributeValue of type String, Number, or Binary",
cmp.diagnostic));
if (!check_comparable_type(v2)) {
if (v2_from_query) {
throw api_error::validation(format("{} allow only the types String, Number, or Binary", cmp.diagnostic));
}
bad = true;
}
if (!v1 || !v1->IsObject() || v1->MemberCount() != 1) {
if (bad) {
return false;
}
const auto& kv1 = *v1->MemberBegin();
const auto& kv2 = *v2.MemberBegin();
if (kv1.name != kv2.name) {
return false;
}
@@ -310,7 +341,8 @@ bool check_compare(const rjson::value* v1, const rjson::value& v2, const Compara
if (kv1.name == "B") {
return cmp(base64_decode(kv1.value), base64_decode(kv2.value));
}
clogger.error("check_compare panic: LHS type equals RHS type, but one is in {N,S,B} while the other isn't");
// cannot reach here, as check_comparable_type() verifies the type is one
// of the above options.
return false;
}
@@ -341,56 +373,71 @@ struct cmp_gt {
static constexpr const char* diagnostic = "GT operator";
};
// True if v is between lb and ub, inclusive. Throws if lb > ub.
// True if v is between lb and ub, inclusive. Throws or returns false
// (depending on bounds_from_query parameter) if lb > ub.
template <typename T>
static bool check_BETWEEN(const T& v, const T& lb, const T& ub) {
static bool check_BETWEEN(const T& v, const T& lb, const T& ub, bool bounds_from_query) {
if (cmp_lt()(ub, lb)) {
throw api_error::validation(
format("BETWEEN operator requires lower_bound <= upper_bound, but {} > {}", lb, ub));
if (bounds_from_query) {
throw api_error::validation(
format("BETWEEN operator requires lower_bound <= upper_bound, but {} > {}", lb, ub));
} else {
return false;
}
}
return cmp_ge()(v, lb) && cmp_le()(v, ub);
}
static bool check_BETWEEN(const rjson::value* v, const rjson::value& lb, const rjson::value& ub) {
if (!v) {
static bool check_BETWEEN(const rjson::value* v, const rjson::value& lb, const rjson::value& ub,
bool v_from_query, bool lb_from_query, bool ub_from_query) {
if ((v && v_from_query && !check_comparable_type(*v)) ||
(lb_from_query && !check_comparable_type(lb)) ||
(ub_from_query && !check_comparable_type(ub))) {
throw api_error::validation("between allow only the types String, Number, or Binary");
}
if (!v || !v->IsObject() || v->MemberCount() != 1 ||
!lb.IsObject() || lb.MemberCount() != 1 ||
!ub.IsObject() || ub.MemberCount() != 1) {
return false;
}
if (!v->IsObject() || v->MemberCount() != 1) {
throw api_error::validation(format("BETWEEN operator encountered malformed AttributeValue: {}", *v));
}
if (!lb.IsObject() || lb.MemberCount() != 1) {
throw api_error::validation(format("BETWEEN operator encountered malformed AttributeValue: {}", lb));
}
if (!ub.IsObject() || ub.MemberCount() != 1) {
throw api_error::validation(format("BETWEEN operator encountered malformed AttributeValue: {}", ub));
}
const auto& kv_v = *v->MemberBegin();
const auto& kv_lb = *lb.MemberBegin();
const auto& kv_ub = *ub.MemberBegin();
bool bounds_from_query = lb_from_query && ub_from_query;
if (kv_lb.name != kv_ub.name) {
throw api_error::validation(
if (bounds_from_query) {
throw api_error::validation(
format("BETWEEN operator requires the same type for lower and upper bound; instead got {} and {}",
kv_lb.name, kv_ub.name));
} else {
return false;
}
}
if (kv_v.name != kv_lb.name) { // Cannot compare different types, so v is NOT between lb and ub.
return false;
}
if (kv_v.name == "N") {
const char* diag = "BETWEEN operator";
return check_BETWEEN(unwrap_number(*v, diag), unwrap_number(lb, diag), unwrap_number(ub, diag));
return check_BETWEEN(unwrap_number(*v, diag), unwrap_number(lb, diag), unwrap_number(ub, diag), bounds_from_query);
}
if (kv_v.name == "S") {
return check_BETWEEN(std::string_view(kv_v.value.GetString(), kv_v.value.GetStringLength()),
std::string_view(kv_lb.value.GetString(), kv_lb.value.GetStringLength()),
std::string_view(kv_ub.value.GetString(), kv_ub.value.GetStringLength()));
std::string_view(kv_ub.value.GetString(), kv_ub.value.GetStringLength()),
bounds_from_query);
}
if (kv_v.name == "B") {
return check_BETWEEN(base64_decode(kv_v.value), base64_decode(kv_lb.value), base64_decode(kv_ub.value));
return check_BETWEEN(base64_decode(kv_v.value), base64_decode(kv_lb.value), base64_decode(kv_ub.value), bounds_from_query);
}
throw api_error::validation(
format("BETWEEN operator requires AttributeValueList elements to be of type String, Number, or Binary; instead got {}",
if (v_from_query) {
throw api_error::validation(
format("BETWEEN operator requires AttributeValueList elements to be of type String, Number, or Binary; instead got {}",
kv_lb.name));
} else {
return false;
}
}
// Verify one Expect condition on one attribute (whose content is "got")
@@ -437,19 +484,19 @@ static bool verify_expected_one(const rjson::value& condition, const rjson::valu
return check_NE(got, (*attribute_value_list)[0]);
case comparison_operator_type::LT:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_compare(got, (*attribute_value_list)[0], cmp_lt{});
return check_compare(got, (*attribute_value_list)[0], cmp_lt{}, false, true);
case comparison_operator_type::LE:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_compare(got, (*attribute_value_list)[0], cmp_le{});
return check_compare(got, (*attribute_value_list)[0], cmp_le{}, false, true);
case comparison_operator_type::GT:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_compare(got, (*attribute_value_list)[0], cmp_gt{});
return check_compare(got, (*attribute_value_list)[0], cmp_gt{}, false, true);
case comparison_operator_type::GE:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_compare(got, (*attribute_value_list)[0], cmp_ge{});
return check_compare(got, (*attribute_value_list)[0], cmp_ge{}, false, true);
case comparison_operator_type::BEGINS_WITH:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_BEGINS_WITH(got, (*attribute_value_list)[0]);
return check_BEGINS_WITH(got, (*attribute_value_list)[0], false, true);
case comparison_operator_type::IN:
verify_operand_count(attribute_value_list, nonempty(), *comparison_operator);
return check_IN(got, *attribute_value_list);
@@ -461,7 +508,8 @@ static bool verify_expected_one(const rjson::value& condition, const rjson::valu
return check_NOT_NULL(got);
case comparison_operator_type::BETWEEN:
verify_operand_count(attribute_value_list, exact_size(2), *comparison_operator);
return check_BETWEEN(got, (*attribute_value_list)[0], (*attribute_value_list)[1]);
return check_BETWEEN(got, (*attribute_value_list)[0], (*attribute_value_list)[1],
false, true, true);
case comparison_operator_type::CONTAINS:
{
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
@@ -573,7 +621,8 @@ static bool calculate_primitive_condition(const parsed::primitive_condition& con
// Shouldn't happen unless we have a bug in the parser
throw std::logic_error(format("Wrong number of values {} in BETWEEN primitive_condition", cond._values.size()));
}
return check_BETWEEN(&calculated_values[0], calculated_values[1], calculated_values[2]);
return check_BETWEEN(&calculated_values[0], calculated_values[1], calculated_values[2],
cond._values[0].is_constant(), cond._values[1].is_constant(), cond._values[2].is_constant());
case parsed::primitive_condition::type::IN:
return check_IN(calculated_values);
case parsed::primitive_condition::type::VALUE:
@@ -604,13 +653,17 @@ static bool calculate_primitive_condition(const parsed::primitive_condition& con
case parsed::primitive_condition::type::NE:
return check_NE(&calculated_values[0], calculated_values[1]);
case parsed::primitive_condition::type::GT:
return check_compare(&calculated_values[0], calculated_values[1], cmp_gt{});
return check_compare(&calculated_values[0], calculated_values[1], cmp_gt{},
cond._values[0].is_constant(), cond._values[1].is_constant());
case parsed::primitive_condition::type::GE:
return check_compare(&calculated_values[0], calculated_values[1], cmp_ge{});
return check_compare(&calculated_values[0], calculated_values[1], cmp_ge{},
cond._values[0].is_constant(), cond._values[1].is_constant());
case parsed::primitive_condition::type::LT:
return check_compare(&calculated_values[0], calculated_values[1], cmp_lt{});
return check_compare(&calculated_values[0], calculated_values[1], cmp_lt{},
cond._values[0].is_constant(), cond._values[1].is_constant());
case parsed::primitive_condition::type::LE:
return check_compare(&calculated_values[0], calculated_values[1], cmp_le{});
return check_compare(&calculated_values[0], calculated_values[1], cmp_le{},
cond._values[0].is_constant(), cond._values[1].is_constant());
default:
// Shouldn't happen unless we have a bug in the parser
throw std::logic_error(format("Unknown type {} in primitive_condition object", (int)(cond._op)));

View File

@@ -52,6 +52,7 @@ bool verify_expected(const rjson::value& req, const rjson::value* previous_item)
bool verify_condition(const rjson::value& condition, bool require_all, const rjson::value* previous_item);
bool check_CONTAINS(const rjson::value* v1, const rjson::value& v2);
bool check_BEGINS_WITH(const rjson::value* v1, const rjson::value& v2, bool v1_from_query, bool v2_from_query);
bool verify_condition_expression(
const parsed::condition_expression& condition_expression,

View File

@@ -202,7 +202,7 @@ static schema_ptr get_table(service::storage_proxy& proxy, const rjson::value& r
if (!schema) {
// if we get here then the name was missing, since syntax or missing actual CF
// checks throw. Slow path, but just call get_table_name to generate exception.
get_table_name(request);
get_table_name(request);
}
return schema;
}
@@ -1882,18 +1882,182 @@ static std::string get_item_type_string(const rjson::value& v) {
return it->name.GetString();
}
// attrs_to_get saves for each top-level attribute an attrs_to_get_node,
// a hierarchy of subparts that need to be kept. The following function
// takes a given JSON value and drops its parts which weren't asked to be
// kept. It modifies the given JSON value, or returns false to signify that
// the entire object should be dropped.
// Note that the JSON value is assumed to be encoded using the DynamoDB
// conventions - i.e., it is really a map whose key has a type string,
// and the value is the real object.
template<typename T>
static bool hierarchy_filter(rjson::value& val, const attribute_path_map_node<T>& h) {
if (!val.IsObject() || val.MemberCount() != 1) {
// This shouldn't happen. We shouldn't have stored malformed objects.
// But today Alternator does not validate the structure of nested
// documents before storing them, so this can happen on read.
throw api_error::internal(format("Malformed value object read: {}", val));
}
const char* type = val.MemberBegin()->name.GetString();
rjson::value& v = val.MemberBegin()->value;
if (h.has_members()) {
const auto& members = h.get_members();
if (type[0] != 'M' || !v.IsObject()) {
// If v is not an object (dictionary, map), none of the members
// can match.
return false;
}
rjson::value newv = rjson::empty_object();
for (auto it = v.MemberBegin(); it != v.MemberEnd(); ++it) {
std::string attr = it->name.GetString();
auto x = members.find(attr);
if (x != members.end()) {
if (x->second) {
// Only a part of this attribute is to be filtered, do it.
if (hierarchy_filter(it->value, *x->second)) {
rjson::set_with_string_name(newv, attr, std::move(it->value));
}
} else {
// The entire attribute is to be kept
rjson::set_with_string_name(newv, attr, std::move(it->value));
}
}
}
if (newv.MemberCount() == 0) {
return false;
}
v = newv;
} else if (h.has_indexes()) {
const auto& indexes = h.get_indexes();
if (type[0] != 'L' || !v.IsArray()) {
return false;
}
rjson::value newv = rjson::empty_array();
const auto& a = v.GetArray();
for (unsigned i = 0; i < v.Size(); i++) {
auto x = indexes.find(i);
if (x != indexes.end()) {
if (x->second) {
if (hierarchy_filter(a[i], *x->second)) {
rjson::push_back(newv, std::move(a[i]));
}
} else {
// The entire attribute is to be kept
rjson::push_back(newv, std::move(a[i]));
}
}
}
if (newv.Size() == 0) {
return false;
}
v = newv;
}
return true;
}
// Add a path to a attribute_path_map. Throws a validation error if the path
// "overlaps" with one already in the filter (one is a sub-path of the other)
// or "conflicts" with it (both a member and index is requested).
template<typename T>
void attribute_path_map_add(const char* source, attribute_path_map<T>& map, const parsed::path& p, T value = {}) {
using node = attribute_path_map_node<T>;
// The first step is to look for the top-level attribute (p.root()):
auto it = map.find(p.root());
if (it == map.end()) {
if (p.has_operators()) {
it = map.emplace(p.root(), node {std::nullopt}).first;
} else {
(void) map.emplace(p.root(), node {std::move(value)}).first;
// Value inserted for top-level node. We're done.
return;
}
} else if(!p.has_operators()) {
// If p is top-level and we already have it or a part of it
// in map, it's a forbidden overlapping path.
throw api_error::validation(format(
"Invalid {}: two document paths overlap at {}", source, p.root()));
} else if (it->second.has_value()) {
// If we're here, it != map.end() && p.has_operators && it->second.has_value().
// This means the top-level attribute already has a value, and we're
// trying to add a non-top-level value. It's an overlap.
throw api_error::validation(format("Invalid {}: two document paths overlap at {}", source, p.root()));
}
node* h = &it->second;
// The second step is to walk h from the top-level node to the inner node
// where we're supposed to insert the value:
for (const auto& op : p.operators()) {
std::visit(overloaded_functor {
[&] (const std::string& member) {
if (h->is_empty()) {
*h = node {typename node::members_t()};
} else if (h->has_indexes()) {
throw api_error::validation(format("Invalid {}: two document paths conflict at {}", source, p));
} else if (h->has_value()) {
throw api_error::validation(format("Invalid {}: two document paths overlap at {}", source, p));
}
typename node::members_t& members = h->get_members();
auto it = members.find(member);
if (it == members.end()) {
it = members.insert({member, make_shared<node>()}).first;
}
h = it->second.get();
},
[&] (unsigned index) {
if (h->is_empty()) {
*h = node {typename node::indexes_t()};
} else if (h->has_members()) {
throw api_error::validation(format("Invalid {}: two document paths conflict at {}", source, p));
} else if (h->has_value()) {
throw api_error::validation(format("Invalid {}: two document paths overlap at {}", source, p));
}
typename node::indexes_t& indexes = h->get_indexes();
auto it = indexes.find(index);
if (it == indexes.end()) {
it = indexes.insert({index, make_shared<node>()}).first;
}
h = it->second.get();
}
}, op);
}
// Finally, insert the value in the node h.
if (h->is_empty()) {
*h = node {std::move(value)};
} else {
throw api_error::validation(format("Invalid {}: two document paths overlap at {}", source, p));
}
}
// A very simplified version of the above function for the special case of
// adding only top-level attribute. It's not only simpler, we also use a
// different error message, referring to a "duplicate attribute" instead of
// "overlapping paths". DynamoDB also has this distinction (errors in
// AttributesToGet refer to duplicates, not overlaps, but errors in
// ProjectionExpression refer to overlap - even if it's an exact duplicate).
template<typename T>
void attribute_path_map_add(const char* source, attribute_path_map<T>& map, const std::string& attr, T value = {}) {
using node = attribute_path_map_node<T>;
auto it = map.find(attr);
if (it == map.end()) {
map.emplace(attr, node {std::move(value)});
} else {
throw api_error::validation(format(
"Invalid {}: Duplicate attribute: {}", source, attr));
}
}
// calculate_attrs_to_get() takes either AttributesToGet or
// ProjectionExpression parameters (having both is *not* allowed),
// and returns the list of cells we need to read, or an empty set when
// *all* attributes are to be returned.
// In our current implementation, only top-level attributes are stored
// as cells, and nested documents are stored serialized as JSON.
// So this function currently returns only the top-level attributes
// but we also need to add, after the query, filtering to keep only
// the parts of the JSON attributes that were chosen in the paths'
// operators. Because we don't have such filtering yet (FIXME), we fail here
// if the requested paths are anything but top-level attributes.
std::unordered_set<std::string> calculate_attrs_to_get(const rjson::value& req, std::unordered_set<std::string>& used_attribute_names) {
// However, in our current implementation, only top-level attributes are
// stored as separate cells - a nested document is stored serialized together
// (as JSON) in the same cell. So this function return a map - each key is the
// top-level attribute we will need to read, and the value for each
// top-level attribute is the partial hierarchy (struct hierarchy_filter)
// that we will need to extract from that serialized JSON.
// For example, if ProjectionExpression lists a.b and a.c[2], we
// return one top-level attribute name, "a", with the value "{b, c[2]}".
static attrs_to_get calculate_attrs_to_get(const rjson::value& req, std::unordered_set<std::string>& used_attribute_names) {
const bool has_attributes_to_get = req.HasMember("AttributesToGet");
const bool has_projection_expression = req.HasMember("ProjectionExpression");
if (has_attributes_to_get && has_projection_expression) {
@@ -1902,9 +2066,9 @@ std::unordered_set<std::string> calculate_attrs_to_get(const rjson::value& req,
}
if (has_attributes_to_get) {
const rjson::value& attributes_to_get = req["AttributesToGet"];
std::unordered_set<std::string> ret;
attrs_to_get ret;
for (auto it = attributes_to_get.Begin(); it != attributes_to_get.End(); ++it) {
ret.insert(it->GetString());
attribute_path_map_add("AttributesToGet", ret, it->GetString());
}
return ret;
} else if (has_projection_expression) {
@@ -1917,24 +2081,13 @@ std::unordered_set<std::string> calculate_attrs_to_get(const rjson::value& req,
throw api_error::validation(e.what());
}
resolve_projection_expression(paths_to_get, expression_attribute_names, used_attribute_names);
std::unordered_set<std::string> seen_column_names;
auto ret = boost::copy_range<std::unordered_set<std::string>>(paths_to_get |
boost::adaptors::transformed([&] (const parsed::path& p) {
if (p.has_operators()) {
// FIXME: this check will need to change when we support non-toplevel attributes
throw api_error::validation("Non-toplevel attributes in ProjectionExpression not yet implemented");
}
if (!seen_column_names.insert(p.root()).second) {
// FIXME: this check will need to change when we support non-toplevel attributes
throw api_error::validation(
format("Invalid ProjectionExpression: two document paths overlap with each other: {} and {}.",
p.root(), p.root()));
}
return p.root();
}));
attrs_to_get ret;
for (const parsed::path& p : paths_to_get) {
attribute_path_map_add("ProjectionExpression", ret, p);
}
return ret;
}
// An empty set asks to read everything
// An empty map asks to read everything
return {};
}
@@ -1955,7 +2108,7 @@ std::unordered_set<std::string> calculate_attrs_to_get(const rjson::value& req,
*/
void executor::describe_single_item(const cql3::selection::selection& selection,
const std::vector<bytes_opt>& result_row,
const std::unordered_set<std::string>& attrs_to_get,
const attrs_to_get& attrs_to_get,
rjson::value& item,
bool include_all_embedded_attributes)
{
@@ -1976,7 +2129,16 @@ void executor::describe_single_item(const cql3::selection::selection& selection,
std::string attr_name = value_cast<sstring>(entry.first);
if (include_all_embedded_attributes || attrs_to_get.empty() || attrs_to_get.contains(attr_name)) {
bytes value = value_cast<bytes>(entry.second);
rjson::set_with_string_name(item, attr_name, deserialize_item(value));
rjson::value v = deserialize_item(value);
auto it = attrs_to_get.find(attr_name);
if (it != attrs_to_get.end()) {
// attrs_to_get may have asked for only part of this attribute:
if (hierarchy_filter(v, it->second)) {
rjson::set_with_string_name(item, attr_name, std::move(v));
}
} else {
rjson::set_with_string_name(item, attr_name, std::move(v));
}
}
}
}
@@ -1988,7 +2150,7 @@ std::optional<rjson::value> executor::describe_single_item(schema_ptr schema,
const query::partition_slice& slice,
const cql3::selection::selection& selection,
const query::result& query_result,
const std::unordered_set<std::string>& attrs_to_get) {
const attrs_to_get& attrs_to_get) {
rjson::value item = rjson::empty_object();
cql3::selection::result_set_builder builder(selection, gc_clock::now(), cql_serialization_format::latest());
@@ -2024,8 +2186,16 @@ static bool check_needs_read_before_write(const parsed::value& v) {
}, v._value);
}
static bool check_needs_read_before_write(const parsed::update_expression& update_expression) {
return boost::algorithm::any_of(update_expression.actions(), [](const parsed::update_expression::action& action) {
static bool check_needs_read_before_write(const attribute_path_map<parsed::update_expression::action>& update_expression) {
return boost::algorithm::any_of(update_expression, [](const auto& p) {
if (!p.second.has_value()) {
// If the action is not on the top-level attribute, we need to
// read the old item: we change only a part of the top-level
// attribute, and write the full top-level attribute back.
return true;
}
// Otherwise, the action p.second.get_value() is just on top-level
// attribute. Check if it needs read-before-write:
return std::visit(overloaded_functor {
[&] (const parsed::update_expression::action::set& a) -> bool {
return check_needs_read_before_write(a._rhs._v1) || (a._rhs._op != 'v' && check_needs_read_before_write(a._rhs._v2));
@@ -2039,7 +2209,7 @@ static bool check_needs_read_before_write(const parsed::update_expression& updat
[&] (const parsed::update_expression::action::del& a) -> bool {
return true;
}
}, action._action);
}, p.second.get_value()._action);
});
}
@@ -2048,7 +2218,11 @@ public:
// Some information parsed during the constructor to check for input
// errors, and cached to be used again during apply().
rjson::value* _attribute_updates;
parsed::update_expression _update_expression;
// Instead of keeping a parsed::update_expression with an unsorted
// list of actions, we keep them in an attribute_path_map which groups
// them by top-level attribute, and detects forbidden overlaps/conflicts.
attribute_path_map<parsed::update_expression::action> _update_expression;
parsed::condition_expression _condition_expression;
update_item_operation(service::storage_proxy& proxy, rjson::value&& request);
@@ -2079,16 +2253,22 @@ update_item_operation::update_item_operation(service::storage_proxy& proxy, rjso
throw api_error::validation("UpdateExpression must be a string");
}
try {
_update_expression = parse_update_expression(update_expression->GetString());
resolve_update_expression(_update_expression,
parsed::update_expression expr = parse_update_expression(update_expression->GetString());
resolve_update_expression(expr,
expression_attribute_names, expression_attribute_values,
used_attribute_names, used_attribute_values);
if (expr.empty()) {
throw api_error::validation("Empty expression in UpdateExpression is not allowed");
}
for (auto& action : expr.actions()) {
// Unfortunately we need to copy the action's path, because
// we std::move the action object.
auto p = action._path;
attribute_path_map_add("UpdateExpression", _update_expression, p, std::move(action));
}
} catch(expressions_syntax_error& e) {
throw api_error::validation(e.what());
}
if (_update_expression.empty()) {
throw api_error::validation("Empty expression in UpdateExpression is not allowed");
}
}
_attribute_updates = rjson::find(_request, "AttributeUpdates");
if (_attribute_updates) {
@@ -2130,6 +2310,187 @@ update_item_operation::needs_read_before_write() const {
(_returnvalues != returnvalues::NONE && _returnvalues != returnvalues::UPDATED_NEW);
}
// action_result() returns the result of applying an UpdateItem action -
// this result is either a JSON object or an unset optional which indicates
// the action was a deletion. The caller (update_item_operation::apply()
// below) will either write this JSON as the content of a column, or
// use it as a piece in a bigger top-level attribute.
static std::optional<rjson::value> action_result(
const parsed::update_expression::action& action,
const rjson::value* previous_item) {
return std::visit(overloaded_functor {
[&] (const parsed::update_expression::action::set& a) -> std::optional<rjson::value> {
return calculate_value(a._rhs, previous_item);
},
[&] (const parsed::update_expression::action::remove& a) -> std::optional<rjson::value> {
return std::nullopt;
},
[&] (const parsed::update_expression::action::add& a) -> std::optional<rjson::value> {
parsed::value base;
parsed::value addition;
base.set_path(action._path);
addition.set_constant(a._valref);
rjson::value v1 = calculate_value(base, calculate_value_caller::UpdateExpression, previous_item);
rjson::value v2 = calculate_value(addition, calculate_value_caller::UpdateExpression, previous_item);
rjson::value result;
// An ADD can be used to create a new attribute (when
// v1.IsNull()) or to add to a pre-existing attribute:
if (v1.IsNull()) {
std::string v2_type = get_item_type_string(v2);
if (v2_type == "N" || v2_type == "SS" || v2_type == "NS" || v2_type == "BS") {
result = v2;
} else {
throw api_error::validation(format("An operand in the update expression has an incorrect data type: {}", v2));
}
} else {
std::string v1_type = get_item_type_string(v1);
if (v1_type == "N") {
if (get_item_type_string(v2) != "N") {
throw api_error::validation(format("Incorrect operand type for operator or function. Expected {}: {}", v1_type, rjson::print(v2)));
}
result = number_add(v1, v2);
} else if (v1_type == "SS" || v1_type == "NS" || v1_type == "BS") {
if (get_item_type_string(v2) != v1_type) {
throw api_error::validation(format("Incorrect operand type for operator or function. Expected {}: {}", v1_type, rjson::print(v2)));
}
result = set_sum(v1, v2);
} else {
throw api_error::validation(format("An operand in the update expression has an incorrect data type: {}", v1));
}
}
return result;
},
[&] (const parsed::update_expression::action::del& a) -> std::optional<rjson::value> {
parsed::value base;
parsed::value subset;
base.set_path(action._path);
subset.set_constant(a._valref);
rjson::value v1 = calculate_value(base, calculate_value_caller::UpdateExpression, previous_item);
rjson::value v2 = calculate_value(subset, calculate_value_caller::UpdateExpression, previous_item);
if (!v1.IsNull()) {
return set_diff(v1, v2);
}
// When we return nullopt here, we ask to *delete* this attribute,
// which is unnecessary because we know the attribute does not
// exist anyway. This is a waste, but a small one. Note that also
// for the "remove" action above we don't bother to check whether the
// previous_item has anything to remove.
return std::nullopt;
}
}, action._action);
}
// Print an attribute_path_map_node<action> as the list of paths it contains:
static std::ostream& operator<<(std::ostream& out, const attribute_path_map_node<parsed::update_expression::action>& h) {
if (h.has_value()) {
out << " " << h.get_value()._path;
} else if (h.has_members()) {
for (auto& member : h.get_members()) {
out << *member.second;
}
} else if (h.has_indexes()) {
for (auto& index : h.get_indexes()) {
out << *index.second;
}
}
return out;
}
// Apply the hierarchy of actions in an attribute_path_map_node<action> to a
// JSON object which uses DynamoDB's serialization conventions. The complete,
// unmodified, previous_item is also necessary for the right-hand sides of the
// actions. Modifies obj in-place or returns false if it is to be removed.
static bool hierarchy_actions(
rjson::value& obj,
const attribute_path_map_node<parsed::update_expression::action>& h,
const rjson::value* previous_item)
{
if (!obj.IsObject() || obj.MemberCount() != 1) {
// This shouldn't happen. We shouldn't have stored malformed objects.
// But today Alternator does not validate the structure of nested
// documents before storing them, so this can happen on read.
throw api_error::validation(format("Malformed value object read: {}", obj));
}
const char* type = obj.MemberBegin()->name.GetString();
rjson::value& v = obj.MemberBegin()->value;
if (h.has_value()) {
// Action replacing everything in this position in the hierarchy
std::optional<rjson::value> newv = action_result(h.get_value(), previous_item);
if (newv) {
obj = std::move(*newv);
} else {
return false;
}
} else if (h.has_members()) {
if (type[0] != 'M' || !v.IsObject()) {
// A .something on a non-map doesn't work.
throw api_error::validation(format("UpdateExpression: document paths not valid for this item:{}", h));
}
for (const auto& member : h.get_members()) {
std::string attr = member.first;
const attribute_path_map_node<parsed::update_expression::action>& subh = *member.second;
rjson::value *subobj = rjson::find(v, attr);
if (subobj) {
if (!hierarchy_actions(*subobj, subh, previous_item)) {
rjson::remove_member(v, attr);
}
} else {
// When a.b does not exist, setting a.b itself (i.e.
// subh.has_value()) is fine, but setting a.b.c is not.
if (subh.has_value()) {
std::optional<rjson::value> newv = action_result(subh.get_value(), previous_item);
if (newv) {
rjson::set_with_string_name(v, attr, std::move(*newv));
} else {
throw api_error::validation(format("Can't remove document path {} - not present in item",
subh.get_value()._path));
}
} else {
throw api_error::validation(format("UpdateExpression: document paths not valid for this item:{}", h));
}
}
}
} else if (h.has_indexes()) {
if (type[0] != 'L' || !v.IsArray()) {
// A [i] on a non-list doesn't work.
throw api_error::validation(format("UpdateExpression: document paths not valid for this item:{}", h));
}
unsigned nremoved = 0;
for (const auto& index : h.get_indexes()) {
unsigned i = index.first - nremoved;
const attribute_path_map_node<parsed::update_expression::action>& subh = *index.second;
if (i < v.Size()) {
if (!hierarchy_actions(v[i], subh, previous_item)) {
v.Erase(v.Begin() + i);
// If we have the actions "REMOVE a[1] SET a[3] = :val",
// the index 3 refers to the original indexes, before any
// items were removed. So we offset the next indexes
// (which are guaranteed to be higher than i - indexes is
// a sorted map) by an increased "nremoved".
nremoved++;
}
} else {
// If a[7] does not exist, setting a[7] itself (i.e.
// subh.has_value()) is fine - and appends an item, though
// not necessarily with index 7. But setting a[7].b will
// not work.
if (subh.has_value()) {
std::optional<rjson::value> newv = action_result(subh.get_value(), previous_item);
if (newv) {
rjson::push_back(v, std::move(*newv));
} else {
// Removing a[7] when the list has fewer elements is
// silently ignored. It's not considered an error.
}
} else {
throw api_error::validation(format("UpdateExpression: document paths not valid for this item:{}", h));
}
}
}
}
return true;
}
std::optional<mutation>
update_item_operation::apply(std::unique_ptr<rjson::value> previous_item, api::timestamp_type ts) const {
if (!verify_expected(_request, previous_item.get()) ||
@@ -2144,17 +2505,37 @@ update_item_operation::apply(std::unique_ptr<rjson::value> previous_item, api::t
auto& row = m.partition().clustered_row(*_schema, _ck);
attribute_collector attrs_collector;
bool any_updates = false;
auto do_update = [&] (bytes&& column_name, const rjson::value& json_value) {
auto do_update = [&] (bytes&& column_name, const rjson::value& json_value,
const attribute_path_map_node<parsed::update_expression::action>* h = nullptr) {
any_updates = true;
if (_returnvalues == returnvalues::ALL_NEW ||
_returnvalues == returnvalues::UPDATED_NEW) {
if (_returnvalues == returnvalues::ALL_NEW) {
rjson::set_with_string_name(_return_attributes,
to_sstring_view(column_name), rjson::copy(json_value));
to_sstring_view(column_name), rjson::copy(json_value));
} else if (_returnvalues == returnvalues::UPDATED_NEW) {
rjson::value&& v = rjson::copy(json_value);
if (h) {
// If the operation was only on specific attribute paths,
// leave only them in _return_attributes.
if (hierarchy_filter(v, *h)) {
rjson::set_with_string_name(_return_attributes,
to_sstring_view(column_name), std::move(v));
}
} else {
rjson::set_with_string_name(_return_attributes,
to_sstring_view(column_name), std::move(v));
}
} else if (_returnvalues == returnvalues::UPDATED_OLD && previous_item) {
std::string_view cn = to_sstring_view(column_name);
const rjson::value* col = rjson::find(*previous_item, cn);
if (col) {
rjson::set_with_string_name(_return_attributes, cn, rjson::copy(*col));
rjson::value&& v = rjson::copy(*col);
if (h) {
if (hierarchy_filter(v, *h)) {
rjson::set_with_string_name(_return_attributes, cn, std::move(v));
}
} else {
rjson::set_with_string_name(_return_attributes, cn, std::move(v));
}
}
}
const column_definition* cdef = _schema->get_column_definition(column_name);
@@ -2196,7 +2577,7 @@ update_item_operation::apply(std::unique_ptr<rjson::value> previous_item, api::t
// can just move previous_item later, when we don't need it any more.
if (_returnvalues == returnvalues::ALL_NEW) {
if (previous_item) {
_return_attributes = std::move(*previous_item);
_return_attributes = rjson::copy(*previous_item);
} else {
// If there is no previous item, usually a new item is created
// and contains the given key. This may be cancelled at the end
@@ -2209,88 +2590,44 @@ update_item_operation::apply(std::unique_ptr<rjson::value> previous_item, api::t
}
if (!_update_expression.empty()) {
std::unordered_set<std::string> seen_column_names;
for (auto& action : _update_expression.actions()) {
if (action._path.has_operators()) {
// FIXME: implement this case
throw api_error::validation("UpdateItem support for nested updates not yet implemented");
}
std::string column_name = action._path.root();
for (auto& actions : _update_expression) {
// The actions of _update_expression are grouped by top-level
// attributes. Here, all actions in actions.second share the same
// top-level attribute actions.first.
std::string column_name = actions.first;
const column_definition* cdef = _schema->get_column_definition(to_bytes(column_name));
if (cdef && cdef->is_primary_key()) {
throw api_error::validation(
format("UpdateItem cannot update key column {}", column_name));
throw api_error::validation(format("UpdateItem cannot update key column {}", column_name));
}
// DynamoDB forbids multiple updates in the same expression to
// modify overlapping document paths. Updates of one expression
// have the same timestamp, so it's unclear which would "win".
// FIXME: currently, without full support for document paths,
// we only check if the paths' roots are the same.
if (!seen_column_names.insert(column_name).second) {
throw api_error::validation(
format("Invalid UpdateExpression: two document paths overlap with each other: {} and {}.",
column_name, column_name));
}
std::visit(overloaded_functor {
[&] (const parsed::update_expression::action::set& a) {
auto value = calculate_value(a._rhs, previous_item.get());
do_update(to_bytes(column_name), value);
},
[&] (const parsed::update_expression::action::remove& a) {
if (actions.second.has_value()) {
// An action on a top-level attribute column_name. The single
// action is actions.second.get_value(). We can simply invoke
// the action and replace the attribute with its result:
std::optional<rjson::value> result = action_result(actions.second.get_value(), previous_item.get());
if (result) {
do_update(to_bytes(column_name), *result);
} else {
do_delete(to_bytes(column_name));
},
[&] (const parsed::update_expression::action::add& a) {
parsed::value base;
parsed::value addition;
base.set_path(action._path);
addition.set_constant(a._valref);
rjson::value v1 = calculate_value(base, calculate_value_caller::UpdateExpression, previous_item.get());
rjson::value v2 = calculate_value(addition, calculate_value_caller::UpdateExpression, previous_item.get());
rjson::value result;
// An ADD can be used to create a new attribute (when
// v1.IsNull()) or to add to a pre-existing attribute:
if (v1.IsNull()) {
std::string v2_type = get_item_type_string(v2);
if (v2_type == "N" || v2_type == "SS" || v2_type == "NS" || v2_type == "BS") {
result = v2;
} else {
throw api_error::validation(format("An operand in the update expression has an incorrect data type: {}", v2));
}
} else {
std::string v1_type = get_item_type_string(v1);
if (v1_type == "N") {
if (get_item_type_string(v2) != "N") {
throw api_error::validation(format("Incorrect operand type for operator or function. Expected {}: {}", v1_type, rjson::print(v2)));
}
result = number_add(v1, v2);
} else if (v1_type == "SS" || v1_type == "NS" || v1_type == "BS") {
if (get_item_type_string(v2) != v1_type) {
throw api_error::validation(format("Incorrect operand type for operator or function. Expected {}: {}", v1_type, rjson::print(v2)));
}
result = set_sum(v1, v2);
} else {
throw api_error::validation(format("An operand in the update expression has an incorrect data type: {}", v1));
}
}
do_update(to_bytes(column_name), result);
},
[&] (const parsed::update_expression::action::del& a) {
parsed::value base;
parsed::value subset;
base.set_path(action._path);
subset.set_constant(a._valref);
rjson::value v1 = calculate_value(base, calculate_value_caller::UpdateExpression, previous_item.get());
rjson::value v2 = calculate_value(subset, calculate_value_caller::UpdateExpression, previous_item.get());
if (!v1.IsNull()) {
std::optional<rjson::value> result = set_diff(v1, v2);
if (result) {
do_update(to_bytes(column_name), *result);
} else {
do_delete(to_bytes(column_name));
}
}
}
}, action._action);
} else {
// We have actions on a path or more than one path in the same
// top-level attribute column_name - but not on the top-level
// attribute as a whole. We already read the full top-level
// attribute (see check_needs_read_before_write()), and now we
// need to modify pieces of it and write back the entire
// top-level attribute.
if (!previous_item) {
throw api_error::validation(format("UpdateItem cannot update nested document path on non-existent item"));
}
const rjson::value *toplevel = rjson::find(*previous_item, column_name);
if (!toplevel) {
throw api_error::validation(format("UpdateItem cannot update document path: missing attribute {}",
column_name));
}
rjson::value result = rjson::copy(*toplevel);
hierarchy_actions(result, actions.second, previous_item.get());
do_update(to_bytes(column_name), std::move(result), &actions.second);
}
}
}
if (_returnvalues == returnvalues::ALL_OLD && previous_item) {
@@ -2408,7 +2745,7 @@ static rjson::value describe_item(schema_ptr schema,
const query::partition_slice& slice,
const cql3::selection::selection& selection,
const query::result& query_result,
const std::unordered_set<std::string>& attrs_to_get) {
const attrs_to_get& attrs_to_get) {
std::optional<rjson::value> opt_item = executor::describe_single_item(std::move(schema), slice, selection, std::move(query_result), attrs_to_get);
if (!opt_item) {
// If there is no matching item, we're supposed to return an empty
@@ -2480,7 +2817,7 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
struct table_requests {
schema_ptr schema;
db::consistency_level cl;
std::unordered_set<std::string> attrs_to_get;
attrs_to_get attrs_to_get;
struct single_request {
partition_key pk;
clustering_key ck;
@@ -2694,7 +3031,7 @@ void filter::for_filters_on(const noncopyable_function<void(std::string_view)>&
class describe_items_visitor {
typedef std::vector<const column_definition*> columns_t;
const columns_t& _columns;
const std::unordered_set<std::string>& _attrs_to_get;
const attrs_to_get& _attrs_to_get;
std::unordered_set<std::string> _extra_filter_attrs;
const filter& _filter;
typename columns_t::const_iterator _column_it;
@@ -2703,7 +3040,7 @@ class describe_items_visitor {
size_t _scanned_count;
public:
describe_items_visitor(const columns_t& columns, const std::unordered_set<std::string>& attrs_to_get, filter& filter)
describe_items_visitor(const columns_t& columns, const attrs_to_get& attrs_to_get, filter& filter)
: _columns(columns)
, _attrs_to_get(attrs_to_get)
, _filter(filter)
@@ -2752,6 +3089,12 @@ public:
std::string attr_name = value_cast<sstring>(entry.first);
if (_attrs_to_get.empty() || _attrs_to_get.contains(attr_name) || _extra_filter_attrs.contains(attr_name)) {
bytes value = value_cast<bytes>(entry.second);
// Even if _attrs_to_get asked to keep only a part of a
// top-level attribute, we keep the entire attribute
// at this stage, because the item filter might still
// need the other parts (it was easier for us to keep
// extra_filter_attrs at top-level granularity). We'll
// filter the unneeded parts after item filtering.
rjson::set_with_string_name(_item, attr_name, deserialize_item(value));
}
}
@@ -2762,11 +3105,24 @@ public:
void end_row() {
if (_filter.check(_item)) {
// As noted above, we kept entire top-level attributes listed in
// _attrs_to_get. We may need to only keep parts of them.
for (const auto& attr: _attrs_to_get) {
// If !attr.has_value(), it means we were asked to keep only
// parts of attr, not the entire attribute.
if (!attr.second.has_value()) {
rjson::value* toplevel = rjson::find(_item, attr.first);
if (toplevel && !hierarchy_filter(*toplevel, attr.second)) {
rjson::remove_member(_item, attr.first);
}
}
}
// Remove the extra attributes _extra_filter_attrs which we had
// to add just for the filter, and not requested to be returned:
for (const auto& attr : _extra_filter_attrs) {
rjson::remove_member(_item, attr);
}
rjson::push_back(_items, std::move(_item));
}
_item = rjson::empty_object();
@@ -2782,7 +3138,7 @@ public:
}
};
static rjson::value describe_items(schema_ptr schema, const query::partition_slice& slice, const cql3::selection::selection& selection, std::unique_ptr<cql3::result_set> result_set, std::unordered_set<std::string>&& attrs_to_get, filter&& filter) {
static rjson::value describe_items(schema_ptr schema, const query::partition_slice& slice, const cql3::selection::selection& selection, std::unique_ptr<cql3::result_set> result_set, attrs_to_get&& attrs_to_get, filter&& filter) {
describe_items_visitor visitor(selection.get_columns(), attrs_to_get, filter);
result_set->visit(visitor);
auto scanned_count = visitor.get_scanned_count();
@@ -2823,7 +3179,7 @@ static future<executor::request_return_type> do_query(service::storage_proxy& pr
const rjson::value* exclusive_start_key,
dht::partition_range_vector&& partition_ranges,
std::vector<query::clustering_range>&& ck_bounds,
std::unordered_set<std::string>&& attrs_to_get,
attrs_to_get&& attrs_to_get,
uint32_t limit,
db::consistency_level cl,
filter&& filter,

View File

@@ -70,6 +70,76 @@ public:
std::string to_json() const override;
};
namespace parsed {
class path;
};
// An attribute_path_map object is used to hold data for various attributes
// paths (parsed::path) in a hierarchy of attribute paths. Each attribute path
// has a root attribute, which is then modified by member and index operators -
// for example in "a.b[2].c" we have "a" as the root, then ".b" member, then
// "[2]" index, and finally ".c" member.
// Data can be added to an attribute_path_map using the add() function, which
// requires that attributes with data not be *overlapping* or *conflicting*:
//
// 1. Two attribute paths which are identical or an ancestor of one another
// are considered *overlapping* and not allowed. If a.b.c has data,
// we can't add more data in a.b.c or any of its descendants like a.b.c.d.
//
// 2. Two attribute paths which need the same parent to have both a member and
// an index are considered *conflicting* and not allowed. E.g., if a.b has
// data, you can't add a[1]. The meaning of adding both would be that the
// attribute a is both a map and an array, which isn't sensible.
//
// These two requirements are common to the two places where Alternator uses
// this abstraction to describe how a hierarchical item is to be transformed:
//
// 1. In ProjectExpression: for filtering from a full top-level attribute
// only the parts for which user asked in ProjectionExpression.
//
// 2. In UpdateExpression: for taking the previous value of a top-level
// attribute, and modifying it based on the instructions the user
// wrote in UpdateExpression.
template<typename T>
class attribute_path_map_node {
public:
using data_t = T;
// We need the extra shared_ptr<> here because libstdc++ unordered_map
// doesn't work with incomplete types :-( We couldn't use lw_shared_ptr<>
// because it doesn't work for incomplete types either. We couldn't use
// std::unique_ptr<> because it makes the entire object uncopyable. We
// don't often need to copy such a map, but we do have some code that
// copies an attrs_to_get object, and that code is hard to find and remove.
// The shared_ptr should never be null.
using members_t = std::unordered_map<std::string, seastar::shared_ptr<attribute_path_map_node<T>>>;
// The indexes list is sorted because DynamoDB requires handling writes
// beyond the end of a list in index order.
using indexes_t = std::map<unsigned, seastar::shared_ptr<attribute_path_map_node<T>>>;
// The prohibition on "overlap" and "conflict" explained above means
// that only one of data, members or indexes is non-empty.
std::optional<std::variant<data_t, members_t, indexes_t>> _content;
bool is_empty() const { return !_content; }
bool has_value() const { return _content && std::holds_alternative<data_t>(*_content); }
bool has_members() const { return _content && std::holds_alternative<members_t>(*_content); }
bool has_indexes() const { return _content && std::holds_alternative<indexes_t>(*_content); }
// get_members() assumes that has_members() is true
members_t& get_members() { return std::get<members_t>(*_content); }
const members_t& get_members() const { return std::get<members_t>(*_content); }
indexes_t& get_indexes() { return std::get<indexes_t>(*_content); }
const indexes_t& get_indexes() const { return std::get<indexes_t>(*_content); }
T& get_value() { return std::get<T>(*_content); }
const T& get_value() const { return std::get<T>(*_content); }
};
template<typename T>
using attribute_path_map = std::unordered_map<std::string, attribute_path_map_node<T>>;
using attrs_to_get_node = attribute_path_map_node<std::monostate>;
using attrs_to_get = attribute_path_map<std::monostate>;
class executor : public peering_sharded_service<executor> {
service::storage_proxy& _proxy;
service::migration_manager& _mm;
@@ -140,16 +210,14 @@ public:
const query::partition_slice&,
const cql3::selection::selection&,
const query::result&,
const std::unordered_set<std::string>&);
const attrs_to_get&);
static void describe_single_item(const cql3::selection::selection&,
const std::vector<bytes_opt>&,
const std::unordered_set<std::string>&,
const attrs_to_get&,
rjson::value&,
bool = false);
void add_stream_options(const rjson::value& stream_spec, schema_builder&) const;
void supplement_table_info(rjson::value& descr, const schema& schema) const;
void supplement_table_stream_info(rjson::value& descr, const schema& schema) const;

View File

@@ -130,6 +130,27 @@ void condition_expression::append(condition_expression&& a, char op) {
}, _expression);
}
void path::check_depth_limit() {
if (1 + _operators.size() > depth_limit) {
throw expressions_syntax_error(format("Document path exceeded {} nesting levels", depth_limit));
}
}
std::ostream& operator<<(std::ostream& os, const path& p) {
os << p.root();
for (const auto& op : p.operators()) {
std::visit(overloaded_functor {
[&] (const std::string& member) {
os << '.' << member;
},
[&] (unsigned index) {
os << '[' << index << ']';
}
}, op);
}
return os;
}
} // namespace parsed
// The following resolve_*() functions resolve references in parsed
@@ -151,10 +172,9 @@ void condition_expression::append(condition_expression&& a, char op) {
// we need to resolve the expression just once but then use it many times
// (once for each item to be filtered).
static void resolve_path(parsed::path& p,
static std::optional<std::string> resolve_path_component(const std::string& column_name,
const rjson::value* expression_attribute_names,
std::unordered_set<std::string>& used_attribute_names) {
const std::string& column_name = p.root();
if (column_name.size() > 0 && column_name.front() == '#') {
if (!expression_attribute_names) {
throw api_error::validation(
@@ -166,7 +186,30 @@ static void resolve_path(parsed::path& p,
format("ExpressionAttributeNames missing entry '{}' required by expression", column_name));
}
used_attribute_names.emplace(column_name);
p.set_root(std::string(rjson::to_string_view(*value)));
return std::string(rjson::to_string_view(*value));
}
return std::nullopt;
}
static void resolve_path(parsed::path& p,
const rjson::value* expression_attribute_names,
std::unordered_set<std::string>& used_attribute_names) {
std::optional<std::string> r = resolve_path_component(p.root(), expression_attribute_names, used_attribute_names);
if (r) {
p.set_root(std::move(*r));
}
for (auto& op : p.operators()) {
std::visit(overloaded_functor {
[&] (std::string& s) {
r = resolve_path_component(s, expression_attribute_names, used_attribute_names);
if (r) {
s = std::move(*r);
}
},
[&] (unsigned index) {
// nothing to resolve
}
}, op);
}
}
@@ -603,52 +646,8 @@ std::unordered_map<std::string_view, function_handler_type*> function_handlers {
}
rjson::value v1 = calculate_value(f._parameters[0], caller, previous_item);
rjson::value v2 = calculate_value(f._parameters[1], caller, previous_item);
// TODO: There's duplication here with check_BEGINS_WITH().
// But unfortunately, the two functions differ a bit.
// If one of v1 or v2 is malformed or has an unsupported type
// (not B or S), what we do depends on whether it came from
// the user's query (is_constant()), or the item. Unsupported
// values in the query result in an error, but if they are in
// the item, we silently return false (no match).
bool bad = false;
if (!v1.IsObject() || v1.MemberCount() != 1) {
bad = true;
if (f._parameters[0].is_constant()) {
throw api_error::validation(format("{}: begins_with() encountered malformed AttributeValue: {}", caller, v1));
}
} else if (v1.MemberBegin()->name != "S" && v1.MemberBegin()->name != "B") {
bad = true;
if (f._parameters[0].is_constant()) {
throw api_error::validation(format("{}: begins_with() supports only string or binary in AttributeValue: {}", caller, v1));
}
}
if (!v2.IsObject() || v2.MemberCount() != 1) {
bad = true;
if (f._parameters[1].is_constant()) {
throw api_error::validation(format("{}: begins_with() encountered malformed AttributeValue: {}", caller, v2));
}
} else if (v2.MemberBegin()->name != "S" && v2.MemberBegin()->name != "B") {
bad = true;
if (f._parameters[1].is_constant()) {
throw api_error::validation(format("{}: begins_with() supports only string or binary in AttributeValue: {}", caller, v2));
}
}
bool ret = false;
if (!bad) {
auto it1 = v1.MemberBegin();
auto it2 = v2.MemberBegin();
if (it1->name == it2->name) {
if (it2->name == "S") {
std::string_view val1 = rjson::to_string_view(it1->value);
std::string_view val2 = rjson::to_string_view(it2->value);
ret = val1.starts_with(val2);
} else /* it2->name == "B" */ {
ret = base64_begins_with(rjson::to_string_view(it1->value), rjson::to_string_view(it2->value));
}
}
}
return to_bool_json(ret);
return to_bool_json(check_BEGINS_WITH(v1.IsNull() ? nullptr : &v1, v2,
f._parameters[0].is_constant(), f._parameters[1].is_constant()));
}
},
{"contains", [] (calculate_value_caller caller, const rjson::value* previous_item, const parsed::value::function_call& f) {
@@ -667,6 +666,55 @@ std::unordered_map<std::string_view, function_handler_type*> function_handlers {
},
};
// Given a parsed::path and an item read from the table, extract the value
// of a certain attribute path, such as "a" or "a.b.c[3]". Returns a null
// value if the item or the requested attribute does not exist.
// Note that the item is assumed to be encoded in JSON using DynamoDB
// conventions - each level of a nested document is a map with one key -
// a type (e.g., "M" for map) - and its value is the representation of
// that value.
static rjson::value extract_path(const rjson::value* item,
const parsed::path& p, calculate_value_caller caller) {
if (!item) {
return rjson::null_value();
}
const rjson::value* v = rjson::find(*item, p.root());
if (!v) {
return rjson::null_value();
}
for (const auto& op : p.operators()) {
if (!v->IsObject() || v->MemberCount() != 1) {
// This shouldn't happen. We shouldn't have stored malformed
// objects. But today Alternator does not validate the structure
// of nested documents before storing them, so this can happen on
// read.
throw api_error::validation(format("{}: malformed item read: {}", caller, *item));
}
const char* type = v->MemberBegin()->name.GetString();
v = &(v->MemberBegin()->value);
std::visit(overloaded_functor {
[&] (const std::string& member) {
if (type[0] == 'M' && v->IsObject()) {
v = rjson::find(*v, member);
} else {
v = nullptr;
}
},
[&] (unsigned index) {
if (type[0] == 'L' && v->IsArray() && index < v->Size()) {
v = &(v->GetArray()[index]);
} else {
v = nullptr;
}
}
}, op);
if (!v) {
return rjson::null_value();
}
}
return rjson::copy(*v);
}
// Given a parsed::value, which can refer either to a constant value from
// ExpressionAttributeValues, to the value of some attribute, or to a function
// of other values, this function calculates the resulting value.
@@ -684,21 +732,12 @@ rjson::value calculate_value(const parsed::value& v,
auto function_it = function_handlers.find(std::string_view(f._function_name));
if (function_it == function_handlers.end()) {
throw api_error::validation(
format("UpdateExpression: unknown function '{}' called.", f._function_name));
format("{}: unknown function '{}' called.", caller, f._function_name));
}
return function_it->second(caller, previous_item, f);
},
[&] (const parsed::path& p) -> rjson::value {
if (!previous_item) {
return rjson::null_value();
}
std::string update_path = p.root();
if (p.has_operators()) {
// FIXME: support this
throw api_error::validation("Reading attribute paths not yet implemented");
}
const rjson::value* previous_value = rjson::find(*previous_item, update_path);
return previous_value ? rjson::copy(*previous_value) : rjson::null_value();
return extract_path(previous_item, p, caller);
}
}, v._value);
}

View File

@@ -49,15 +49,23 @@ class path {
// dot (e.g., ".xyz").
std::string _root;
std::vector<std::variant<std::string, unsigned>> _operators;
// It is useful to limit the depth of a user-specified path, because it
// allows us to use recursive algorithms without worrying about recursion
// depth. DynamoDB officially limits the length of paths to 32 components
// (including the root) so let's use the same limit.
static constexpr unsigned depth_limit = 32;
void check_depth_limit();
public:
void set_root(std::string root) {
_root = std::move(root);
}
void add_index(unsigned i) {
_operators.emplace_back(i);
check_depth_limit();
}
void add_dot(std::string name) {
_operators.emplace_back(std::move(name));
check_depth_limit();
}
const std::string& root() const {
return _root;
@@ -65,6 +73,13 @@ public:
bool has_operators() const {
return !_operators.empty();
}
const std::vector<std::variant<std::string, unsigned>>& operators() const {
return _operators;
}
std::vector<std::variant<std::string, unsigned>>& operators() {
return _operators;
}
friend std::ostream& operator<<(std::ostream&, const path&);
};
// When an expression is first parsed, all constants are references, like

View File

@@ -499,19 +499,11 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
// TODO: creation time
auto normal_token_owners = _proxy.get_token_metadata_ptr()->count_normal_token_owners();
// cannot really "resume" query, must iterate all data. because we cannot query neither "time" (pk) > something,
// or on expired...
// TODO: maybe add secondary index to topology table to enable this?
return _sdks.cdc_get_versioned_streams({ normal_token_owners }).then([this, &db, schema, shard_start, limit, ret = std::move(ret), stream_desc = std::move(stream_desc), ttl](std::map<db_clock::time_point, cdc::streams_version> topologies) mutable {
// filter out cdc generations older than the table or now() - cdc::ttl (typically dynamodb_streams_max_window - 24h)
auto low_ts = std::max(as_timepoint(schema->id()), db_clock::now() - ttl);
auto i = topologies.lower_bound(low_ts);
// need first gen _intersecting_ the timestamp.
if (i != topologies.begin()) {
i = std::prev(i);
}
return _sdks.cdc_get_versioned_streams(low_ts, { normal_token_owners }).then([this, &db, shard_start, limit, ret = std::move(ret), stream_desc = std::move(stream_desc)] (std::map<db_clock::time_point, cdc::streams_version> topologies) mutable {
auto e = topologies.end();
auto prev = e;
@@ -519,9 +511,7 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
std::optional<shard_id> last;
// i is now at the youngest generation we include. make a mark of it.
auto first = i;
auto i = topologies.begin();
// if we're a paged query, skip to the generation where we left off.
if (shard_start) {
i = topologies.find(shard_start->time);
@@ -547,7 +537,7 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
};
// need a prev even if we are skipping stuff
if (i != first) {
if (i != topologies.begin()) {
prev = std::prev(i);
}
@@ -855,16 +845,18 @@ future<executor::request_return_type> executor::get_records(client_state& client
static const bytes op_column_name = cdc::log_meta_column_name_bytes("operation");
static const bytes eor_column_name = cdc::log_meta_column_name_bytes("end_of_batch");
auto key_names = boost::copy_range<std::unordered_set<std::string>>(
auto key_names = boost::copy_range<attrs_to_get>(
boost::range::join(std::move(base->partition_key_columns()), std::move(base->clustering_key_columns()))
| boost::adaptors::transformed([&] (const column_definition& cdef) { return cdef.name_as_text(); })
| boost::adaptors::transformed([&] (const column_definition& cdef) {
return std::make_pair<std::string, attrs_to_get_node>(cdef.name_as_text(), {}); })
);
// Include all base table columns as values (in case pre or post is enabled).
// This will include attributes not stored in the frozen map column
auto attr_names = boost::copy_range<std::unordered_set<std::string>>(base->regular_columns()
auto attr_names = boost::copy_range<attrs_to_get>(base->regular_columns()
// this will include the :attrs column, which we will also force evaluating.
// But not having this set empty forces out any cdc columns from actual result
| boost::adaptors::transformed([] (const column_definition& cdef) { return cdef.name_as_text(); })
| boost::adaptors::transformed([] (const column_definition& cdef) {
return std::make_pair<std::string, attrs_to_get_node>(cdef.name_as_text(), {}); })
);
std::vector<const column_definition*> columns;
@@ -1028,7 +1020,9 @@ future<executor::request_return_type> executor::get_records(client_state& client
}
// ugh. figure out if we are and end-of-shard
return cdc::get_local_streams_timestamp().then([this, iter, high_ts, start_time, ret = std::move(ret)](db_clock::time_point ts) mutable {
auto normal_token_owners = _proxy.get_token_metadata_ptr()->count_normal_token_owners();
return _sdks.cdc_current_generation_timestamp({ normal_token_owners }).then([this, iter, high_ts, start_time, ret = std::move(ret)](db_clock::time_point ts) mutable {
auto& shard = iter.shard;
if (shard.time < ts && ts < high_ts) {

View File

@@ -1105,14 +1105,6 @@
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"ignore_nodes",
"description":"List of dead nodes to ingore in removenode operation",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}

View File

@@ -656,7 +656,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_bloom_filter_disk_space_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return sst->filter_size();
return s + sst->filter_size();
});
}, std::plus<uint64_t>());
});
@@ -664,7 +664,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_all_bloom_filter_disk_space_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return sst->filter_size();
return s + sst->filter_size();
});
}, std::plus<uint64_t>());
});
@@ -672,7 +672,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_bloom_filter_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return sst->filter_memory_size();
return s + sst->filter_memory_size();
});
}, std::plus<uint64_t>());
});
@@ -680,7 +680,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_all_bloom_filter_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return sst->filter_memory_size();
return s + sst->filter_memory_size();
});
}, std::plus<uint64_t>());
});
@@ -688,7 +688,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_index_summary_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return sst->get_summary().memory_footprint();
return s + sst->get_summary().memory_footprint();
});
}, std::plus<uint64_t>());
});
@@ -696,7 +696,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_all_index_summary_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return sst->get_summary().memory_footprint();
return s + sst->get_summary().memory_footprint();
});
}, std::plus<uint64_t>());
});

View File

@@ -27,7 +27,6 @@
#include <time.h>
#include <boost/range/adaptor/map.hpp>
#include <boost/range/adaptor/filtered.hpp>
#include <boost/algorithm/string/trim_all.hpp>
#include "service/storage_service.hh"
#include "service/load_meter.hh"
#include "db/commitlog/commitlog.hh"
@@ -497,22 +496,7 @@ void set_storage_service(http_context& ctx, routes& r) {
ss::remove_node.set(r, [](std::unique_ptr<request> req) {
auto host_id = req->get_query_param("host_id");
std::vector<sstring> ignore_nodes_strs= split(req->get_query_param("ignore_nodes"), ",");
auto ignore_nodes = std::list<gms::inet_address>();
for (std::string n : ignore_nodes_strs) {
try {
std::replace(n.begin(), n.end(), '\"', ' ');
std::replace(n.begin(), n.end(), '\'', ' ');
boost::trim_all(n);
if (!n.empty()) {
auto node = gms::inet_address(n);
ignore_nodes.push_back(node);
}
} catch (...) {
throw std::runtime_error(format("Failed to parse ignore_nodes parameter: ignore_nodes={}, node={}", ignore_nodes_strs, n));
}
}
return service::get_local_storage_service().removenode(host_id, std::move(ignore_nodes)).then([] {
return service::get_local_storage_service().removenode(host_id).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});

View File

@@ -22,10 +22,14 @@
#include <boost/type.hpp>
#include <random>
#include <unordered_set>
#include <algorithm>
#include <seastar/core/sleep.hh>
#include <algorithm>
#include <seastar/core/coroutine.hh>
#include "keys.hh"
#include "schema_builder.hh"
#include "database.hh"
#include "db/config.hh"
#include "db/system_keyspace.hh"
#include "db/system_distributed_keyspace.hh"
@@ -36,6 +40,7 @@
#include "gms/gossiper.hh"
#include "cdc/generation.hh"
#include "cdc/cdc_options.hh"
extern logging::logger cdc_log;
@@ -174,10 +179,29 @@ bool topology_description::operator==(const topology_description& o) const {
return _entries == o._entries;
}
const std::vector<token_range_description>& topology_description::entries() const {
const std::vector<token_range_description>& topology_description::entries() const& {
return _entries;
}
std::vector<token_range_description>&& topology_description::entries() && {
return std::move(_entries);
}
static std::vector<stream_id> create_stream_ids(
size_t index, dht::token start, dht::token end, size_t shard_count, uint8_t ignore_msb) {
std::vector<stream_id> result;
result.reserve(shard_count);
dht::sharder sharder(shard_count, ignore_msb);
for (size_t shard_idx = 0; shard_idx < shard_count; ++shard_idx) {
auto t = dht::find_first_token_for_shard(sharder, start, end, shard_idx);
// compose the id from token and the "index" of the range end owning vnode
// as defined by token sort order. Basically grouping within this
// shard set.
result.emplace_back(stream_id(t, index));
}
return result;
}
class topology_description_generator final {
const db::config& _cfg;
const std::unordered_set<dht::token>& _bootstrap_tokens;
@@ -217,18 +241,9 @@ class topology_description_generator final {
desc.token_range_end = end;
auto [shard_count, ignore_msb] = get_sharding_info(end);
desc.streams.reserve(shard_count);
desc.streams = create_stream_ids(index, start, end, shard_count, ignore_msb);
desc.sharding_ignore_msb = ignore_msb;
dht::sharder sharder(shard_count, ignore_msb);
for (size_t shard_idx = 0; shard_idx < shard_count; ++shard_idx) {
auto t = dht::find_first_token_for_shard(sharder, start, end, shard_idx);
// compose the id from token and the "index" of the range end owning vnode
// as defined by token sort order. Basically grouping within this
// shard set.
desc.streams.emplace_back(stream_id(t, index));
}
return desc;
}
public:
@@ -294,6 +309,38 @@ future<db_clock::time_point> get_local_streams_timestamp() {
});
}
// non-static for testing
size_t limit_of_streams_in_topology_description() {
// Each stream takes 16B and we don't want to exceed 4MB so we can have
// at most 262144 streams but not less than 1 per vnode.
return 4 * 1024 * 1024 / 16;
}
// non-static for testing
topology_description limit_number_of_streams_if_needed(topology_description&& desc) {
int64_t streams_count = 0;
for (auto& tr_desc : desc.entries()) {
streams_count += tr_desc.streams.size();
}
size_t limit = std::max(limit_of_streams_in_topology_description(), desc.entries().size());
if (limit >= streams_count) {
return std::move(desc);
}
size_t streams_per_vnode_limit = limit / desc.entries().size();
auto entries = std::move(desc).entries();
auto start = entries.back().token_range_end;
for (size_t idx = 0; idx < entries.size(); ++idx) {
auto end = entries[idx].token_range_end;
if (entries[idx].streams.size() > streams_per_vnode_limit) {
entries[idx].streams =
create_stream_ids(idx, start, end, streams_per_vnode_limit, entries[idx].sharding_ignore_msb);
}
start = end;
}
return topology_description(std::move(entries));
}
// Run inside seastar::async context.
db_clock::time_point make_new_cdc_generation(
const db::config& cfg,
@@ -306,6 +353,18 @@ db_clock::time_point make_new_cdc_generation(
using namespace std::chrono;
auto gen = topology_description_generator(cfg, bootstrap_tokens, tmptr, g).generate();
// If the cluster is large we may end up with a generation that contains
// large number of streams. This is problematic because we store the
// generation in a single row. For a generation with large number of rows
// this will lead to a row that can be as big as 32MB. This is much more
// than the limit imposed by commitlog_segment_size_in_mb. If the size of
// the row that describes a new generation grows above
// commitlog_segment_size_in_mb, the write will fail and the new node won't
// be able to join. To avoid such problem we make sure that such row is
// always smaller than 4MB. We do that by removing some CDC streams from
// each vnode if the total number of streams is too large.
gen = limit_number_of_streams_if_needed(std::move(gen));
// Begin the race.
auto ts = db_clock::now() + (
(!add_delay || ring_delay == milliseconds(0)) ? milliseconds(0) : (
@@ -321,31 +380,23 @@ std::optional<db_clock::time_point> get_streams_timestamp_for(const gms::inet_ad
return gms::versioned_value::cdc_streams_timestamp_from_string(streams_ts_string);
}
// Run inside seastar::async context.
static void do_update_streams_description(
static future<> do_update_streams_description(
db_clock::time_point streams_ts,
db::system_distributed_keyspace& sys_dist_ks,
db::system_distributed_keyspace::context ctx) {
if (sys_dist_ks.cdc_desc_exists(streams_ts, ctx).get0()) {
cdc_log.debug("update_streams_description: description of generation {} already inserted", streams_ts);
return;
if (co_await sys_dist_ks.cdc_desc_exists(streams_ts, ctx)) {
cdc_log.info("Generation {}: streams description table already updated.", streams_ts);
co_return;
}
// We might race with another node also inserting the description, but that's ok. It's an idempotent operation.
auto topo = sys_dist_ks.read_cdc_topology_description(streams_ts, ctx).get0();
auto topo = co_await sys_dist_ks.read_cdc_topology_description(streams_ts, ctx);
if (!topo) {
throw std::runtime_error(format("could not find streams data for timestamp {}", streams_ts));
throw no_generation_data_exception(streams_ts);
}
std::set<cdc::stream_id> streams_set;
for (auto& entry: topo->entries()) {
streams_set.insert(entry.streams.begin(), entry.streams.end());
}
std::vector<cdc::stream_id> streams_vec(streams_set.begin(), streams_set.end());
sys_dist_ks.create_cdc_desc(streams_ts, streams_vec, ctx).get();
co_await sys_dist_ks.create_cdc_desc(streams_ts, *topo, ctx);
cdc_log.info("CDC description table successfully updated with generation {}.", streams_ts);
}
@@ -355,7 +406,7 @@ void update_streams_description(
noncopyable_function<unsigned()> get_num_token_owners,
abort_source& abort_src) {
try {
do_update_streams_description(streams_ts, *sys_dist_ks, { get_num_token_owners() });
do_update_streams_description(streams_ts, *sys_dist_ks, { get_num_token_owners() }).get();
} catch(...) {
cdc_log.warn(
"Could not update CDC description table with generation {}: {}. Will retry in the background.",
@@ -368,7 +419,7 @@ void update_streams_description(
while (true) {
sleep_abortable(std::chrono::seconds(60), abort_src).get();
try {
do_update_streams_description(streams_ts, *sys_dist_ks, { get_num_token_owners() });
do_update_streams_description(streams_ts, *sys_dist_ks, { get_num_token_owners() }).get();
return;
} catch (...) {
cdc_log.warn(
@@ -380,4 +431,176 @@ void update_streams_description(
}
}
static db_clock::time_point as_timepoint(const utils::UUID& uuid) {
return db_clock::time_point{std::chrono::milliseconds(utils::UUID_gen::get_adjusted_timestamp(uuid))};
}
static future<std::vector<db_clock::time_point>> get_cdc_desc_v1_timestamps(
db::system_distributed_keyspace& sys_dist_ks,
abort_source& abort_src,
const noncopyable_function<unsigned()>& get_num_token_owners) {
while (true) {
try {
co_return co_await sys_dist_ks.get_cdc_desc_v1_timestamps({ get_num_token_owners() });
} catch (...) {
cdc_log.warn(
"Failed to retrieve generation timestamps for rewriting: {}. Retrying in 60s.",
std::current_exception());
}
co_await sleep_abortable(std::chrono::seconds(60), abort_src);
}
}
// Contains a CDC log table's creation time (extracted from its schema's id)
// and its CDC TTL setting.
struct time_and_ttl {
db_clock::time_point creation_time;
int ttl;
};
/*
* See `maybe_rewrite_streams_descriptions`.
* This is the long-running-in-the-background part of that function.
* It returns the timestamp of the last rewritten generation (if any).
*/
static future<std::optional<db_clock::time_point>> rewrite_streams_descriptions(
std::vector<time_and_ttl> times_and_ttls,
shared_ptr<db::system_distributed_keyspace> sys_dist_ks,
noncopyable_function<unsigned()> get_num_token_owners,
abort_source& abort_src) {
cdc_log.info("Retrieving generation timestamps for rewriting...");
auto tss = co_await get_cdc_desc_v1_timestamps(*sys_dist_ks, abort_src, get_num_token_owners);
cdc_log.info("Generation timestamps retrieved.");
// Find first generation timestamp such that some CDC log table may contain data before this timestamp.
// This predicate is monotonic w.r.t the timestamps.
auto now = db_clock::now();
std::sort(tss.begin(), tss.end());
auto first = std::partition_point(tss.begin(), tss.end(), [&] (db_clock::time_point ts) {
// partition_point finds first element that does *not* satisfy the predicate.
return std::none_of(times_and_ttls.begin(), times_and_ttls.end(),
[&] (const time_and_ttl& tat) {
// In this CDC log table there are no entries older than the table's creation time
// or (now - the table's ttl). We subtract 10s to account for some possible clock drift.
// If ttl is set to 0 then entries in this table never expire. In that case we look
// only at the table's creation time.
auto no_entries_older_than =
(tat.ttl == 0 ? tat.creation_time : std::max(tat.creation_time, now - std::chrono::seconds(tat.ttl)))
- std::chrono::seconds(10);
return no_entries_older_than < ts;
});
});
// Find first generation timestamp such that some CDC log table may contain data in this generation.
// This and all later generations need to be written to the new streams table.
if (first != tss.begin()) {
--first;
}
if (first == tss.end()) {
cdc_log.info("No generations to rewrite.");
co_return std::nullopt;
}
cdc_log.info("First generation to rewrite: {}", *first);
bool each_success = true;
co_await max_concurrent_for_each(first, tss.end(), 10, [&] (db_clock::time_point ts) -> future<> {
while (true) {
try {
co_return co_await do_update_streams_description(ts, *sys_dist_ks, { get_num_token_owners() });
} catch (const no_generation_data_exception& e) {
cdc_log.error("Failed to rewrite streams for generation {}: {}. Giving up.", ts, e);
each_success = false;
co_return;
} catch (...) {
cdc_log.warn("Failed to rewrite streams for generation {}: {}. Retrying in 60s.", ts, std::current_exception());
}
co_await sleep_abortable(std::chrono::seconds(60), abort_src);
}
});
if (each_success) {
cdc_log.info("Rewriting stream tables finished successfully.");
} else {
cdc_log.info("Rewriting stream tables finished, but some generations could not be rewritten (check the logs).");
}
if (first != tss.end()) {
co_return *std::prev(tss.end());
}
co_return std::nullopt;
}
future<> maybe_rewrite_streams_descriptions(
const database& db,
shared_ptr<db::system_distributed_keyspace> sys_dist_ks,
noncopyable_function<unsigned()> get_num_token_owners,
abort_source& abort_src) {
if (!db.has_schema(sys_dist_ks->NAME, sys_dist_ks->CDC_DESC_V1)) {
// This cluster never went through a Scylla version which used this table
// or the user deleted the table. Nothing to do.
co_return;
}
if (co_await db::system_keyspace::cdc_is_rewritten()) {
co_return;
}
if (db.get_config().cdc_dont_rewrite_streams()) {
cdc_log.warn("Stream rewriting disabled. Manual administrator intervention may be required...");
co_return;
}
// For each CDC log table get the TTL setting (from CDC options) and the table's creation time
std::vector<time_and_ttl> times_and_ttls;
for (auto& [_, cf] : db.get_column_families()) {
auto& s = *cf->schema();
auto base = cdc::get_base_table(db, s.ks_name(), s.cf_name());
if (!base) {
// Not a CDC log table.
continue;
}
auto& cdc_opts = base->cdc_options();
if (!cdc_opts.enabled()) {
// This table is named like a CDC log table but it's not one.
continue;
}
times_and_ttls.push_back(time_and_ttl{as_timepoint(s.id()), cdc_opts.ttl()});
}
if (times_and_ttls.empty()) {
// There's no point in rewriting old generations' streams (they don't contain any data).
cdc_log.info("No CDC log tables present, not rewriting stream tables.");
co_return co_await db::system_keyspace::cdc_set_rewritten(std::nullopt);
}
// It's safe to discard this future: the coroutine keeps system_distributed_keyspace alive
// and the abort source's lifetime extends the lifetime of any other service.
(void)(([_times_and_ttls = std::move(times_and_ttls), _sys_dist_ks = std::move(sys_dist_ks),
_get_num_token_owners = std::move(get_num_token_owners), &_abort_src = abort_src] () mutable -> future<> {
auto times_and_ttls = std::move(_times_and_ttls);
auto sys_dist_ks = std::move(_sys_dist_ks);
auto get_num_token_owners = std::move(_get_num_token_owners);
auto& abort_src = _abort_src;
// This code is racing with node startup. At this point, we're most likely still waiting for gossip to settle
// and some nodes that are UP may still be marked as DOWN by us.
// Let's sleep a bit to increase the chance that the first attempt at rewriting succeeds (it's still ok if
// it doesn't - we'll retry - but it's nice if we succeed without any warnings).
co_await sleep_abortable(std::chrono::seconds(10), abort_src);
cdc_log.info("Rewriting stream tables in the background...");
auto last_rewritten = co_await rewrite_streams_descriptions(
std::move(times_and_ttls),
std::move(sys_dist_ks),
std::move(get_num_token_owners),
abort_src);
co_await db::system_keyspace::cdc_set_rewritten(last_rewritten);
})());
}
} // namespace cdc

View File

@@ -41,6 +41,7 @@
#include "db_clock.hh"
#include "dht/token.hh"
#include "locator/token_metadata.hh"
#include "utils/chunked_vector.hh"
namespace seastar {
class abort_source;
@@ -65,6 +66,7 @@ public:
stream_id() = default;
stream_id(bytes);
stream_id(dht::token, size_t);
bool is_set() const;
bool operator==(const stream_id&) const;
@@ -78,9 +80,6 @@ public:
partition_key to_partition_key(const schema& log_schema) const;
static int64_t token_from_bytes(bytes_view);
private:
friend class topology_description_generator;
stream_id(dht::token, size_t);
};
/* Describes a mapping of tokens to CDC streams in a token range.
@@ -113,7 +112,8 @@ public:
topology_description(std::vector<token_range_description> entries);
bool operator==(const topology_description&) const;
const std::vector<token_range_description>& entries() const;
const std::vector<token_range_description>& entries() const&;
std::vector<token_range_description>&& entries() &&;
};
/**
@@ -122,14 +122,19 @@ public:
*/
class streams_version {
public:
std::vector<stream_id> streams;
utils::chunked_vector<stream_id> streams;
db_clock::time_point timestamp;
std::optional<db_clock::time_point> expired;
streams_version(std::vector<stream_id> s, db_clock::time_point ts, std::optional<db_clock::time_point> exp)
streams_version(utils::chunked_vector<stream_id> s, db_clock::time_point ts)
: streams(std::move(s))
, timestamp(ts)
, expired(std::move(exp))
{}
};
class no_generation_data_exception : public std::runtime_error {
public:
no_generation_data_exception(db_clock::time_point generation_ts)
: std::runtime_error(format("could not find generation data for timestamp {}", generation_ts))
{}
};
@@ -194,4 +199,15 @@ void update_streams_description(
noncopyable_function<unsigned()> get_num_token_owners,
abort_source&);
/* Part of the upgrade procedure. Useful in case where the version of Scylla that we're upgrading from
* used the "cdc_streams_descriptions" table. This procedure ensures that the new "cdc_streams_descriptions_v2"
* table contains streams of all generations that were present in the old table and may still contain data
* (i.e. there exist CDC log tables that may contain rows with partition keys being the stream IDs from
* these generations). */
future<> maybe_rewrite_streams_descriptions(
const database&,
shared_ptr<db::system_distributed_keyspace>,
noncopyable_function<unsigned()> get_num_token_owners,
abort_source&);
} // namespace cdc

View File

@@ -51,7 +51,8 @@ static cdc::stream_id get_stream(
return entry.streams[shard_id];
}
static cdc::stream_id get_stream(
// non-static for testing
cdc::stream_id get_stream(
const std::vector<cdc::token_range_description>& entries,
dht::token tok) {
if (entries.empty()) {

View File

@@ -278,6 +278,7 @@ modes = {
scylla_tests = set([
'test/boost/UUID_test',
'test/boost/cdc_generation_test',
'test/boost/aggregate_fcts_test',
'test/boost/allocation_strategy_test',
'test/boost/alternator_base64_test',
@@ -854,6 +855,7 @@ scylla_core = (['database.cc',
'utils/error_injection.cc',
'mutation_writer/timestamp_based_splitting_writer.cc',
'mutation_writer/shard_based_splitting_writer.cc',
'mutation_writer/feed_writers.cc',
'lua.cc',
] + [Antlr3Grammar('cql3/Cql.g')] + [Thrift('interface/cassandra.thrift', 'Cassandra')]
)

View File

@@ -29,6 +29,7 @@
#include "cql3/constants.hh"
#include "cql3/lists.hh"
#include "cql3/statements/request_validations.hh"
#include "cql3/tuples.hh"
#include "index/secondary_index_manager.hh"
#include "types/list.hh"
@@ -414,6 +415,8 @@ bool is_one_of(const column_value& col, term& rhs, const column_value_eval_bag&
} else if (auto mkr = dynamic_cast<lists::marker*>(&rhs)) {
// This is `a IN ?`. RHS elements are values representable as bytes_opt.
const auto values = static_pointer_cast<lists::value>(mkr->bind(bag.options));
statements::request_validations::check_not_null(
values, "Invalid null value for column %s", col.col->name_as_text());
return boost::algorithm::any_of(values->get_elements(), [&] (const bytes_opt& b) {
return equal(b, col, bag);
});
@@ -580,6 +583,7 @@ value_list get_IN_values(
if (val == constants::UNSET_VALUE) {
throw exceptions::invalid_request_exception(format("Invalid unset value for column {}", column_name));
}
statements::request_validations::check_not_null(val, "Invalid null value for column %s", column_name);
return to_sorted_vector(static_pointer_cast<lists::value>(val)->get_elements() | non_null | deref, comparator);
}
throw std::logic_error(format("get_IN_values(single column) on invalid term {}", *t));

View File

@@ -668,10 +668,14 @@ struct internal_query_state {
bool more_results = true;
};
::shared_ptr<internal_query_state> query_processor::create_paged_state(const sstring& query_string,
const std::initializer_list<data_value>& values, int32_t page_size) {
::shared_ptr<internal_query_state> query_processor::create_paged_state(
const sstring& query_string,
db::consistency_level cl,
const timeout_config& timeout_config,
const std::initializer_list<data_value>& values,
int32_t page_size) {
auto p = prepare_internal(query_string);
auto opts = make_internal_options(p, values, db::consistency_level::ONE, infinite_timeout_config, page_size);
auto opts = make_internal_options(p, values, cl, timeout_config, page_size);
::shared_ptr<internal_query_state> res = ::make_shared<internal_query_state>(
internal_query_state{
query_string,
@@ -935,17 +939,20 @@ bool query_processor::migration_subscriber::should_invalidate(
return statement->depends_on_keyspace(ks_name) && (!cf_name || statement->depends_on_column_family(*cf_name));
}
future<> query_processor::query(
future<> query_processor::query_internal(
const sstring& query_string,
db::consistency_level cl,
const timeout_config& timeout_config,
const std::initializer_list<data_value>& values,
int32_t page_size,
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)>&& f) {
return for_each_cql_result(create_paged_state(query_string, values), std::move(f));
return for_each_cql_result(create_paged_state(query_string, cl, timeout_config, values, page_size), std::move(f));
}
future<> query_processor::query(
future<> query_processor::query_internal(
const sstring& query_string,
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)>&& f) {
return for_each_cql_result(create_paged_state(query_string, {}), std::move(f));
return query_internal(query_string, db::consistency_level::ONE, infinite_timeout_config, {}, 1000, std::move(f));
}
}

View File

@@ -224,75 +224,52 @@ public:
/*!
* \brief iterate over all cql results using paging
*
* You Create a statement with optional paraemter and pass
* a function that goes over the results.
* You create a statement with optional parameters and pass
* a function that goes over the result rows.
*
* The passed function would be called for all the results, return stop_iteration::yes
* to stop during iteration.
* The passed function would be called for all rows; return future<stop_iteration::yes>
* to stop iteration.
*
* For example:
return query("SELECT * from system.compaction_history",
[&history] (const cql3::untyped_result_set::row& row) mutable {
....
....
return stop_iteration::no;
});
* You can use place holder in the query, the prepared statement will only be done once.
*
*
* query_string - the cql string, can contain place holder
* f - a function to be run on each of the query result, if the function return false the iteration would stop
* args - arbitrary number of query parameters
*/
template<typename... Args>
future<> query(
const sstring& query_string,
std::function<stop_iteration(const cql3::untyped_result_set_row&)>&& f,
Args&&... args) {
return for_each_cql_result(
create_paged_state(query_string, { data_value(std::forward<Args>(args))... }), std::move(f));
}
/*!
* \brief iterate over all cql results using paging
*
* You Create a statement with optional paraemter and pass
* a function that goes over the results.
*
* The passed function would be called for all the results, return future<stop_iteration::yes>
* to stop during iteration.
*
* For example:
return query("SELECT * from system.compaction_history",
[&history] (const cql3::untyped_result_set::row& row) mutable {
return query_internal(
"SELECT * from system.compaction_history",
db::consistency_level::ONE,
infinite_timeout_config,
{},
[&history] (const cql3::untyped_result_set::row& row) mutable {
....
....
return make_ready_future<stop_iteration>(stop_iteration::no);
});
* You can use place holder in the query, the prepared statement will only be done once.
* You can use placeholders in the query, the statement will only be prepared once.
*
*
* query_string - the cql string, can contain place holder
* values - query parameters value
* f - a function to be run on each of the query result, if the function return stop_iteration::no the iteration
* would stop
* query_string - the cql string, can contain placeholders
* cl - consistency level of the query
* timeout_config - timeout configuration
* values - values to be substituted for the placeholders in the query
* page_size - maximum page size
* f - a function to be run on each row of the query result,
* if the function returns stop_iteration::yes the iteration will stop
*/
future<> query(
future<> query_internal(
const sstring& query_string,
db::consistency_level cl,
const timeout_config& timeout_config,
const std::initializer_list<data_value>& values,
int32_t page_size,
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)>&& f);
/*
* \brief iterate over all cql results using paging
* An overload of the query with future function without query parameters.
* An overload of query_internal without query parameters
* using CL = ONE, no timeout, and page size = 1000.
*
* query_string - the cql string, can contain place holder
* f - a function to be run on each of the query result, if the function return stop_iteration::no the iteration
* would stop
* query_string - the cql string, can contain placeholders
* f - a function to be run on each row of the query result,
* if the function returns stop_iteration::yes the iteration will stop
*/
future<> query(
future<> query_internal(
const sstring& query_string,
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)>&& f);
@@ -354,8 +331,10 @@ private:
*/
::shared_ptr<internal_query_state> create_paged_state(
const sstring& query_string,
const std::initializer_list<data_value>& = { },
int32_t page_size = 1000);
db::consistency_level,
const timeout_config&,
const std::initializer_list<data_value>&,
int32_t page_size);
/*!
* \brief run a query using paging

View File

@@ -780,6 +780,8 @@ db::config::config(std::shared_ptr<db::extensions> exts)
"Time period in seconds after which unused schema versions will be evicted from the local schema registry cache. Default is 1 second.")
, max_concurrent_requests_per_shard(this, "max_concurrent_requests_per_shard",liveness::LiveUpdate, value_status::Used, std::numeric_limits<uint32_t>::max(),
"Maximum number of concurrent requests a single shard can handle before it starts shedding extra load. By default, no requests will be shed.")
, cdc_dont_rewrite_streams(this, "cdc_dont_rewrite_streams", value_status::Used, false,
"Disable rewriting streams from cdc_streams_descriptions to cdc_streams_descriptions_v2. Should not be necessary, but the procedure is expensive and prone to failures; this config option is left as a backdoor in case some user requires manual intervention.")
, alternator_port(this, "alternator_port", value_status::Used, 0, "Alternator API port")
, alternator_https_port(this, "alternator_https_port", value_status::Used, 0, "Alternator API HTTPS port")
, alternator_address(this, "alternator_address", value_status::Used, "0.0.0.0", "Alternator API listening address")

View File

@@ -322,6 +322,7 @@ public:
named_value<unsigned> user_defined_function_contiguous_allocation_limit_bytes;
named_value<uint32_t> schema_registry_grace_period;
named_value<uint32_t> max_concurrent_requests_per_shard;
named_value<bool> cdc_dont_rewrite_streams;
named_value<uint16_t> alternator_port;
named_value<uint16_t> alternator_https_port;


@@ -120,10 +120,9 @@ future<> manager::start(shared_ptr<service::storage_proxy> proxy_ptr, shared_ptr
future<> manager::stop() {
manager_logger.info("Asked to stop");
if (_strorage_service_anchor) {
_strorage_service_anchor->unregister_subscriber(this);
}
auto f = _strorage_service_anchor ? _strorage_service_anchor->unregister_subscriber(this) : make_ready_future<>();
return f.finally([this] {
set_stopping();
return _draining_eps_gate.close().finally([this] {
@@ -134,6 +133,7 @@ future<> manager::stop() {
manager_logger.info("Stopped");
}).discard_result();
});
});
}
future<> manager::compute_hints_dir_device_id() {


@@ -35,12 +35,14 @@
#include <seastar/core/seastar.hh>
#include <seastar/core/shared_ptr.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/core/future-util.hh>
#include <boost/range/adaptor/transformed.hpp>
#include <optional>
#include <vector>
#include <optional>
#include <set>
extern logging::logger cdc_log;
@@ -91,12 +93,31 @@ schema_ptr cdc_generations() {
/* A user-facing table providing identifiers of the streams used in CDC generations. */
schema_ptr cdc_desc() {
thread_local auto schema = [] {
auto id = generate_legacy_id(system_distributed_keyspace::NAME, system_distributed_keyspace::CDC_DESC);
return schema_builder(system_distributed_keyspace::NAME, system_distributed_keyspace::CDC_DESC, {id})
auto id = generate_legacy_id(system_distributed_keyspace::NAME, system_distributed_keyspace::CDC_DESC_V2);
return schema_builder(system_distributed_keyspace::NAME, system_distributed_keyspace::CDC_DESC_V2, {id})
/* The timestamp of this CDC generation. */
.with_column("time", timestamp_type, column_kind::partition_key)
/* The set of stream identifiers used in this CDC generation. */
/* For convenience, the list of stream IDs in this generation is split into token ranges
* which the stream IDs were mapped to (by the partitioner) when the generation was created. */
.with_column("range_end", long_type, column_kind::clustering_key)
/* The set of stream identifiers used in this CDC generation for the token range
* ending on `range_end`. */
.with_column("streams", cdc_streams_set_type)
.with_version(system_keyspace::generate_schema_version(id))
.build();
}();
return schema;
}
/* A user-facing table providing CDC generation timestamps. */
schema_ptr cdc_timestamps() {
thread_local auto schema = [] {
auto id = generate_legacy_id(system_distributed_keyspace::NAME, system_distributed_keyspace::CDC_TIMESTAMPS);
return schema_builder(system_distributed_keyspace::NAME, system_distributed_keyspace::CDC_TIMESTAMPS, {id})
/* This is a single-partition table. The partition key is always "timestamps". */
.with_column("key", utf8_type, column_kind::partition_key)
/* The timestamp of this CDC generation. */
.with_column("time", reversed_type_impl::get_instance(timestamp_type), column_kind::clustering_key)
/* Expiration time of this CDC generation (or null if not expired). */
.with_column("expired", timestamp_type)
.with_version(system_keyspace::generate_schema_version(id))
@@ -105,11 +126,14 @@ schema_ptr cdc_desc() {
return schema;
}
static const sstring CDC_TIMESTAMPS_KEY = "timestamps";
static std::vector<schema_ptr> all_tables() {
return {
view_build_status(),
cdc_generations(),
cdc_desc(),
cdc_timestamps(),
};
}
@@ -117,13 +141,15 @@ bool system_distributed_keyspace::is_extra_durable(const sstring& cf_name) {
return cf_name == CDC_TOPOLOGY_DESCRIPTION;
}
system_distributed_keyspace::system_distributed_keyspace(cql3::query_processor& qp, service::migration_manager& mm)
system_distributed_keyspace::system_distributed_keyspace(cql3::query_processor& qp, service::migration_manager& mm, service::storage_proxy& sp)
: _qp(qp)
, _mm(mm) {
, _mm(mm)
, _sp(sp) {
}
future<> system_distributed_keyspace::start() {
if (this_shard_id() != 0) {
_started = true;
return make_ready_future<>();
}
@@ -148,18 +174,18 @@ future<> system_distributed_keyspace::start() {
});
});
});
});
}).then([this] { _started = true; });
}
future<> system_distributed_keyspace::stop() {
return make_ready_future<>();
}
static const timeout_config internal_distributed_timeout_config = [] {
using namespace std::chrono_literals;
const auto t = 10s;
static timeout_config get_timeout_config(db::timeout_clock::duration t) {
return timeout_config{ t, t, t, t, t, t, t };
}();
}
static const timeout_config internal_distributed_timeout_config = get_timeout_config(std::chrono::seconds(10));
future<std::unordered_map<utils::UUID, sstring>> system_distributed_keyspace::view_status(sstring ks_name, sstring view_name) const {
return _qp.execute_internal(
@@ -326,24 +352,69 @@ system_distributed_keyspace::expire_cdc_topology_description(
false).discard_result();
}
static set_type_impl::native_type prepare_cdc_streams(const std::vector<cdc::stream_id>& streams) {
set_type_impl::native_type ret;
for (auto& s: streams) {
ret.push_back(data_value(s.to_bytes()));
static future<std::vector<mutation>> get_cdc_streams_descriptions_v2_mutation(
const database& db,
db_clock::time_point time,
const cdc::topology_description& desc) {
auto s = db.find_schema(system_distributed_keyspace::NAME, system_distributed_keyspace::CDC_DESC_V2);
auto ts = api::new_timestamp();
std::vector<mutation> res;
res.emplace_back(s, partition_key::from_singular(*s, time));
size_t size_estimate = 0;
for (auto& e : desc.entries()) {
// We want to keep each mutation below ~1 MB.
if (size_estimate >= 1000 * 1000) {
res.emplace_back(s, partition_key::from_singular(*s, time));
size_estimate = 0;
}
set_type_impl::native_type streams;
streams.reserve(e.streams.size());
for (auto& stream : e.streams) {
streams.push_back(data_value(stream.to_bytes()));
}
// We estimate 20 bytes per stream ID.
// Stream IDs themselves weigh 16 bytes each (2 * sizeof(int64_t))
// but there's metadata to be taken into account.
// It has been verified experimentally that 20 bytes per stream ID is a good estimate.
size_estimate += e.streams.size() * 20;
res.back().set_cell(clustering_key::from_singular(*s, dht::token::to_int64(e.token_range_end)),
to_bytes("streams"), make_set_value(cdc_streams_set_type, std::move(streams)), ts);
co_await make_ready_future<>(); // maybe yield
}
return ret;
co_return res;
}
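The chunking strategy above is compact enough to model outside of Scylla. This is a hypothetical standalone sketch (the ~1 MB target and 20-bytes-per-stream estimate come from the diff; the function and data shapes are invented for illustration):

```python
# Hypothetical model of get_cdc_streams_descriptions_v2_mutation's chunking:
# group (token_range_end, streams) entries so each group's estimated
# serialized size stays below ~1 MB, at ~20 bytes per stream ID.
BYTES_PER_STREAM = 20               # 16-byte stream ID plus metadata overhead
TARGET_MUTATION_SIZE = 1000 * 1000  # keep each mutation below ~1 MB

def chunk_entries(entries):
    """entries: iterable of (range_end, [stream_id, ...]) pairs.
    Returns a list of chunks; each chunk would become one mutation."""
    chunks = [[]]
    size_estimate = 0
    for range_end, streams in entries:
        if size_estimate >= TARGET_MUTATION_SIZE:
            chunks.append([])       # start a new mutation, like res.emplace_back
            size_estimate = 0
        chunks[-1].append((range_end, streams))
        size_estimate += len(streams) * BYTES_PER_STREAM
    return chunks
```

Each resulting chunk then stays comfortably under the 16 MB commitlog segment limit that motivated the change.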
future<>
system_distributed_keyspace::create_cdc_desc(
db_clock::time_point time,
const std::vector<cdc::stream_id>& streams,
const cdc::topology_description& desc,
context ctx) {
return _qp.execute_internal(
format("INSERT INTO {}.{} (time, streams) VALUES (?,?)", NAME, CDC_DESC),
using namespace std::chrono_literals;
auto ms = co_await get_cdc_streams_descriptions_v2_mutation(_qp.db(), time, desc);
co_await max_concurrent_for_each(ms, 20, [&] (mutation& m) -> future<> {
// We use the storage_proxy::mutate API since CQL is not the best for handling large batches.
co_await _sp.mutate(
{ std::move(m) },
quorum_if_many(ctx.num_token_owners),
db::timeout_clock::now() + 10s,
nullptr, // trace_state
empty_service_permit(),
false // raw_counters
);
});
// Commit the description.
co_await _qp.execute_internal(
format("INSERT INTO {}.{} (key, time) VALUES (?, ?)", NAME, CDC_TIMESTAMPS),
quorum_if_many(ctx.num_token_owners),
internal_distributed_timeout_config,
{ time, make_set_value(cdc_streams_set_type, prepare_cdc_streams(streams)) },
{ CDC_TIMESTAMPS_KEY, time },
false).discard_result();
}
@@ -353,7 +424,7 @@ system_distributed_keyspace::expire_cdc_desc(
db_clock::time_point expiration_time,
context ctx) {
return _qp.execute_internal(
format("UPDATE {}.{} SET expired = ? WHERE time = ?", NAME, CDC_DESC),
format("UPDATE {}.{} SET expired = ? WHERE time = ?", NAME, CDC_TIMESTAMPS),
quorum_if_many(ctx.num_token_owners),
internal_distributed_timeout_config,
{ expiration_time, streams_ts },
@@ -364,11 +435,44 @@ future<bool>
system_distributed_keyspace::cdc_desc_exists(
db_clock::time_point streams_ts,
context ctx) {
return _qp.execute_internal(
format("SELECT time FROM {}.{} WHERE time = ?", NAME, CDC_DESC),
// Reading from this table on a freshly upgraded node that is the first to announce the CDC_TIMESTAMPS
// schema would most likely result in replicas refusing to return data, telling the node that they can't
// find the schema. Indeed, it takes some time for the nodes to synchronize their schema; schema is
// only eventually consistent.
//
// This problem doesn't occur on writes since writes enforce schema pull if the receiving replica
// notices that the write comes from an unknown schema, but it does occur on reads.
//
// Hence we work around it with a hack: we send a mutation with an empty partition to force our replicas
// to pull the schema.
//
// This is not strictly necessary; the code that calls this function does it in a retry loop
// so eventually, after the schema gets pulled, the read would succeed.
// Still, the errors are also unnecessary and if we can get rid of them - let's do it.
//
// FIXME: find a more elegant way to deal with this ``problem''.
if (!_forced_cdc_timestamps_schema_sync) {
using namespace std::chrono_literals;
auto s = _qp.db().find_schema(system_distributed_keyspace::NAME, system_distributed_keyspace::CDC_TIMESTAMPS);
mutation m(s, partition_key::from_singular(*s, CDC_TIMESTAMPS_KEY));
co_await _sp.mutate(
{ std::move(m) },
quorum_if_many(ctx.num_token_owners),
db::timeout_clock::now() + 10s,
nullptr, // trace_state
empty_service_permit(),
false // raw_counters
);
_forced_cdc_timestamps_schema_sync = true;
}
// At this point replicas know the schema, we can perform the actual read...
co_return co_await _qp.execute_internal(
format("SELECT time FROM {}.{} WHERE key = ? AND time = ?", NAME, CDC_TIMESTAMPS),
quorum_if_many(ctx.num_token_owners),
internal_distributed_timeout_config,
{ streams_ts },
{ CDC_TIMESTAMPS_KEY, streams_ts },
false
).then([] (::shared_ptr<cql3::untyped_result_set> cql_result) -> bool {
return !cql_result->empty() && cql_result->one().has("time");
@@ -376,27 +480,76 @@ system_distributed_keyspace::cdc_desc_exists(
}
future<std::map<db_clock::time_point, cdc::streams_version>>
system_distributed_keyspace::cdc_get_versioned_streams(context ctx) {
return _qp.execute_internal(
format("SELECT * FROM {}.{}", NAME, CDC_DESC),
system_distributed_keyspace::cdc_get_versioned_streams(db_clock::time_point not_older_than, context ctx) {
auto timestamps_cql = co_await _qp.execute_internal(
format("SELECT time FROM {}.{} WHERE key = ?", NAME, CDC_TIMESTAMPS),
quorum_if_many(ctx.num_token_owners),
internal_distributed_timeout_config,
{},
false
).then([] (::shared_ptr<cql3::untyped_result_set> cql_result) {
std::map<db_clock::time_point, cdc::streams_version> result;
{ CDC_TIMESTAMPS_KEY },
false);
for (auto& row : *cql_result) {
auto ts = row.get_as<db_clock::time_point>("time");
auto exp = row.get_opt<db_clock::time_point>("expired");
std::vector<cdc::stream_id> ids;
row.get_list_data<bytes>("streams", std::back_inserter(ids));
result.emplace(ts, cdc::streams_version(std::move(ids), ts, exp));
std::vector<db_clock::time_point> timestamps;
timestamps.reserve(timestamps_cql->size());
for (auto& row : *timestamps_cql) {
timestamps.push_back(row.get_as<db_clock::time_point>("time"));
}
// `time` is the table's clustering key, so the results are already sorted
auto first = std::lower_bound(timestamps.rbegin(), timestamps.rend(), not_older_than);
// need first gen _intersecting_ the timestamp.
if (first != timestamps.rbegin()) {
--first;
}
std::map<db_clock::time_point, cdc::streams_version> result;
co_await max_concurrent_for_each(first, timestamps.rend(), 5, [this, &ctx, &result] (db_clock::time_point ts) -> future<> {
auto streams_cql = co_await _qp.execute_internal(
format("SELECT streams FROM {}.{} WHERE time = ?", NAME, CDC_DESC_V2),
quorum_if_many(ctx.num_token_owners),
internal_distributed_timeout_config,
{ ts },
false);
utils::chunked_vector<cdc::stream_id> ids;
for (auto& row : *streams_cql) {
row.get_list_data<bytes>("streams", std::back_inserter(ids));
co_await make_ready_future<>(); // maybe yield
}
return result;
result.emplace(ts, cdc::streams_version{std::move(ids), ts});
});
co_return result;
}
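The reversed `lower_bound` above is subtle: the timestamps come back newest-first (reversed clustering key), and the generation that started just *before* `not_older_than` still intersects the queried interval, hence the step back by one. A hypothetical Python model of the same selection (list-based, names invented):

```python
import bisect

def generations_intersecting(timestamps_desc, not_older_than):
    """timestamps_desc: generation timestamps as read from the table,
    newest first (its clustering key is a reversed timestamp).
    Returns timestamps of generations intersecting [not_older_than, +inf),
    oldest first."""
    asc = list(reversed(timestamps_desc))            # ascending, like rbegin..rend
    first = bisect.bisect_left(asc, not_older_than)  # first ts >= not_older_than
    # The generation that started just before not_older_than is still the
    # operating one at that instant, so include it as well.
    if first > 0:
        first -= 1
    return asc[first:]
```

For timestamps 30, 20, 10 and `not_older_than = 25`, generation 20 is still included because it is the one operating at time 25.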
future<db_clock::time_point>
system_distributed_keyspace::cdc_current_generation_timestamp(context ctx) {
auto timestamp_cql = co_await _qp.execute_internal(
format("SELECT time FROM {}.{} WHERE key = ? limit 1", NAME, CDC_TIMESTAMPS),
quorum_if_many(ctx.num_token_owners),
internal_distributed_timeout_config,
{ CDC_TIMESTAMPS_KEY },
false);
co_return timestamp_cql->one().get_as<db_clock::time_point>("time");
}
future<std::vector<db_clock::time_point>>
system_distributed_keyspace::get_cdc_desc_v1_timestamps(context ctx) {
std::vector<db_clock::time_point> res;
co_await _qp.query_internal(
format("SELECT time FROM {}.{}", NAME, CDC_DESC_V1),
quorum_if_many(ctx.num_token_owners),
// This is a long and expensive scan (mostly due to #8061).
// Give it a bit more time than usual.
get_timeout_config(std::chrono::seconds(60)),
{},
1000,
[&] (const cql3::untyped_result_set_row& r) {
res.push_back(r.get_as<db_clock::time_point>("time"));
return make_ready_future<stop_iteration>(stop_iteration::no);
});
co_return res;
}
}
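`query_internal` above drives a paged query, invoking the row callback page by page until it returns `stop_iteration::yes` or the result set is exhausted. A rough Python model of that contract (all names hypothetical, not Scylla's actual API):

```python
from enum import Enum

class Stop(Enum):
    NO = 0
    YES = 1

def query_paged(fetch_page, consume_row, page_size=1000):
    """Rough model of a paged internal query: fetch_page(offset, n) returns
    up to n rows (an empty list when exhausted); consume_row is invoked per
    row and may return Stop.YES to end the iteration early."""
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            return
        for row in page:
            if consume_row(row) is Stop.YES:
                return
        offset += len(page)
```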


@@ -41,6 +41,10 @@ namespace cdc {
class streams_version;
} // namespace cdc
namespace service {
class storage_proxy;
}
namespace db {
class system_distributed_keyspace {
@@ -51,8 +55,16 @@ public:
/* Nodes use this table to communicate new CDC stream generations to other nodes. */
static constexpr auto CDC_TOPOLOGY_DESCRIPTION = "cdc_generation_descriptions";
/* This table is used by CDC clients to learn about avaliable CDC streams. */
static constexpr auto CDC_DESC = "cdc_streams_descriptions";
/* This table is used by CDC clients to learn about available CDC streams. */
static constexpr auto CDC_DESC_V2 = "cdc_streams_descriptions_v2";
/* Used by CDC clients to learn CDC generation timestamps. */
static constexpr auto CDC_TIMESTAMPS = "cdc_generation_timestamps";
/* Previous version of the "cdc_streams_descriptions_v2" table.
* We use it in the upgrade procedure to ensure that CDC generations appearing
* in the old table also appear in the new table, if necessary. */
static constexpr auto CDC_DESC_V1 = "cdc_streams_descriptions";
/* Information required to modify/query some system_distributed tables, passed from the caller. */
struct context {
@@ -62,17 +74,23 @@ public:
private:
cql3::query_processor& _qp;
service::migration_manager& _mm;
service::storage_proxy& _sp;
bool _started = false;
bool _forced_cdc_timestamps_schema_sync = false;
public:
/* Should writes to the given table always be synchronized by commitlog (flushed to disk)
* before being acknowledged? */
static bool is_extra_durable(const sstring& cf_name);
system_distributed_keyspace(cql3::query_processor&, service::migration_manager&);
system_distributed_keyspace(cql3::query_processor&, service::migration_manager&, service::storage_proxy&);
future<> start();
future<> stop();
bool started() const { return _started; }
future<std::unordered_map<utils::UUID, sstring>> view_status(sstring ks_name, sstring view_name) const;
future<> start_view_build(sstring ks_name, sstring view_name) const;
future<> finish_view_build(sstring ks_name, sstring view_name) const;
@@ -82,11 +100,18 @@ public:
future<std::optional<cdc::topology_description>> read_cdc_topology_description(db_clock::time_point streams_ts, context);
future<> expire_cdc_topology_description(db_clock::time_point streams_ts, db_clock::time_point expiration_time, context);
future<> create_cdc_desc(db_clock::time_point streams_ts, const std::vector<cdc::stream_id>&, context);
future<> create_cdc_desc(db_clock::time_point streams_ts, const cdc::topology_description&, context);
future<> expire_cdc_desc(db_clock::time_point streams_ts, db_clock::time_point expiration_time, context);
future<bool> cdc_desc_exists(db_clock::time_point streams_ts, context);
future<std::map<db_clock::time_point, cdc::streams_version>> cdc_get_versioned_streams(context);
/* Get all generation timestamps appearing in the "cdc_streams_descriptions" table
* (the old CDC stream description table). */
future<std::vector<db_clock::time_point>> get_cdc_desc_v1_timestamps(context);
future<std::map<db_clock::time_point, cdc::streams_version>> cdc_get_versioned_streams(db_clock::time_point not_older_than, context);
future<db_clock::time_point> cdc_current_generation_timestamp(context);
};
}


@@ -1574,6 +1574,21 @@ future<> update_cdc_streams_timestamp(db_clock::time_point tp) {
.discard_result().then([] { return force_blocking_flush(v3::CDC_LOCAL); });
}
static const sstring CDC_REWRITTEN_KEY = "rewritten";
future<> cdc_set_rewritten(std::optional<db_clock::time_point> tp) {
if (tp) {
return qctx->execute_cql(
format("INSERT INTO system.{} (key, streams_timestamp) VALUES (?, ?)", v3::CDC_LOCAL),
CDC_REWRITTEN_KEY, *tp).discard_result();
} else {
// Insert just the row marker.
return qctx->execute_cql(
format("INSERT INTO system.{} (key) VALUES (?)", v3::CDC_LOCAL),
CDC_REWRITTEN_KEY).discard_result();
}
}
future<> force_blocking_flush(sstring cfname) {
assert(qctx);
return qctx->_qp.invoke_on_all([cfname = std::move(cfname)] (cql3::query_processor& qp) {
@@ -1646,6 +1661,14 @@ future<std::optional<db_clock::time_point>> get_saved_cdc_streams_timestamp() {
});
}
future<bool> cdc_is_rewritten() {
// We don't care about the actual timestamp; it's additional information for debugging purposes.
return qctx->execute_cql(format("SELECT key FROM system.{} WHERE key = ?", v3::CDC_LOCAL), CDC_REWRITTEN_KEY)
.then([] (::shared_ptr<cql3::untyped_result_set> msg) {
return !msg->empty();
});
}
bool bootstrap_complete() {
return get_bootstrap_state() == bootstrap_state::COMPLETED;
}
@@ -1864,7 +1887,7 @@ future<> get_compaction_history(compaction_history_consumer&& f) {
return do_with(compaction_history_consumer(std::move(f)),
[](compaction_history_consumer& consumer) mutable {
sstring req = format("SELECT * from system.{}", COMPACTION_HISTORY);
return qctx->qp().query(req, [&consumer] (const cql3::untyped_result_set::row& row) mutable {
return qctx->qp().query_internal(req, [&consumer] (const cql3::untyped_result_set::row& row) mutable {
compaction_history_entry entry;
entry.id = row.get_as<utils::UUID>("id");
entry.ks = row.get_as<sstring>("keyspace_name");


@@ -634,5 +634,8 @@ future<> save_paxos_proposal(const schema& s, const service::paxos::proposal& pr
future<> save_paxos_decision(const schema& s, const service::paxos::proposal& decision, db::timeout_clock::time_point timeout);
future<> delete_paxos_decision(const schema& s, const partition_key& key, const utils::UUID& ballot, db::timeout_clock::time_point timeout);
future<bool> cdc_is_rewritten();
future<> cdc_set_rewritten(std::optional<db_clock::time_point>);
} // namespace system_keyspace
} // namespace db


@@ -244,12 +244,12 @@ if __name__ == "__main__":
# and https://cloud.google.com/compute/docs/disks/local-ssd#nvme
# note that scylla iotune might measure more, this is GCP recommended
mbs=1024*1024
if nr_disks >= 1 & nr_disks < 4:
if nr_disks >= 1 and nr_disks < 4:
disk_properties["read_iops"] = 170000 * nr_disks
disk_properties["read_bandwidth"] = 660 * mbs * nr_disks
disk_properties["write_iops"] = 90000 * nr_disks
disk_properties["write_bandwidth"] = 350 * mbs * nr_disks
elif nr_disks >= 4 & nr_disks <= 8:
elif nr_disks >= 4 and nr_disks <= 8:
disk_properties["read_iops"] = 680000
disk_properties["read_bandwidth"] = 2650 * mbs
disk_properties["write_iops"] = 360000
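The `&` → `and` fix above is not cosmetic: in Python, bitwise `&` binds tighter than comparisons, so the old expression was a chained comparison with the wrong meaning. A minimal illustration:

```python
nr_disks = 4

# Buggy: '&' binds tighter than comparisons, so this is the chained
# comparison  nr_disks >= (1 & nr_disks) < 4, i.e.
# (nr_disks >= (1 & nr_disks)) and ((1 & nr_disks) < 4).
# For nr_disks = 4: 1 & 4 == 0, so it evaluates to True, wrongly
# taking the 1-3 disk branch.
buggy = nr_disks >= 1 & nr_disks < 4

# Fixed: what was actually meant.
fixed = nr_disks >= 1 and nr_disks < 4

print(buggy, fixed)  # True False
```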
@@ -281,3 +281,5 @@ if __name__ == "__main__":
run_iotune()
else:
run_iotune()
os.chmod(etcdir() + '/scylla.d/io_properties.yaml', 0o644)
os.chmod(etcdir() + '/scylla.d/io.conf', 0o644)


@@ -82,6 +82,7 @@ def create_perftune_conf(cfg):
yaml = run('/opt/scylladb/scripts/perftune.py ' + params, shell=True, check=True, capture_output=True, encoding='utf-8').stdout.strip()
with open('/etc/scylla.d/perftune.yaml', 'w') as f:
f.write(yaml)
os.chmod('/etc/scylla.d/perftune.yaml', 0o644)
return True
else:
return False


@@ -27,6 +27,7 @@ import grp
import sys
import stat
import distro
from pathlib import Path
from scylla_util import *
from subprocess import run
@@ -85,8 +86,14 @@ if __name__ == '__main__':
raiddevs_to_try = [args.raiddev, ]
for fsdev in raiddevs_to_try:
raiddevname = os.path.basename(fsdev)
if not os.path.exists(f'/sys/block/{raiddevname}/md/array_state'):
array_state = Path(f'/sys/block/{raiddevname}/md/array_state')
# mdX is not allocated
if not array_state.exists():
break
with array_state.open() as f:
# allocated, but no devices, not running
if f.read().strip() == 'clear':
break
print(f'{fsdev} is already using')
else:
if args.raiddev is None:


@@ -176,11 +176,6 @@ def warn_offline(setup):
def warn_offline_missing_pkg(setup, pkg):
colorprint('{red}{setup} disabled by default, since {pkg} not available.{nocolor}', setup=setup, pkg=pkg)
def current_umask():
current = os.umask(0)
os.umask(current)
return current
if __name__ == '__main__':
if not is_nonroot() and os.getuid() > 0:
print('Requires root permission.')
@@ -331,12 +326,6 @@ if __name__ == '__main__':
selinux_reboot_required = False
set_clocksource = False
umask = current_umask()
# files have to be world-readable
if not is_nonroot() and (umask & 0o7) != 0o2:
colorprint('{red}Scylla does not work with current umask setting ({umask}),\nplease restore umask to the default value (0022).{nocolor}', umask='{0:o}'.format(umask).zfill(4))
sys.exit(1)
if interactive:
colorprint('{green}Skip any of the following steps by answering \'no\'{nocolor}')
@@ -375,11 +364,13 @@ if __name__ == '__main__':
if version_check:
with open('/etc/scylla.d/housekeeping.cfg', 'w') as f:
f.write('[housekeeping]\ncheck-version: True\n')
os.chmod('/etc/scylla.d/housekeeping.cfg', 0o644)
systemd_unit('scylla-housekeeping-daily.timer').unmask()
systemd_unit('scylla-housekeeping-restart.timer').unmask()
else:
with open('/etc/scylla.d/housekeeping.cfg', 'w') as f:
f.write('[housekeeping]\ncheck-version: False\n')
os.chmod('/etc/scylla.d/housekeeping.cfg', 0o644)
hk_daily = systemd_unit('scylla-housekeeping-daily.timer')
hk_daily.mask()
hk_daily.stop()


@@ -315,9 +315,10 @@ class gcp_instance:
logging.warning(
"This machine doesn't have enough CPUs for allocated number of NVMEs (at least 32 cpus for >=16 disks). Performance will suffer.")
return False
diskSize = self.firstNvmeSize
if diskCount < 1:
logging.warning("No ephemeral disks were found.")
return False
diskSize = self.firstNvmeSize
max_disktoramratio = 105
# 30:1 Disk/RAM ratio must be kept at least(AWS), we relax this a little bit
# on GCP we are OK with {max_disktoramratio}:1 , n1-standard-2 can cope with 1 disk, not more
@@ -380,6 +381,8 @@ class aws_instance:
raise Exception("found more than one disk mounted at root'".format(root_dev_candidates))
root_dev = root_dev_candidates[0].device
if root_dev == '/dev/root':
root_dev = run('findmnt -n -o SOURCE /', shell=True, check=True, capture_output=True, encoding='utf-8').stdout.strip()
nvmes_present = list(filter(nvme_re.match, os.listdir("/dev")))
return {"root": [ root_dev ], "ephemeral": [ x for x in nvmes_present if not root_dev.startswith(os.path.join("/dev/", x)) ] }


@@ -29,11 +29,11 @@ ifeq ($(product),scylla)
dh_installinit --no-start
else
dh_installinit --no-start --name scylla-server
dh_installinit --no-start --name scylla-node-exporter
endif
dh_installinit --no-start --name scylla-housekeeping-daily
dh_installinit --no-start --name scylla-housekeeping-restart
dh_installinit --no-start --name scylla-fstrim
dh_installinit --no-start --name node-exporter
override_dh_strip:
# The binaries (ethtool...patchelf) don't pass dh_strip after going through patchelf. Since they are


@@ -5,8 +5,8 @@ MAINTAINER Avi Kivity <avi@cloudius-systems.com>
ENV container docker
# The SCYLLA_REPO_URL argument specifies the URL to the RPM repository this Docker image uses to install Scylla. The default value is the Scylla's unstable RPM repository, which contains the daily build.
ARG SCYLLA_REPO_URL=http://downloads.scylladb.com/rpm/unstable/centos/master/latest/scylla.repo
ARG VERSION=666.development
ARG SCYLLA_REPO_URL=http://downloads.scylladb.com/rpm/unstable/centos/branch-4.4/latest/scylla.repo
ARG VERSION=4.4
ADD scylla_bashrc /scylla_bashrc


@@ -26,26 +26,31 @@ fi
print_usage() {
echo "build_offline_installer.sh --repo [URL]"
echo " --repo repository for fetching scylla rpm, specify .repo file URL"
echo " --releasever use specific minor version of the distribution repo (ex: 7.4)"
echo " --image [IMAGE] Use the specified docker IMAGE"
echo " --no-docker Build offline installer without using docker"
exit 1
}
is_rhel7_variant() {
[ "$ID" = "rhel" -o "$ID" = "ol" -o "$ID" = "centos" ] && [[ "$VERSION_ID" =~ ^7 ]]
}
here="$(realpath $(dirname "$0"))"
releasever=`rpm -q --provides $(rpm -q --whatprovides "system-release(releasever)") | grep "system-release(releasever)"| uniq | cut -d ' ' -f 3`
REPO=
RELEASEVER=
IMAGE=docker.io/centos:7
NO_DOCKER=false
while [ $# -gt 0 ]; do
case "$1" in
"--repo")
REPO=$2
shift 2
;;
"--releasever")
RELEASEVER=$2
"--image")
IMAGE=$2
shift 2
;;
"--no-docker")
NO_DOCKER=true
shift 1
;;
*)
print_usage
;;
@@ -59,25 +64,17 @@ if [ -z $REPO ]; then
exit 1
fi
if ! is_rhel7_variant; then
echo "Unsupported distribution"
exit 1
fi
if [ "$ID" = "centos" ]; then
if [ ! -f /etc/yum.repos.d/epel.repo ]; then
sudo yum install -y epel-release
if ! $NO_DOCKER; then
if [[ -f ~/.config/scylladb/dbuild ]]; then
. ~/.config/scylladb/dbuild
fi
RELEASE=7
else
if [ ! -f /etc/yum.repos.d/epel.repo ]; then
sudo rpm -Uvh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
if which docker >/dev/null 2>&1 ; then
tool=${DBUILD_TOOL-docker}
elif which podman >/dev/null 2>&1 ; then
tool=${DBUILD_TOOL-podman}
else
echo "Please make sure you install either podman or docker on this machine to run dbuild" && exit 1
fi
RELEASE=7Server
fi
if [ ! -f /usr/bin/yumdownloader ]; then
sudo yum -y install yum-utils
fi
if [ ! -f /usr/bin/wget ]; then
@@ -85,29 +82,55 @@ if [ ! -f /usr/bin/wget ]; then
fi
if [ ! -f /usr/bin/makeself ]; then
sudo yum -y install makeself
if $NO_DOCKER; then
# makeself on EPEL7 is too old, borrow it from EPEL8
# since there is no dependency on the package, it just work
if [ $release_major = '7' ]; then
sudo rpm --import https://dl.fedoraproject.org/pub/epel/RPM-GPG-KEY-EPEL-8
sudo cp "$here"/lib/epel8.repo /etc/yum.repos.d/
YUM_OPTS="--enablerepo=epel8"
elif [ $release_major = '8' ]; then
yum -y install epel-release || yum -y install https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
fi
fi
sudo yum -y install "$YUM_OPTS" makeself
fi
if [ ! -f /usr/bin/createrepo ]; then
sudo yum -y install createrepo
fi
sudo yum -y install yum-plugin-downloadonly
makeself_ver=$(makeself --version|cut -d ' ' -f 3|sed -e 's/\.//g')
if [ $makeself_ver -lt 240 ]; then
echo "$(makeself --version) is too old, please install 2.4.0 or later"
exit 1
fi
cd /etc/yum.repos.d/
sudo wget -N $REPO
cd -
sudo rm -rf build/installroot build/offline_installer build/scylla_offline_installer.sh
sudo rm -rf build/installroot build/offline_docker build/offline_installer build/scylla_offline_installer.sh
mkdir -p build/installroot
mkdir -p build/installroot/etc/yum/vars
sudo sh -c "echo $RELEASE >> build/installroot/etc/yum/vars/releasever"
mkdir -p build/offline_docker
wget "$REPO" -O build/offline_docker/scylla.repo
cp "$here"/lib/install_deps.sh build/offline_docker
cp "$here"/lib/Dockerfile.in build/offline_docker/Dockerfile
sed -i -e "s#@@IMAGE@@#$IMAGE#" build/offline_docker/Dockerfile
cd build/offline_docker
if $NO_DOCKER; then
sudo cp scylla.repo /etc/yum.repos.d/scylla.repo
sudo ./install_deps.sh
else
image_id=$($tool build -q .)
fi
cd -
mkdir -p build/offline_installer
cp dist/offline_installer/redhat/header build/offline_installer
if [ -n "$RELEASEVER" ]; then
YUMOPTS="--releasever=$RELEASEVER"
cp "$here"/lib/header build/offline_installer
if $NO_DOCKER; then
"$here"/lib/construct_offline_repo.sh
else
./tools/toolchain/dbuild --image "$image_id" -- "$here"/lib/construct_offline_repo.sh
fi
sudo yum -y install $YUMOPTS --downloadonly --installroot=`pwd`/build/installroot --downloaddir=build/offline_installer scylla sudo ntp ntpdate net-tools kernel-tools
(cd build/offline_installer; createrepo -v .)
(cd build; makeself offline_installer scylla_offline_installer.sh "Scylla offline package" ./header)
(cd build; makeself --keep-umask offline_installer scylla_offline_installer.sh "Scylla offline package" ./header)


@@ -0,0 +1,5 @@
FROM @@IMAGE@@
ADD install_deps.sh install_deps.sh
RUN ./install_deps.sh
ADD scylla.repo /etc/yum.repos.d/scylla.repo
CMD /bin/bash


@@ -0,0 +1,9 @@
#!/bin/bash -e
releasever=`rpm -q --provides $(rpm -q --whatprovides "system-release(releasever)") | grep "system-release(releasever)"| uniq | cut -d ' ' -f 3`
# Errors can be ignored since this is only needed when the files exist
cp /etc/yum/vars/* build/installroot/etc/yum/vars/ ||:
# run yum in non-root mode using fakeroot
fakeroot yum -y install --downloadonly --releasever="$releasever" --installroot=`pwd`/build/installroot --downloaddir=build/offline_installer scylla sudo chrony net-tools kernel-tools mdadm xfsprogs


@@ -0,0 +1,7 @@
[epel8]
name=Extra Packages for Enterprise Linux 8 - $basearch
#baseurl=https://download.fedoraproject.org/pub/epel/8/Everything/$basearch
metalink=https://mirrors.fedoraproject.org/metalink?repo=epel-8&arch=$basearch&infra=$infra&content=$contentdir
enabled=0
gpgcheck=1
countme=1


@@ -0,0 +1,12 @@
#!/bin/bash
. /etc/os-release
release_major=$(echo $VERSION_ID|sed -e 's/^\([0-9]*\)[^0-9]*.*/\1/')
if [ ! -f /etc/yum.repos.d/epel.repo ]; then
yum -y install epel-release || yum -y install https://dl.fedoraproject.org/pub/epel/epel-release-latest-"$release_major".noarch.rpm
fi
if [ ! -f /usr/bin/fakeroot ]; then
yum -y install fakeroot
fi


@@ -87,25 +87,13 @@ progresses and compatibility continues to improve.
* UpdateTable: Not supported.
* ListTables: Supported.
### Item Operations
* GetItem: Support almost complete except that projection expressions can
only ask for top-level attributes.
* PutItem: Support almost complete except that condition expressions can
only refer to top-level attributes.
* UpdateItem: Nested documents are supported but updates to nested attributes
are not (e.g., `SET a.b[3].c=val`), and neither are nested attributes in
condition expressions.
* DeleteItem: Mostly works, but again does not support nested attributes
in condition expressions.
* GetItem, PutItem, UpdateItem, DeleteItem fully supported.
### Batch Operations
* BatchGetItem: Almost complete except that projection expressions can only
ask for top-level attributes.
* BatchWriteItem: Supported. Doesn't limit the number of items (DynamoDB
limits to 25) or size of items (400 KB) or total request size (16 MB).
* BatchGetItem, BatchWriteItem fully supported.
Doesn't limit the number of items (DynamoDB limits to 25) or size of items
(400 KB) or total request size (16 MB).
### Scans
Scan and Query are mostly supported, with the following limitations:
* As above, projection expressions only support top-level attributes.
* The ScanFilter/QueryFilter parameter for filtering results is fully
supported, but the newer FilterExpression syntax is not yet supported.
* The "Select" option, which allows counting items instead of returning them,
is not yet supported.
### Secondary Indexes
@@ -297,11 +285,10 @@ policies" section.
DynamoDB allows attributes to be **nested** - a top-level attribute may
be a list or a map, and each of its elements may further be lists or
maps, etc. Alternator currently stores the entire content of a top-level
attribute as one JSON object. This is good enough for most needs, except
one DynamoDB feature which we cannot support safely: we cannot modify
a non-top-level attribute (e.g., a.b[3].c) directly without RMW. We plan
to fix this in a future version by rethinking the data model we use for
attributes, or rethinking our implementation of RMW (as explained above).
attribute as one JSON object. This means that UpdateItem requests which
want to modify a non-top-level attribute directly (e.g., a.b[3].c) need RMW:
Alternator implements such requests by reading the entire top-level
attribute a, modifying only a.b[3].c, and then writing back a.
```eval_rst
.. toctree::
@@ -309,4 +296,4 @@ attributes, or rethinking our implementation of RMW (as explained above).
getting-started
compatibility
```
```

View File

@@ -61,12 +61,6 @@ behave the same in Alternator. However, there are a few features which we have
not implemented yet. Unimplemented features return an error when used, so
they should be easy to detect. Here is a list of these unimplemented features:
* Missing support for **attribute paths** like `a.b[3].c`.
Nested attributes _are_ supported, but Alternator does not yet allow reading
or writing a piece of a nested attribute directly using an attribute path -
only top-level attributes can be read or written directly.
https://github.com/scylladb/scylla/issues/5024
* Currently in Alternator, a GSI (Global Secondary Index) can only be added
to a table at table creation time, unlike DynamoDB, which also allows adding
a GSI (but not an LSI) to an existing table using an UpdateTable operation.

View File

@@ -146,6 +146,15 @@ Next, the node starts gossiping the timestamp of the new generation together wit
}).get();
```
The node persists the currently gossiped timestamp in order to recover it on restart in the `system.cdc_local` table. This is the schema:
```
CREATE TABLE system.cdc_local (
key text PRIMARY KEY,
streams_timestamp timestamp
) ...
```
The timestamp is kept under the `"cdc_local"` key in the `streams_timestamp` column.
When other nodes learn about the generation, they'll extract it from the `cdc_generation_descriptions` table and save it using `cdc::metadata::insert(db_clock::time_point, topology_description&&)`.
Notice that nodes learn about the generation together with the new node's tokens. When they learn about its tokens they'll immediately start sending writes to the new node (in the case of bootstrapping, it will become a pending replica). But the old generation will still be operating for a minute or two. Thus colocation will be lost for a while. This problem will be fixed when the two-phase-commit approach is implemented.
@@ -157,9 +166,54 @@ Due to the need of maintaining colocation we don't allow the client to send writ
Suppose that a write is requested and the write coordinator's local clock has time `C` and the generation operating at time `C` has timestamp `T` (`T <= C`). Then we only allow the write if its timestamp is in the interval [`T`, `C + generation_leeway`), where `generation_leeway` is a small time-interval constant (e.g. 5 seconds).
Reason: we cannot allow writes before `T`, because they belong to the old generation whose token ranges might no longer refine the current vnodes, so the corresponding log write would not necessarily be colocated with the base write. We also cannot allow writes too far "into the future" because we don't know what generation will be operating at that time (the node which will introduce this generation might not have joined yet). But, as mentioned before, we assume that we'll learn about the next generation in time. Again --- the need for this assumption will be gone in a future patch.
### Streams description table
### Streams description tables
The `cdc_streams_descriptions` table in the `system_distributed` keyspace allows CDC clients to learn about available sets of streams and the time intervals they are operating at. Its definition is as follows (db/system_distributed_keyspace.cc):
The `cdc_streams_descriptions_v2` table in the `system_distributed` keyspace allows CDC clients to learn about available sets of streams and the time intervals they are operating at. Its definition is as follows (db/system_distributed_keyspace.cc):
```
return schema_builder(system_distributed_keyspace::NAME, system_distributed_keyspace::CDC_DESC_V2, {id})
/* The timestamp of this CDC generation. */
.with_column("time", timestamp_type, column_kind::partition_key)
/* For convenience, the list of stream IDs in this generation is split into token ranges
* which the stream IDs were mapped to (by the partitioner) when the generation was created. */
.with_column("range_end", long_type, column_kind::clustering_key)
/* The set of stream identifiers used in this CDC generation for the token range
* ending on `range_end`. */
.with_column("streams", cdc_streams_set_type)
.with_version(system_keyspace::generate_schema_version(id))
.build();
```
where
```
thread_local data_type cdc_stream_tuple_type = tuple_type_impl::get_instance({long_type, long_type});
thread_local data_type cdc_streams_set_type = set_type_impl::get_instance(cdc_stream_tuple_type, false);
```
This table contains each generation's timestamp (as partition key) and the set of stream IDs used by this generation grouped by token ranges that the stream IDs are mapped to. It is meant to be user-facing, in contrast to `cdc_generation_descriptions` which is used internally.
There is a second table that contains just the generations' timestamps, `cdc_generation_timestamps`:
```
return schema_builder(system_distributed_keyspace::NAME, system_distributed_keyspace::CDC_TIMESTAMPS, {id})
/* This is a single-partition table. The partition key is always "timestamps". */
.with_column("key", utf8_type, column_kind::partition_key)
/* The timestamp of this CDC generation. */
.with_column("time", timestamp_type, column_kind::clustering_key)
/* Expiration time of this CDC generation (or null if not expired). */
.with_column("expired", timestamp_type)
.with_version(system_keyspace::generate_schema_version(id))
.build();
```
It is a single-partition table, containing the timestamps of generations found in `cdc_streams_descriptions_v2` in separate clustered rows. It allows clients to efficiently query if there are any new generations, e.g.:
```
SELECT time FROM system_distributed.cdc_generation_timestamps WHERE time > X
```
where `X` is the last timestamp known by that particular client.
When nodes learn about a CDC generation through gossip, they race to update these description tables by first inserting the set of rows containing this generation's stream IDs into `cdc_streams_descriptions_v2` and then, if the node succeeds, by inserting its timestamp into `cdc_generation_timestamps` (see `cdc::update_streams_description`). This operation is idempotent so it doesn't matter if multiple nodes do it at the same time.
Note that the first phase of inserting stream IDs may fail in the middle; in that case, the partition for that generation may contain partial information. Thus a client can only safely read a partition from `cdc_streams_descriptions_v2` (i.e. without the risk of observing only a part of the stream IDs) if they first observe its timestamp in `cdc_generation_timestamps`.
### Streams description table V1 and rewriting
As the name suggests, `cdc_streams_descriptions_v2` is the second version of the streams description table. The previous schema was:
```
return schema_builder(system_distributed_keyspace::NAME, system_distributed_keyspace::CDC_DESC, {id})
/* The timestamp of this CDC generation. */
@@ -171,14 +225,26 @@ The `cdc_streams_descriptions` table in the `system_distributed` keyspace allows
.with_version(system_keyspace::generate_schema_version(id))
.build();
```
where
```
thread_local data_type cdc_stream_tuple_type = tuple_type_impl::get_instance({long_type, long_type});
thread_local data_type cdc_streams_set_type = set_type_impl::get_instance(cdc_stream_tuple_type, false);
```
This table simply contains each generation's timestamp (as partition key) and the set of stream IDs used by this generation. It is meant to be user-facing, in contrast to `cdc_generation_descriptions` which is used internally.
When nodes learn about a CDC generation through gossip, they race to update the description table by inserting a proper row (see `cdc::update_streams_description`). This operation is idempotent so it doesn't matter if multiple nodes do it at the same time.
The entire set of stream IDs (for all token ranges) was stored as a single collection. With large clusters the collection could grow quite big: for example, with 100 nodes, 64 shards each, and 256 vnodes per node, a new generation would contain 1.6M stream IDs, resulting in a ~32MB collection. For reasons described in issue #7993 this disqualified the previous schema.
However, that was the schema used in the Scylla 4.3 release. For clusters that used CDC with this schema we need to ensure that stream descriptions residing in the old table appear in the new table as well (if necessary, i.e. if these streams may still contain some data).
To do that, we perform a rewrite procedure. Each node does the following on restart:
1. Check if the `system_distributed.cdc_streams_descriptions` table exists. If it doesn't, there's nothing to rewrite, so stop.
2. Check if the `system.cdc_local` table contains a row with `key = "rewritten"`. If it does, the rewrite was already performed, so stop.
3. Check if there is a table with CDC enabled. If not, add a row with `key = "rewritten"` to `system.cdc_local` and stop; no rewriting is necessary (and won't be) since old generations - even if they exist - are not needed.
4. Retrieve all generation timestamps from the old streams description table by performing a full range scan: `select time from system_distributed.cdc_streams_descriptions`. This may be a long/expensive operation, hence it's performed in a background task (the procedure is moved to background in this step).
5. Filter out timestamps that are "too old". A generation timestamp is "too old" if there is a greater timestamp `T` such that for every table with CDC enabled, `now - ttl > T`, where `now` is the current time and `ttl` is the table's TTL setting. This means that the table cannot contain data that belongs to the "too old" generation. Thus, if each table passes this check for a given generation, that generation doesn't need to be rewritten.
6. For each timestamp that's left:
6.1 if it's already present in the new table, skip it (we check this by querying `cdc_generation_timestamps`)
6.2 fetch the generation (by querying `cdc_generation_descriptions`)
6.3 insert the generation's streams into the new table
7. Insert a row with `key = "rewritten"` into `system.cdc_local`.
Note that every node will perform this procedure on upgrade, but there's a high chance that only one of them actually proceeds all the way to step 6.2 if upgrade is performed correctly, i.e. in a rolling fashion (nodes are restarted one-by-one).
In order to prevent new nodes from doing the rewriting (we only want upgrading nodes to do it), we insert the `key = "rewritten"` row on bootstrap as well, before we start this procedure (so the node won't pass the second check).
#### TODO: expired generations
The `expired` column in `cdc_streams_descriptions` and `cdc_generation_descriptions` means that this generation was superseded by some new generation and will soon be removed (its table entry will be gone). This functionality is yet to be implemented.
The `expired` column in `cdc_generation_timestamps` and `cdc_generation_descriptions` means that this generation was superseded by some new generation and will soon be removed (its table entry will be gone). This functionality is yet to be implemented.

View File

@@ -1792,6 +1792,8 @@ future<> gossiper::do_shadow_round(std::unordered_set<gms::inet_address> nodes,
}).handle_exception_type([node, &fall_back_to_syn_msg] (seastar::rpc::unknown_verb_error&) {
logger.warn("Node {} does not support get_endpoint_states verb", node);
fall_back_to_syn_msg = true;
}).handle_exception_type([node, &nodes_down] (seastar::rpc::timeout_error&) {
logger.warn("The get_endpoint_states verb to node {} timed out", node);
}).handle_exception_type([node, &nodes_down] (seastar::rpc::closed_error&) {
nodes_down++;
logger.warn("Node {} is down for get_endpoint_states verb", node);

View File

@@ -103,22 +103,3 @@ enum class repair_row_level_start_status: uint8_t {
struct repair_row_level_start_response {
repair_row_level_start_status status;
};
enum class node_ops_cmd : uint32_t {
removenode_prepare,
removenode_heartbeat,
removenode_sync_data,
removenode_abort,
removenode_done,
};
struct node_ops_cmd_request {
node_ops_cmd cmd;
utils::UUID ops_uuid;
std::list<gms::inet_address> ignore_nodes;
std::list<gms::inet_address> leaving_nodes;
};
struct node_ops_cmd_response {
bool ok;
};

View File

@@ -335,7 +335,6 @@ public:
void remove_bootstrap_tokens(std::unordered_set<token> tokens);
void add_leaving_endpoint(inet_address endpoint);
void del_leaving_endpoint(inet_address endpoint);
public:
void remove_endpoint(inet_address endpoint);
#if 0
@@ -1658,10 +1657,6 @@ void token_metadata_impl::add_leaving_endpoint(inet_address endpoint) {
_leaving_endpoints.emplace(endpoint);
}
void token_metadata_impl::del_leaving_endpoint(inet_address endpoint) {
_leaving_endpoints.erase(endpoint);
}
void token_metadata_impl::add_replacing_endpoint(inet_address existing_node, inet_address replacing_node) {
tlogger.info("Added node {} as pending replacing endpoint which replaces existing node {}",
replacing_node, existing_node);
@@ -1932,11 +1927,6 @@ token_metadata::add_leaving_endpoint(inet_address endpoint) {
_impl->add_leaving_endpoint(endpoint);
}
void
token_metadata::del_leaving_endpoint(inet_address endpoint) {
_impl->del_leaving_endpoint(endpoint);
}
void
token_metadata::remove_endpoint(inet_address endpoint) {
_impl->remove_endpoint(endpoint);

View File

@@ -238,7 +238,6 @@ public:
void remove_bootstrap_tokens(std::unordered_set<token> tokens);
void add_leaving_endpoint(inet_address endpoint);
void del_leaving_endpoint(inet_address endpoint);
void remove_endpoint(inet_address endpoint);

View File

@@ -1063,7 +1063,7 @@ int main(int ac, char** av) {
gms::stop_gossiping().get();
});
sys_dist_ks.start(std::ref(qp), std::ref(mm)).get();
sys_dist_ks.start(std::ref(qp), std::ref(mm), std::ref(proxy)).get();
ss.init_server().get();
sst_format_selector.sync();

View File

@@ -477,6 +477,7 @@ static constexpr unsigned do_get_rpc_client_idx(messaging_verb verb) {
// as well as reduce latency as there are potentially many requests
// blocked on schema version request.
case messaging_verb::GOSSIP_DIGEST_SYN:
case messaging_verb::GOSSIP_DIGEST_ACK:
case messaging_verb::GOSSIP_DIGEST_ACK2:
case messaging_verb::GOSSIP_SHUTDOWN:
case messaging_verb::GOSSIP_ECHO:
@@ -504,7 +505,6 @@ static constexpr unsigned do_get_rpc_client_idx(messaging_verb verb) {
case messaging_verb::REPAIR_GET_ROW_DIFF_WITH_RPC_STREAM:
case messaging_verb::REPAIR_PUT_ROW_DIFF_WITH_RPC_STREAM:
case messaging_verb::REPAIR_GET_FULL_ROW_HASHES_WITH_RPC_STREAM:
case messaging_verb::NODE_OPS_CMD:
case messaging_verb::HINT_MUTATION:
return 1;
case messaging_verb::CLIENT_ID:
@@ -512,7 +512,6 @@ static constexpr unsigned do_get_rpc_client_idx(messaging_verb verb) {
case messaging_verb::READ_DATA:
case messaging_verb::READ_MUTATION_DATA:
case messaging_verb::READ_DIGEST:
case messaging_verb::GOSSIP_DIGEST_ACK:
case messaging_verb::DEFINITIONS_UPDATE:
case messaging_verb::TRUNCATE:
case messaging_verb::MIGRATION_REQUEST:
@@ -1350,17 +1349,6 @@ future<std::vector<row_level_diff_detect_algorithm>> messaging_service::send_rep
return send_message<future<std::vector<row_level_diff_detect_algorithm>>>(this, messaging_verb::REPAIR_GET_DIFF_ALGORITHMS, std::move(id));
}
// Wrapper for NODE_OPS_CMD
void messaging_service::register_node_ops_cmd(std::function<future<node_ops_cmd_response> (const rpc::client_info& cinfo, node_ops_cmd_request)>&& func) {
register_handler(this, messaging_verb::NODE_OPS_CMD, std::move(func));
}
future<> messaging_service::unregister_node_ops_cmd() {
return unregister_handler(messaging_verb::NODE_OPS_CMD);
}
future<node_ops_cmd_response> messaging_service::send_node_ops_cmd(msg_addr id, node_ops_cmd_request req) {
return send_message<future<node_ops_cmd_response>>(this, messaging_verb::NODE_OPS_CMD, std::move(id), std::move(req));
}
void
messaging_service::register_paxos_prepare(std::function<future<foreign_ptr<std::unique_ptr<service::paxos::prepare_response>>>(
const rpc::client_info&, rpc::opt_time_point, query::read_command cmd, partition_key key, utils::UUID ballot,

View File

@@ -143,8 +143,7 @@ enum class messaging_verb : int32_t {
HINT_MUTATION = 42,
PAXOS_PRUNE = 43,
GOSSIP_GET_ENDPOINT_STATES = 44,
NODE_OPS_CMD = 45,
LAST = 46,
LAST = 45,
};
} // namespace netw
@@ -395,11 +394,6 @@ public:
future<> unregister_repair_get_diff_algorithms();
future<std::vector<row_level_diff_detect_algorithm>> send_repair_get_diff_algorithms(msg_addr id);
// Wrapper for NODE_OPS_CMD
void register_node_ops_cmd(std::function<future<node_ops_cmd_response> (const rpc::client_info& cinfo, node_ops_cmd_request)>&& func);
future<> unregister_node_ops_cmd();
future<node_ops_cmd_response> send_node_ops_cmd(msg_addr id, node_ops_cmd_request);
// Wrapper for GOSSIP_ECHO verb
void register_gossip_echo(std::function<future<> ()>&& func);
future<> unregister_gossip_echo();

View File

@@ -0,0 +1,52 @@
/*
* Copyright (C) 2021 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "feed_writers.hh"
namespace mutation_writer {
bucket_writer::bucket_writer(schema_ptr schema, std::pair<flat_mutation_reader, queue_reader_handle> queue_reader, reader_consumer& consumer)
: _schema(schema)
, _handle(std::move(queue_reader.second))
, _consume_fut(consumer(std::move(queue_reader.first)))
{ }
bucket_writer::bucket_writer(schema_ptr schema, reader_permit permit, reader_consumer& consumer)
: bucket_writer(schema, make_queue_reader(schema, std::move(permit)), consumer)
{ }
future<> bucket_writer::consume(mutation_fragment mf) {
return _handle.push(std::move(mf));
}
void bucket_writer::consume_end_of_stream() {
_handle.push_end_of_stream();
}
void bucket_writer::abort(std::exception_ptr ep) noexcept {
_handle.abort(std::move(ep));
}
future<> bucket_writer::close() noexcept {
return std::move(_consume_fut);
}
} // mutation_writer

View File

@@ -22,10 +22,31 @@
#pragma once
#include "flat_mutation_reader.hh"
#include "mutation_reader.hh"
namespace mutation_writer {
using reader_consumer = noncopyable_function<future<> (flat_mutation_reader)>;
class bucket_writer {
schema_ptr _schema;
queue_reader_handle _handle;
future<> _consume_fut;
private:
bucket_writer(schema_ptr schema, std::pair<flat_mutation_reader, queue_reader_handle> queue_reader, reader_consumer& consumer);
public:
bucket_writer(schema_ptr schema, reader_permit permit, reader_consumer& consumer);
future<> consume(mutation_fragment mf);
void consume_end_of_stream();
void abort(std::exception_ptr ep) noexcept;
future<> close() noexcept;
};
template <typename Writer>
requires MutationFragmentConsumer<Writer, future<>>
future<> feed_writer(flat_mutation_reader&& rd, Writer&& wr) {
@@ -40,9 +61,17 @@ future<> feed_writer(flat_mutation_reader&& rd, Writer&& wr) {
if (f.failed()) {
auto ex = f.get_exception();
wr.abort(ex);
return make_exception_future<>(ex);
return wr.close().then_wrapped([ex = std::move(ex)] (future<> f) mutable {
if (f.failed()) {
// The consumer is expected to fail when aborted,
// so just ignore any exception.
(void)f.get_exception();
}
return make_exception_future<>(std::move(ex));
});
} else {
return wr.consume_end_of_stream();
wr.consume_end_of_stream();
return wr.close();
}
});
});

View File

@@ -31,36 +31,7 @@
namespace mutation_writer {
class shard_based_splitting_mutation_writer {
class shard_writer {
queue_reader_handle _handle;
future<> _consume_fut;
private:
shard_writer(schema_ptr schema, std::pair<flat_mutation_reader, queue_reader_handle> queue_reader, reader_consumer& consumer)
: _handle(std::move(queue_reader.second))
, _consume_fut(consumer(std::move(queue_reader.first))) {
}
public:
shard_writer(schema_ptr schema, reader_permit permit, reader_consumer& consumer)
: shard_writer(schema, make_queue_reader(schema, std::move(permit)), consumer) {
}
future<> consume(mutation_fragment mf) {
return _handle.push(std::move(mf));
}
future<> consume_end_of_stream() {
// consume_end_of_stream is always called from a finally block,
// and that's because we wait for _consume_fut to return. We
// don't want to generate another exception here if the read was
// aborted.
if (!_handle.is_terminated()) {
_handle.push_end_of_stream();
}
return std::move(_consume_fut);
}
void abort(std::exception_ptr ep) {
_handle.abort(ep);
}
};
using shard_writer = bucket_writer;
private:
schema_ptr _schema;
@@ -105,13 +76,12 @@ public:
return write_to_shard(mutation_fragment(*_schema, _permit, std::move(pe)));
}
future<> consume_end_of_stream() {
return parallel_for_each(_shards, [] (std::optional<shard_writer>& shard) {
if (!shard) {
return make_ready_future<>();
void consume_end_of_stream() {
for (auto& shard : _shards) {
if (shard) {
shard->consume_end_of_stream();
}
return shard->consume_end_of_stream();
});
}
}
void abort(std::exception_ptr ep) {
for (auto&& shard : _shards) {
@@ -120,6 +90,11 @@ public:
}
}
}
future<> close() noexcept {
return parallel_for_each(_shards, [] (std::optional<shard_writer>& shard) {
return shard ? shard->close() : make_ready_future<>();
});
}
};
future<> segregate_by_shard(flat_mutation_reader producer, reader_consumer consumer) {

View File

@@ -109,22 +109,12 @@ small_flat_map<Key, Value, Size>::find(const key_type& k) {
class timestamp_based_splitting_mutation_writer {
using bucket_id = int64_t;
class bucket_writer {
schema_ptr _schema;
queue_reader_handle _handle;
future<> _consume_fut;
class timestamp_bucket_writer : public bucket_writer {
bool _has_current_partition = false;
private:
bucket_writer(schema_ptr schema, std::pair<flat_mutation_reader, queue_reader_handle> queue_reader, reader_consumer& consumer)
: _schema(std::move(schema))
, _handle(std::move(queue_reader.second))
, _consume_fut(consumer(std::move(queue_reader.first))) {
}
public:
bucket_writer(schema_ptr schema, reader_permit permit, reader_consumer& consumer)
: bucket_writer(schema, make_queue_reader(schema, std::move(permit)), consumer) {
timestamp_bucket_writer(schema_ptr schema, reader_permit permit, reader_consumer& consumer)
: bucket_writer(schema, std::move(permit), consumer) {
}
void set_has_current_partition() {
_has_current_partition = true;
@@ -135,18 +125,6 @@ class timestamp_based_splitting_mutation_writer {
bool has_current_partition() const {
return _has_current_partition;
}
future<> consume(mutation_fragment mf) {
return _handle.push(std::move(mf));
}
future<> consume_end_of_stream() {
if (!_handle.is_terminated()) {
_handle.push_end_of_stream();
}
return std::move(_consume_fut);
}
void abort(std::exception_ptr ep) {
_handle.abort(ep);
}
};
private:
@@ -155,7 +133,7 @@ private:
classify_by_timestamp _classifier;
reader_consumer _consumer;
partition_start _current_partition_start;
std::unordered_map<bucket_id, bucket_writer> _buckets;
std::unordered_map<bucket_id, timestamp_bucket_writer> _buckets;
std::vector<bucket_id> _buckets_used_for_current_partition;
private:
@@ -186,16 +164,21 @@ public:
future<> consume(range_tombstone&& rt);
future<> consume(partition_end&& pe);
future<> consume_end_of_stream() {
return parallel_for_each(_buckets, [] (std::pair<const bucket_id, bucket_writer>& bucket) {
return bucket.second.consume_end_of_stream();
});
void consume_end_of_stream() {
for (auto& b : _buckets) {
b.second.consume_end_of_stream();
}
}
void abort(std::exception_ptr ep) {
for (auto&& b : _buckets) {
b.second.abort(ep);
}
}
future<> close() noexcept {
return parallel_for_each(_buckets, [] (std::pair<const bucket_id, timestamp_bucket_writer>& b) {
return b.second.close();
});
}
};
future<> timestamp_based_splitting_mutation_writer::write_to_bucket(bucket_id bucket, mutation_fragment&& mf) {

View File

@@ -54,14 +54,6 @@ logging::logger rlogger("repair");
static sharded<netw::messaging_service>* _messaging;
void node_ops_info::check_abort() {
if (abort) {
auto msg = format("Node operation with ops_uuid={} is aborted", ops_uuid);
rlogger.warn("{}", msg);
throw std::runtime_error(msg);
}
}
class node_ops_metrics {
public:
node_ops_metrics() {
@@ -443,16 +435,6 @@ void tracker::abort_all_repairs() {
rlogger.info0("Aborted {} repair job(s)", count);
}
void tracker::abort_repair_node_ops(utils::UUID ops_uuid) {
for (auto& x : _repairs[this_shard_id()]) {
auto& ri = x.second;
if (ri->ops_uuid() && ri->ops_uuid().value() == ops_uuid) {
rlogger.info0("Aborted repair jobs for ops_uuid={}", ops_uuid);
ri->abort();
}
}
}
float tracker::report_progress(streaming::stream_reason reason) {
uint64_t nr_ranges_finished = 0;
uint64_t nr_ranges_total = 0;
@@ -811,8 +793,7 @@ repair_info::repair_info(seastar::sharded<database>& db_,
repair_uniq_id id_,
const std::vector<sstring>& data_centers_,
const std::vector<sstring>& hosts_,
streaming::stream_reason reason_,
std::optional<utils::UUID> ops_uuid)
streaming::stream_reason reason_)
: db(db_)
, messaging(ms_)
, sharder(get_sharder_for_tables(db_, keyspace_, table_ids_))
@@ -826,8 +807,7 @@ repair_info::repair_info(seastar::sharded<database>& db_,
, hosts(hosts_)
, reason(reason_)
, nr_ranges_total(ranges.size())
, _row_level_repair(db.local().features().cluster_supports_row_level_repair())
, _ops_uuid(std::move(ops_uuid)) {
, _row_level_repair(db.local().features().cluster_supports_row_level_repair()) {
}
future<> repair_info::do_streaming() {
@@ -1646,7 +1626,7 @@ static int do_repair_start(seastar::sharded<database>& db, seastar::sharded<netw
_node_ops_metrics.repair_total_ranges_sum += ranges.size();
auto ri = make_lw_shared<repair_info>(db, ms,
std::move(keyspace), std::move(ranges), std::move(table_ids),
id, std::move(data_centers), std::move(hosts), streaming::stream_reason::repair, id.uuid);
id, std::move(data_centers), std::move(hosts), streaming::stream_reason::repair);
return repair_ranges(ri);
});
repair_results.push_back(std::move(f));
@@ -1716,15 +1696,14 @@ static future<> sync_data_using_repair(seastar::sharded<database>& db,
sstring keyspace,
dht::token_range_vector ranges,
std::unordered_map<dht::token_range, repair_neighbors> neighbors,
streaming::stream_reason reason,
std::optional<utils::UUID> ops_uuid) {
streaming::stream_reason reason) {
if (ranges.empty()) {
return make_ready_future<>();
}
return smp::submit_to(0, [&db, &ms, keyspace = std::move(keyspace), ranges = std::move(ranges), neighbors = std::move(neighbors), reason, ops_uuid] () mutable {
return smp::submit_to(0, [&db, &ms, keyspace = std::move(keyspace), ranges = std::move(ranges), neighbors = std::move(neighbors), reason] () mutable {
repair_uniq_id id = repair_tracker().next_repair_command();
rlogger.info("repair id {} to sync data for keyspace={}, status=started", id, keyspace);
return repair_tracker().run(id, [id, &db, &ms, keyspace, ranges = std::move(ranges), neighbors = std::move(neighbors), reason, ops_uuid] () mutable {
return repair_tracker().run(id, [id, &db, &ms, keyspace, ranges = std::move(ranges), neighbors = std::move(neighbors), reason] () mutable {
auto cfs = list_column_families(db.local(), keyspace);
if (cfs.empty()) {
rlogger.warn("repair id {} to sync data for keyspace={}, no table in this keyspace", id, keyspace);
@@ -1734,12 +1713,12 @@ static future<> sync_data_using_repair(seastar::sharded<database>& db,
std::vector<future<>> repair_results;
repair_results.reserve(smp::count);
for (auto shard : boost::irange(unsigned(0), smp::count)) {
auto f = db.invoke_on(shard, [&db, &ms, keyspace, table_ids, id, ranges, neighbors, reason, ops_uuid] (database& localdb) mutable {
auto f = db.invoke_on(shard, [&db, &ms, keyspace, table_ids, id, ranges, neighbors, reason] (database& localdb) mutable {
auto data_centers = std::vector<sstring>();
auto hosts = std::vector<sstring>();
auto ri = make_lw_shared<repair_info>(db, ms,
std::move(keyspace), std::move(ranges), std::move(table_ids),
id, std::move(data_centers), std::move(hosts), reason, ops_uuid);
id, std::move(data_centers), std::move(hosts), reason);
ri->neighbors = std::move(neighbors);
return repair_ranges(ri);
});
@@ -1933,16 +1912,16 @@ future<> bootstrap_with_repair(seastar::sharded<database>& db, seastar::sharded<
}
}
auto nr_ranges = desired_ranges.size();
sync_data_using_repair(db, ms, keyspace_name, std::move(desired_ranges), std::move(range_sources), reason, {}).get();
sync_data_using_repair(db, ms, keyspace_name, std::move(desired_ranges), std::move(range_sources), reason).get();
rlogger.info("bootstrap_with_repair: finished with keyspace={}, nr_ranges={}", keyspace_name, nr_ranges);
}
rlogger.info("bootstrap_with_repair: finished with keyspaces={}", keyspaces);
});
}
static future<> do_decommission_removenode_with_repair(seastar::sharded<database>& db, seastar::sharded<netw::messaging_service>& ms, locator::token_metadata_ptr tmptr, gms::inet_address leaving_node, shared_ptr<node_ops_info> ops) {
static future<> do_decommission_removenode_with_repair(seastar::sharded<database>& db, seastar::sharded<netw::messaging_service>& ms, locator::token_metadata_ptr tmptr, gms::inet_address leaving_node) {
using inet_address = gms::inet_address;
return seastar::async([&db, &ms, tmptr = std::move(tmptr), leaving_node = std::move(leaving_node), ops] () mutable {
return seastar::async([&db, &ms, tmptr = std::move(tmptr), leaving_node = std::move(leaving_node)] () mutable {
auto myip = utils::fb_utilities::get_broadcast_address();
auto keyspaces = db.local().get_non_system_keyspaces();
bool is_removenode = myip != leaving_node;
@@ -2001,9 +1980,6 @@ static future<> do_decommission_removenode_with_repair(seastar::sharded<database
auto local_dc = get_local_dc();
bool find_node_in_local_dc_only = strat.get_type() == locator::replication_strategy_type::network_topology;
for (auto&r : ranges) {
if (ops) {
ops->check_abort();
}
auto end_token = r.end() ? r.end()->value() : dht::maximum_token();
const std::vector<inet_address> new_eps = ks.get_replication_strategy().calculate_natural_endpoints(end_token, temp, utils::can_yield::yes);
const std::vector<inet_address>& current_eps = current_replica_endpoints[r];
@@ -2085,12 +2061,6 @@ static future<> do_decommission_removenode_with_repair(seastar::sharded<database
}
neighbors_set.erase(myip);
neighbors_set.erase(leaving_node);
// Remove nodes in ignore_nodes
if (ops) {
for (const auto& node : ops->ignore_nodes) {
neighbors_set.erase(node);
}
}
auto neighbors = boost::copy_range<std::vector<gms::inet_address>>(neighbors_set |
boost::adaptors::filtered([&local_dc, &snitch_ptr] (const gms::inet_address& node) {
return snitch_ptr->get_datacenter(node) == local_dc;
@@ -2102,10 +2072,9 @@ static future<> do_decommission_removenode_with_repair(seastar::sharded<database
rlogger.debug("{}: keyspace={}, range={}, current_replica_endpoints={}, new_replica_endpoints={}, neighbors={}, skipped",
op, keyspace_name, r, current_eps, new_eps, neighbors);
} else {
std::vector<gms::inet_address> mandatory_neighbors = is_removenode ? neighbors : std::vector<gms::inet_address>{};
rlogger.info("{}: keyspace={}, range={}, current_replica_endpoints={}, new_replica_endpoints={}, neighbors={}, mandatory_neighbor={}",
op, keyspace_name, r, current_eps, new_eps, neighbors, mandatory_neighbors);
range_sources[r] = repair_neighbors(std::move(neighbors), std::move(mandatory_neighbors));
rlogger.debug("{}: keyspace={}, range={}, current_replica_endpoints={}, new_replica_endpoints={}, neighbors={}",
op, keyspace_name, r, current_eps, new_eps, neighbors);
range_sources[r] = repair_neighbors(std::move(neighbors));
if (is_removenode) {
ranges_for_removenode.push_back(r);
}
@@ -2125,8 +2094,7 @@ static future<> do_decommission_removenode_with_repair(seastar::sharded<database
ranges.swap(ranges_for_removenode);
}
auto nr_ranges_synced = ranges.size();
std::optional<utils::UUID> opt_uuid = ops ? std::make_optional<utils::UUID>(ops->ops_uuid) : std::nullopt;
sync_data_using_repair(db, ms, keyspace_name, std::move(ranges), std::move(range_sources), reason, opt_uuid).get();
sync_data_using_repair(db, ms, keyspace_name, std::move(ranges), std::move(range_sources), reason).get();
rlogger.info("{}: finished with keyspace={}, leaving_node={}, nr_ranges={}, nr_ranges_synced={}, nr_ranges_skipped={}",
op, keyspace_name, leaving_node, nr_ranges_total, nr_ranges_synced, nr_ranges_skipped);
}
@@ -2135,17 +2103,11 @@ static future<> do_decommission_removenode_with_repair(seastar::sharded<database
}
future<> decommission_with_repair(seastar::sharded<database>& db, seastar::sharded<netw::messaging_service>& ms, locator::token_metadata_ptr tmptr) {
return do_decommission_removenode_with_repair(db, ms, std::move(tmptr), utils::fb_utilities::get_broadcast_address(), {});
return do_decommission_removenode_with_repair(db, ms, std::move(tmptr), utils::fb_utilities::get_broadcast_address());
}
future<> removenode_with_repair(seastar::sharded<database>& db, seastar::sharded<netw::messaging_service>& ms, locator::token_metadata_ptr tmptr, gms::inet_address leaving_node, shared_ptr<node_ops_info> ops) {
return do_decommission_removenode_with_repair(db, ms, std::move(tmptr), std::move(leaving_node), std::move(ops));
}
future<> abort_repair_node_ops(utils::UUID ops_uuid) {
return smp::invoke_on_all([ops_uuid] {
return repair_tracker().abort_repair_node_ops(ops_uuid);
});
future<> removenode_with_repair(seastar::sharded<database>& db, seastar::sharded<netw::messaging_service>& ms, locator::token_metadata_ptr tmptr, gms::inet_address leaving_node) {
return do_decommission_removenode_with_repair(db, ms, std::move(tmptr), std::move(leaving_node));
}
static future<> do_rebuild_replace_with_repair(seastar::sharded<database>& db, seastar::sharded<netw::messaging_service>& ms, locator::token_metadata_ptr tmptr, sstring op, sstring source_dc, streaming::stream_reason reason) {
@@ -2220,7 +2182,7 @@ static future<> do_rebuild_replace_with_repair(seastar::sharded<database>& db, s
}).get();
}
auto nr_ranges = ranges.size();
sync_data_using_repair(db, ms, keyspace_name, std::move(ranges), std::move(range_sources), reason, {}).get();
sync_data_using_repair(db, ms, keyspace_name, std::move(ranges), std::move(range_sources), reason).get();
rlogger.info("{}: finished with keyspace={}, source_dc={}, nr_ranges={}", op, keyspace_name, source_dc, nr_ranges);
}
rlogger.info("{}: finished with keyspaces={}, source_dc={}", op, keyspaces, source_dc);
@@ -2258,19 +2220,12 @@ static future<> init_messaging_service_handler(sharded<database>& db, sharded<ne
return checksum_range(db, keyspace, cf, range, hv);
});
});
ms.register_node_ops_cmd([] (const rpc::client_info& cinfo, node_ops_cmd_request req) {
auto src_cpu_id = cinfo.retrieve_auxiliary<uint32_t>("src_cpu_id");
auto coordinator = cinfo.retrieve_auxiliary<gms::inet_address>("baddr");
return smp::submit_to(src_cpu_id % smp::count, [coordinator, req = std::move(req)] () mutable {
return service::get_local_storage_service().node_ops_cmd_handler(coordinator, std::move(req));
});
});
});
}
static future<> uninit_messaging_service_handler() {
return _messaging->invoke_on_all([] (auto& ms) {
return when_all_succeed(ms.unregister_repair_checksum_range(), ms.unregister_node_ops_cmd()).discard_result();
return ms.unregister_repair_checksum_range();
});
}


@@ -76,22 +76,13 @@ struct repair_uniq_id {
};
std::ostream& operator<<(std::ostream& os, const repair_uniq_id& x);
struct node_ops_info {
utils::UUID ops_uuid;
bool abort = false;
std::list<gms::inet_address> ignore_nodes;
void check_abort();
};
// The tokens are the tokens assigned to the bootstrap node.
future<> bootstrap_with_repair(seastar::sharded<database>& db, seastar::sharded<netw::messaging_service>& ms, locator::token_metadata_ptr tmptr, std::unordered_set<dht::token> bootstrap_tokens);
future<> decommission_with_repair(seastar::sharded<database>& db, seastar::sharded<netw::messaging_service>& ms, locator::token_metadata_ptr tmptr);
future<> removenode_with_repair(seastar::sharded<database>& db, seastar::sharded<netw::messaging_service>& ms, locator::token_metadata_ptr tmptr, gms::inet_address leaving_node, shared_ptr<node_ops_info> ops);
future<> removenode_with_repair(seastar::sharded<database>& db, seastar::sharded<netw::messaging_service>& ms, locator::token_metadata_ptr tmptr, gms::inet_address leaving_node);
future<> rebuild_with_repair(seastar::sharded<database>& db, seastar::sharded<netw::messaging_service>& ms, locator::token_metadata_ptr tmptr, sstring source_dc);
future<> replace_with_repair(seastar::sharded<database>& db, seastar::sharded<netw::messaging_service>& ms, locator::token_metadata_ptr tmptr, std::unordered_set<dht::token> replacing_tokens);
future<> abort_repair_node_ops(utils::UUID ops_uuid);
// NOTE: repair_start() can be run on any node, but starts a node-global
// operation.
// repair_start() starts the requested repair on this node. It returns an
@@ -253,7 +244,6 @@ public:
bool _row_level_repair;
uint64_t _sub_ranges_nr = 0;
std::unordered_set<sstring> dropped_tables;
std::optional<utils::UUID> _ops_uuid;
public:
repair_info(seastar::sharded<database>& db_,
seastar::sharded<netw::messaging_service>& ms_,
@@ -263,8 +253,7 @@ public:
repair_uniq_id id_,
const std::vector<sstring>& data_centers_,
const std::vector<sstring>& hosts_,
streaming::stream_reason reason_,
std::optional<utils::UUID> ops_uuid);
streaming::stream_reason reason_);
future<> do_streaming();
void check_failed_ranges();
future<> request_transfer_ranges(const sstring& cf,
@@ -283,9 +272,6 @@ public:
const std::vector<sstring>& table_names() {
return cfs;
}
const std::optional<utils::UUID>& ops_uuid() const {
return _ops_uuid;
};
};
// The repair_tracker tracks ongoing repair operations and their progress.
@@ -338,7 +324,6 @@ public:
future<> run(repair_uniq_id id, std::function<void ()> func);
future<repair_status> repair_await_completion(int id, std::chrono::steady_clock::time_point timeout);
float report_progress(streaming::stream_reason reason);
void abort_repair_node_ops(utils::UUID ops_uuid);
};
future<uint64_t> estimate_partitions(seastar::sharded<database>& db, const sstring& keyspace,
@@ -479,27 +464,6 @@ enum class row_level_diff_detect_algorithm : uint8_t {
std::ostream& operator<<(std::ostream& out, row_level_diff_detect_algorithm algo);
enum class node_ops_cmd : uint32_t {
removenode_prepare,
removenode_heartbeat,
removenode_sync_data,
removenode_abort,
removenode_done,
};
// The cmd and ops_uuid are mandatory for each request.
// The ignore_nodes and leaving_node are optional.
struct node_ops_cmd_request {
node_ops_cmd cmd;
utils::UUID ops_uuid;
std::list<gms::inet_address> ignore_nodes;
std::list<gms::inet_address> leaving_nodes;
};
struct node_ops_cmd_response {
bool ok;
};
namespace std {
template<>
struct hash<partition_checksum> {

Submodule seastar updated: a287bb1a39...2c884a7449


@@ -208,8 +208,9 @@ future<> service::client_state::has_access(const sstring& ks, auth::command_desc
if (cdc_topology_description_forbidden_permissions.contains(cmd.permission)) {
if (ks == db::system_distributed_keyspace::NAME
&& (resource_view.table() == db::system_distributed_keyspace::CDC_DESC
|| resource_view.table() == db::system_distributed_keyspace::CDC_TOPOLOGY_DESCRIPTION)) {
&& (resource_view.table() == db::system_distributed_keyspace::CDC_DESC_V2
|| resource_view.table() == db::system_distributed_keyspace::CDC_TOPOLOGY_DESCRIPTION
|| resource_view.table() == db::system_distributed_keyspace::CDC_TIMESTAMPS)) {
throw exceptions::unauthorized_exception(
format("Cannot {} {}", auth::permissions::to_string(cmd.permission), cmd.resource));
}


@@ -4931,10 +4931,12 @@ void storage_proxy::init_messaging_service() {
tracing::trace(trace_state_ptr, "read_data: message received from /{}", src_addr.addr);
}
auto da = oda.value_or(query::digest_algorithm::MD5);
auto sp = get_local_shared_storage_proxy();
if (!cmd.max_result_size) {
cmd.max_result_size.emplace(cinfo.retrieve_auxiliary<uint64_t>("max_result_size"));
auto& cfg = sp->local_db().get_config();
cmd.max_result_size.emplace(cfg.max_memory_for_unlimited_query_soft_limit(), cfg.max_memory_for_unlimited_query_hard_limit());
}
return do_with(std::move(pr), get_local_shared_storage_proxy(), std::move(trace_state_ptr), [&cinfo, cmd = make_lw_shared<query::read_command>(std::move(cmd)), src_addr = std::move(src_addr), da, t] (::compat::wrapping_partition_range& pr, shared_ptr<storage_proxy>& p, tracing::trace_state_ptr& trace_state_ptr) mutable {
return do_with(std::move(pr), std::move(sp), std::move(trace_state_ptr), [&cinfo, cmd = make_lw_shared<query::read_command>(std::move(cmd)), src_addr = std::move(src_addr), da, t] (::compat::wrapping_partition_range& pr, shared_ptr<storage_proxy>& p, tracing::trace_state_ptr& trace_state_ptr) mutable {
p->get_stats().replica_data_reads++;
auto src_ip = src_addr.addr;
return get_schema_for_read(cmd->schema_version, std::move(src_addr), p->_messaging).then([cmd, da, &pr, &p, &trace_state_ptr, t] (schema_ptr s) {


@@ -107,7 +107,6 @@ storage_service::storage_service(abort_source& abort_source, distributed<databas
, _service_memory_total(config.available_memory / 10)
, _service_memory_limiter(_service_memory_total)
, _for_testing(for_testing)
, _node_ops_abort_thread(node_ops_abort_thread())
, _shared_token_metadata(stm)
, _sys_dist_ks(sys_dist_ks)
, _view_update_generator(view_update_generator)
@@ -549,6 +548,10 @@ void storage_service::join_token_ring(int delay) {
if (!db::system_keyspace::bootstrap_complete()) {
// If we're not bootstrapping nor replacing, then we shouldn't have chosen a CDC streams timestamp yet.
assert(should_bootstrap() || db().local().is_replacing() || !_cdc_streams_ts);
// Don't try rewriting CDC stream description tables.
// See cdc.md design notes, `Streams description table V1 and rewriting` section, for explanation.
db::system_keyspace::cdc_set_rewritten(std::nullopt).get();
}
if (!_cdc_streams_ts) {
@@ -605,6 +608,14 @@ void storage_service::join_token_ring(int delay) {
// Retrieve the latest CDC generation seen in gossip (if any).
scan_cdc_generations();
// Ensure that the new CDC stream description table has all required streams.
// See the function's comment for details.
cdc::maybe_rewrite_streams_descriptions(
_db.local(), _sys_dist_ks.local_shared(),
[tm = get_token_metadata_ptr()] { return tm->count_normal_token_owners(); },
_abort_source).get();
}
void storage_service::mark_existing_views_as_built() {
@@ -741,7 +752,8 @@ void storage_service::handle_cdc_generation(std::optional<db_clock::time_point>
return;
}
if (!db::system_keyspace::bootstrap_complete() || !_sys_dist_ks.local_is_initialized()) {
if (!db::system_keyspace::bootstrap_complete() || !_sys_dist_ks.local_is_initialized()
|| !_sys_dist_ks.local().started()) {
// We still haven't finished the startup process.
// We will handle this generation in `scan_cdc_generations` (unless there's a newer one).
return;
@@ -1544,12 +1556,12 @@ void storage_service::set_gossip_tokens(
void storage_service::register_subscriber(endpoint_lifecycle_subscriber* subscriber)
{
_lifecycle_subscribers.emplace_back(subscriber);
_lifecycle_subscribers.add(subscriber);
}
void storage_service::unregister_subscriber(endpoint_lifecycle_subscriber* subscriber)
future<> storage_service::unregister_subscriber(endpoint_lifecycle_subscriber* subscriber) noexcept
{
_lifecycle_subscribers.erase(std::remove(_lifecycle_subscribers.begin(), _lifecycle_subscribers.end(), subscriber), _lifecycle_subscribers.end());
return _lifecycle_subscribers.remove(subscriber);
}
static std::optional<future<>> drain_in_progress;
@@ -1597,8 +1609,9 @@ future<> storage_service::drain_on_shutdown() {
get_storage_proxy().invoke_on_all([] (storage_proxy& local_proxy) mutable {
auto& ss = service::get_local_storage_service();
ss.unregister_subscriber(&local_proxy);
return ss.unregister_subscriber(&local_proxy).finally([&local_proxy] {
return local_proxy.drain_on_shutdown();
});
}).get();
slogger.info("Drain on shutdown: hints manager is stopped");
@@ -1750,12 +1763,9 @@ future<> storage_service::gossip_sharder() {
future<> storage_service::stop() {
// make sure nobody uses the semaphore
node_ops_singal_abort(std::nullopt);
return _service_memory_limiter.wait(_service_memory_total).finally([this] {
_listeners.clear();
return _schema_version_publisher.join();
}).finally([this] {
return std::move(_node_ops_abort_thread);
});
}
@@ -2177,192 +2187,102 @@ future<> storage_service::decommission() {
});
}
future<> storage_service::removenode(sstring host_id_string, std::list<gms::inet_address> ignore_nodes) {
return run_with_api_lock(sstring("removenode"), [host_id_string, ignore_nodes = std::move(ignore_nodes)] (storage_service& ss) mutable {
return seastar::async([&ss, host_id_string, ignore_nodes = std::move(ignore_nodes)] {
auto uuid = utils::make_random_uuid();
auto tmptr = ss.get_token_metadata_ptr();
future<> storage_service::removenode(sstring host_id_string) {
return run_with_api_lock(sstring("removenode"), [host_id_string] (storage_service& ss) mutable {
return seastar::async([&ss, host_id_string] {
slogger.debug("removenode: host_id = {}", host_id_string);
auto my_address = ss.get_broadcast_address();
auto tmlock = std::make_unique<token_metadata_lock>(ss.get_token_metadata_lock().get0());
auto tmptr = ss.get_mutable_token_metadata_ptr().get0();
auto local_host_id = tmptr->get_host_id(my_address);
auto host_id = utils::UUID(host_id_string);
auto endpoint_opt = tmptr->get_endpoint_for_host_id(host_id);
if (!endpoint_opt) {
throw std::runtime_error(format("removenode[{}]: Host ID not found in the cluster", uuid));
throw std::runtime_error("Host ID not found.");
}
auto endpoint = *endpoint_opt;
auto tokens = tmptr->get_tokens(endpoint);
auto leaving_nodes = std::list<gms::inet_address>{endpoint};
future<> heartbeat_updater = make_ready_future<>();
auto heartbeat_updater_done = make_lw_shared<bool>(false);
slogger.debug("removenode: endpoint = {}", endpoint);
// Step 1: Decide who needs to sync data
//
// By default, we require all nodes in the cluster to participate in
// the removenode operation and to sync data if needed. We fail the
// removenode operation if any of them is down or fails.
//
// If the user wants the removenode operation to succeed even if some of the
// nodes are not available, the user has to explicitly pass a list of
// nodes that can be skipped for the operation.
std::vector<gms::inet_address> nodes;
for (const auto& x : tmptr->get_endpoint_to_host_id_map_for_reading()) {
seastar::thread::maybe_yield();
if (x.first != endpoint && std::find(ignore_nodes.begin(), ignore_nodes.end(), x.first) == ignore_nodes.end()) {
nodes.push_back(x.first);
if (endpoint == my_address) {
throw std::runtime_error("Cannot remove self");
}
if (ss._gossiper.get_live_members().contains(endpoint)) {
throw std::runtime_error(format("Node {} is alive and owns this ID. Use decommission command to remove it from the ring", endpoint));
}
// A leaving endpoint that is dead is already being removed.
if (tmptr->is_leaving(endpoint)) {
slogger.warn("Node {} is already being removed, continuing removal anyway", endpoint);
}
if (!ss._replicating_nodes.empty()) {
throw std::runtime_error("This node is already processing a removal. Wait for it to complete, or use 'removenode force' if this has failed.");
}
auto non_system_keyspaces = ss.db().local().get_non_system_keyspaces();
// Find the endpoints that are going to become responsible for data
for (const auto& keyspace_name : non_system_keyspaces) {
auto& ks = ss.db().local().find_keyspace(keyspace_name);
// if the replication factor is 1 the data is lost so we shouldn't wait for confirmation
if (ks.get_replication_strategy().get_replication_factor() == 1) {
slogger.warn("keyspace={} has replication factor 1, the data is probably lost", keyspace_name);
continue;
}
// get all ranges that change ownership (that is, a node needs
// to take responsibility for new range)
std::unordered_multimap<dht::token_range, inet_address> changed_ranges =
ss.get_changed_ranges_for_leaving(keyspace_name, endpoint);
for (auto& x: changed_ranges) {
auto ep = x.second;
if (ss._gossiper.is_alive(ep)) {
ss._replicating_nodes.emplace(ep);
} else {
slogger.warn("Endpoint {} is down and will not receive data for re-replication of {}", ep, endpoint);
}
}
}
slogger.info("removenode[{}]: Started removenode operation, removing node={}, sync_nodes={}, ignore_nodes={}", uuid, endpoint, nodes, ignore_nodes);
slogger.info("removenode: endpoint = {}, replicating_nodes = {}", endpoint, ss._replicating_nodes);
ss._removing_node = endpoint;
tmptr->add_leaving_endpoint(endpoint);
ss.update_pending_ranges(tmptr, format("removenode {}", endpoint)).get();
ss.replicate_to_all_cores(std::move(tmptr)).get();
tmlock.reset();
// Step 2: Prepare to sync data
std::unordered_set<gms::inet_address> nodes_unknown_verb;
std::unordered_set<gms::inet_address> nodes_down;
auto req = node_ops_cmd_request{node_ops_cmd::removenode_prepare, uuid, ignore_nodes, leaving_nodes};
try {
parallel_for_each(nodes, [&ss, &req, &nodes_unknown_verb, &nodes_down, uuid] (const gms::inet_address& node) {
return ss._messaging.local().send_node_ops_cmd(netw::msg_addr(node), req).then([uuid, node] (node_ops_cmd_response resp) {
slogger.debug("removenode[{}]: Got prepare response from node={}", uuid, node);
}).handle_exception_type([&nodes_unknown_verb, node, uuid] (seastar::rpc::unknown_verb_error&) {
slogger.warn("removenode[{}]: Node {} does not support removenode verb", uuid, node);
nodes_unknown_verb.emplace(node);
}).handle_exception_type([&nodes_down, node, uuid] (seastar::rpc::closed_error&) {
slogger.warn("removenode[{}]: Node {} is down for node_ops_cmd verb", uuid, node);
nodes_down.emplace(node);
});
}).get();
if (!nodes_unknown_verb.empty()) {
auto msg = format("removenode[{}]: Nodes={} do not support removenode verb. Please upgrade your cluster and run removenode again.", uuid, nodes_unknown_verb);
slogger.warn("{}", msg);
throw std::runtime_error(msg);
}
if (!nodes_down.empty()) {
auto msg = format("removenode[{}]: Nodes={} needed for removenode operation are down. It is highly recommended to fix the down nodes and try again. To proceed with best-effort mode which might cause data inconsistency, run nodetool removenode --ignore-dead-nodes <list_of_dead_nodes> <host_id>. E.g., nodetool removenode --ignore-dead-nodes 127.0.0.1,127.0.0.2 817e9515-316f-4fe3-aaab-b00d6f12dddd", uuid, nodes_down);
slogger.warn("{}", msg);
throw std::runtime_error(msg);
}
// the gossiper will handle spoofing this node's state to REMOVING_TOKEN for us
// we add our own token so other nodes can let us know when they're done
ss._gossiper.advertise_removing(endpoint, host_id, local_host_id).get();
// Step 3: Start heartbeat updater
heartbeat_updater = seastar::async([&ss, &nodes, uuid, heartbeat_updater_done] {
slogger.debug("removenode[{}]: Started heartbeat_updater", uuid);
while (!(*heartbeat_updater_done)) {
auto req = node_ops_cmd_request{node_ops_cmd::removenode_heartbeat, uuid, {}, {}};
parallel_for_each(nodes, [&ss, &req, uuid] (const gms::inet_address& node) {
return ss._messaging.local().send_node_ops_cmd(netw::msg_addr(node), req).then([uuid, node] (node_ops_cmd_response resp) {
slogger.debug("removenode[{}]: Got heartbeat response from node={}", uuid, node);
return make_ready_future<>();
});
}).handle_exception([uuid] (std::exception_ptr ep) {
slogger.warn("removenode[{}]: Failed to send heartbeat", uuid);
}).get();
int nr_seconds = 10;
while (!(*heartbeat_updater_done) && nr_seconds--) {
sleep_abortable(std::chrono::seconds(1), ss._abort_source).get();
}
}
slogger.debug("removenode[{}]: Stopped heartbeat_updater", uuid);
});
auto stop_heartbeat_updater = defer([&] {
*heartbeat_updater_done = true;
heartbeat_updater.get();
});
// kick off streaming commands
// No need to wait for restore_replica_count to complete, since
// when it completes, the node will be removed from _replicating_nodes,
// and we wait for _replicating_nodes to become empty below
//FIXME: discarded future.
(void)ss.restore_replica_count(endpoint, my_address).handle_exception([endpoint, my_address] (auto ep) {
slogger.info("Failed to restore_replica_count for node {} on node {}", endpoint, my_address);
});
// Step 4: Start to sync data
req.cmd = node_ops_cmd::removenode_sync_data;
parallel_for_each(nodes, [&ss, &req, uuid] (const gms::inet_address& node) {
return ss._messaging.local().send_node_ops_cmd(netw::msg_addr(node), req).then([uuid, node] (node_ops_cmd_response resp) {
slogger.debug("removenode[{}]: Got sync_data response from node={}", uuid, node);
return make_ready_future<>();
});
}).get();
// Step 5: Announce the node has left
std::unordered_set<token> tmp(tokens.begin(), tokens.end());
ss.excise(std::move(tmp), endpoint);
ss._gossiper.advertise_token_removed(endpoint, host_id).get();
// Step 6: Finish
req.cmd = node_ops_cmd::removenode_done;
parallel_for_each(nodes, [&ss, &req, uuid] (const gms::inet_address& node) {
return ss._messaging.local().send_node_ops_cmd(netw::msg_addr(node), req).then([uuid, node] (node_ops_cmd_response resp) {
slogger.debug("removenode[{}]: Got done response from node={}", uuid, node);
return make_ready_future<>();
});
}).get();
slogger.info("removenode[{}]: Finished removenode operation, removing node={}, sync_nodes={}, ignore_nodes={}", uuid, endpoint, nodes, ignore_nodes);
} catch (...) {
// we need to revert the effect of the prepare verb if the removenode ops failed
req.cmd = node_ops_cmd::removenode_abort;
parallel_for_each(nodes, [&ss, &req, &nodes_unknown_verb, &nodes_down, uuid] (const gms::inet_address& node) {
if (nodes_unknown_verb.contains(node) || nodes_down.contains(node)) {
// No need to revert previous prepare cmd for those who do not apply prepare cmd.
return make_ready_future<>();
}
return ss._messaging.local().send_node_ops_cmd(netw::msg_addr(node), req).then([uuid, node] (node_ops_cmd_response resp) {
slogger.debug("removenode[{}]: Got abort response from node={}", uuid, node);
});
}).get();
slogger.info("removenode[{}]: Aborted removenode operation, removing node={}, sync_nodes={}, ignore_nodes={}", uuid, endpoint, nodes, ignore_nodes);
throw;
// wait for ReplicationFinishedVerbHandler to signal we're done
while (!(ss._replicating_nodes.empty() || ss._force_remove_completion)) {
sleep_abortable(std::chrono::milliseconds(100), ss._abort_source).get();
}
});
});
}
future<node_ops_cmd_response> storage_service::node_ops_cmd_handler(gms::inet_address coordinator, node_ops_cmd_request req) {
return get_storage_service().invoke_on(0, [coordinator, req = std::move(req)] (auto& ss) mutable {
return seastar::async([&ss, coordinator, req = std::move(req)] () mutable {
auto ops_uuid = req.ops_uuid;
slogger.debug("node_ops_cmd_handler cmd={}, ops_uuid={}", uint32_t(req.cmd), ops_uuid);
if (req.cmd == node_ops_cmd::removenode_prepare) {
if (req.leaving_nodes.size() > 1) {
auto msg = format("removenode[{}]: Cannot remove more than one node at a time: leaving_nodes={}", req.ops_uuid, req.leaving_nodes);
slogger.warn("{}", msg);
throw std::runtime_error(msg);
}
ss.mutate_token_metadata([coordinator, &req, &ss] (mutable_token_metadata_ptr tmptr) mutable {
for (auto& node : req.leaving_nodes) {
slogger.info("removenode[{}]: Added node={} as leaving node, coordinator={}", req.ops_uuid, node, coordinator);
tmptr->add_leaving_endpoint(node);
}
return ss.update_pending_ranges(tmptr, format("removenode {}", req.leaving_nodes));
}).get();
auto ops = seastar::make_shared<node_ops_info>(node_ops_info{ops_uuid, false, std::move(req.ignore_nodes)});
auto meta = node_ops_meta_data(ops_uuid, coordinator, std::move(ops), [&ss, coordinator, req = std::move(req)] () mutable {
return ss.mutate_token_metadata([&ss, coordinator, req = std::move(req)] (mutable_token_metadata_ptr tmptr) mutable {
for (auto& node : req.leaving_nodes) {
slogger.info("removenode[{}]: Removed node={} as leaving node, coordinator={}", req.ops_uuid, node, coordinator);
tmptr->del_leaving_endpoint(node);
}
return ss.update_pending_ranges(tmptr, format("removenode {}", req.leaving_nodes));
});
},
[&ss, ops_uuid] () mutable { ss.node_ops_singal_abort(ops_uuid); });
ss._node_ops.emplace(ops_uuid, std::move(meta));
} else if (req.cmd == node_ops_cmd::removenode_heartbeat) {
slogger.debug("removenode[{}]: Updated heartbeat from coordinator={}", req.ops_uuid, coordinator);
ss.node_ops_update_heartbeat(ops_uuid);
} else if (req.cmd == node_ops_cmd::removenode_done) {
slogger.info("removenode[{}]: Marked ops done from coordinator={}", req.ops_uuid, coordinator);
ss.node_ops_done(ops_uuid);
} else if (req.cmd == node_ops_cmd::removenode_sync_data) {
auto it = ss._node_ops.find(ops_uuid);
if (it == ss._node_ops.end()) {
throw std::runtime_error(format("removenode[{}]: Cannot find ops_uuid={}", ops_uuid, ops_uuid));
}
auto ops = it->second.get_ops_info();
for (auto& node : req.leaving_nodes) {
slogger.info("removenode[{}]: Started to sync data for removing node={}, coordinator={}", req.ops_uuid, node, coordinator);
removenode_with_repair(ss._db, ss._messaging, ss.get_token_metadata_ptr(), node, ops).get();
}
} else if (req.cmd == node_ops_cmd::removenode_abort) {
ss.node_ops_abort(ops_uuid);
} else {
auto msg = format("node_ops_cmd_handler: ops_uuid={}, unknown cmd={}", req.ops_uuid, uint32_t(req.cmd));
slogger.warn("{}", msg);
throw std::runtime_error(msg);
if (ss._force_remove_completion) {
throw std::runtime_error("nodetool removenode force is called by user");
}
node_ops_cmd_response resp;
resp.ok = true;
return resp;
std::unordered_set<token> tmp(tokens.begin(), tokens.end());
ss.excise(std::move(tmp), endpoint);
// gossiper will indicate the token has left
ss._gossiper.advertise_token_removed(endpoint, host_id).get();
ss._replicating_nodes.clear();
ss._removing_node = std::nullopt;
});
});
}
@@ -2466,7 +2386,7 @@ future<> storage_service::rebuild(sstring source_dc) {
slogger.info("Streaming for rebuild successful");
}).handle_exception([] (auto ep) {
// This is used exclusively through JMX, so log the full trace but only throw a simple RTE
slogger.warn("Error while rebuilding node: {}", std::current_exception());
slogger.warn("Error while rebuilding node: {}", ep);
return make_exception_future<>(std::move(ep));
});
});
@@ -2606,9 +2526,7 @@ void storage_service::unbootstrap() {
future<> storage_service::restore_replica_count(inet_address endpoint, inet_address notify_endpoint) {
if (is_repair_based_node_ops_enabled()) {
auto ops_uuid = utils::make_random_uuid();
auto ops = seastar::make_shared<node_ops_info>(node_ops_info{ops_uuid, false, std::list<gms::inet_address>()});
return removenode_with_repair(_db, _messaging, get_token_metadata_ptr(), endpoint, ops).finally([this, notify_endpoint] () {
return removenode_with_repair(_db, _messaging, get_token_metadata_ptr(), endpoint).finally([this, notify_endpoint] () {
return send_replication_notification(notify_endpoint);
});
}
@@ -3233,13 +3151,13 @@ void storage_service::notify_down(inet_address endpoint) {
container().invoke_on_all([endpoint] (auto&& ss) {
ss._messaging.local().remove_rpc_client(netw::msg_addr{endpoint, 0});
return seastar::async([&ss, endpoint] {
for (auto&& subscriber : ss._lifecycle_subscribers) {
ss._lifecycle_subscribers.for_each([endpoint] (endpoint_lifecycle_subscriber* subscriber) {
try {
subscriber->on_down(endpoint);
} catch (...) {
slogger.warn("Down notification failed {}: {}", endpoint, std::current_exception());
}
}
});
});
}).get();
slogger.debug("Notify node {} has been down", endpoint);
@@ -3248,13 +3166,13 @@ void storage_service::notify_down(inet_address endpoint) {
void storage_service::notify_left(inet_address endpoint) {
container().invoke_on_all([endpoint] (auto&& ss) {
return seastar::async([&ss, endpoint] {
for (auto&& subscriber : ss._lifecycle_subscribers) {
ss._lifecycle_subscribers.for_each([endpoint] (endpoint_lifecycle_subscriber* subscriber) {
try {
subscriber->on_leave_cluster(endpoint);
} catch (...) {
slogger.warn("Leave cluster notification failed {}: {}", endpoint, std::current_exception());
}
}
});
});
}).get();
slogger.debug("Notify node {} has left the cluster", endpoint);
@@ -3267,13 +3185,13 @@ void storage_service::notify_up(inet_address endpoint)
}
container().invoke_on_all([endpoint] (auto&& ss) {
return seastar::async([&ss, endpoint] {
for (auto&& subscriber : ss._lifecycle_subscribers) {
ss._lifecycle_subscribers.for_each([endpoint] (endpoint_lifecycle_subscriber* subscriber) {
try {
subscriber->on_up(endpoint);
} catch (...) {
slogger.warn("Up notification failed {}: {}", endpoint, std::current_exception());
}
}
});
});
}).get();
slogger.debug("Notify node {} has been up", endpoint);
@@ -3287,13 +3205,13 @@ void storage_service::notify_joined(inet_address endpoint)
container().invoke_on_all([endpoint] (auto&& ss) {
return seastar::async([&ss, endpoint] {
for (auto&& subscriber : ss._lifecycle_subscribers) {
ss._lifecycle_subscribers.for_each([endpoint] (endpoint_lifecycle_subscriber* subscriber) {
try {
subscriber->on_join_cluster(endpoint);
} catch (...) {
slogger.warn("Join cluster notification failed {}: {}", endpoint, std::current_exception());
}
}
});
});
}).get();
slogger.debug("Notify node {} has joined the cluster", endpoint);
@@ -3323,111 +3241,5 @@ bool storage_service::is_repair_based_node_ops_enabled() {
return _db.local().get_config().enable_repair_based_node_ops();
}
node_ops_meta_data::node_ops_meta_data(
utils::UUID ops_uuid,
gms::inet_address coordinator,
shared_ptr<node_ops_info> ops,
std::function<future<> ()> abort_func,
std::function<void ()> signal_func)
: _ops_uuid(std::move(ops_uuid))
, _coordinator(std::move(coordinator))
, _abort(std::move(abort_func))
, _signal(std::move(signal_func))
, _ops(std::move(ops))
, _watchdog([sig = _signal] { sig(); }) {
_watchdog.arm(_watchdog_interval);
}
future<> node_ops_meta_data::abort() {
slogger.debug("node_ops_meta_data: ops_uuid={} abort", _ops_uuid);
_aborted = true;
if (_ops) {
_ops->abort = true;
}
_watchdog.cancel();
return _abort();
}
void node_ops_meta_data::update_watchdog() {
slogger.debug("node_ops_meta_data: ops_uuid={} update_watchdog", _ops_uuid);
if (_aborted) {
return;
}
_watchdog.cancel();
_watchdog.arm(_watchdog_interval);
}
void node_ops_meta_data::cancel_watchdog() {
slogger.debug("node_ops_meta_data: ops_uuid={} cancel_watchdog", _ops_uuid);
_watchdog.cancel();
}
shared_ptr<node_ops_info> node_ops_meta_data::get_ops_info() {
return _ops;
}
void storage_service::node_ops_update_heartbeat(utils::UUID ops_uuid) {
slogger.debug("node_ops_update_heartbeat: ops_uuid={}", ops_uuid);
auto permit = seastar::get_units(_node_ops_abort_sem, 1);
auto it = _node_ops.find(ops_uuid);
if (it != _node_ops.end()) {
node_ops_meta_data& meta = it->second;
meta.update_watchdog();
}
}
void storage_service::node_ops_done(utils::UUID ops_uuid) {
slogger.debug("node_ops_done: ops_uuid={}", ops_uuid);
auto permit = seastar::get_units(_node_ops_abort_sem, 1);
auto it = _node_ops.find(ops_uuid);
if (it != _node_ops.end()) {
node_ops_meta_data& meta = it->second;
meta.cancel_watchdog();
_node_ops.erase(it);
}
}
void storage_service::node_ops_abort(utils::UUID ops_uuid) {
slogger.debug("node_ops_abort: ops_uuid={}", ops_uuid);
auto permit = seastar::get_units(_node_ops_abort_sem, 1);
auto it = _node_ops.find(ops_uuid);
if (it != _node_ops.end()) {
node_ops_meta_data& meta = it->second;
meta.abort().get();
abort_repair_node_ops(ops_uuid).get();
_node_ops.erase(it);
}
}
void storage_service::node_ops_singal_abort(std::optional<utils::UUID> ops_uuid) {
slogger.debug("node_ops_singal_abort: ops_uuid={}", ops_uuid);
_node_ops_abort_queue.push_back(ops_uuid);
_node_ops_abort_cond.signal();
}
future<> storage_service::node_ops_abort_thread() {
return seastar::async([this] {
slogger.info("Started node_ops_abort_thread");
for (;;) {
_node_ops_abort_cond.wait([this] { return !_node_ops_abort_queue.empty(); }).get();
slogger.debug("Awoke node_ops_abort_thread: node_ops_abort_queue={}", _node_ops_abort_queue);
while (!_node_ops_abort_queue.empty()) {
auto uuid_opt = _node_ops_abort_queue.front();
_node_ops_abort_queue.pop_front();
if (!uuid_opt) {
return;
}
try {
storage_service::node_ops_abort(*uuid_opt);
} catch (...) {
slogger.warn("Failed to abort node operation ops_uuid={}: {}", *uuid_opt, std::current_exception());
}
}
}
slogger.info("Stopped node_ops_abort_thread");
});
}
} // namespace service
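
The abort thread above is a condition-variable consumer loop: a `nullopt` queue entry is a stop sentinel, and a failure to abort one operation is logged rather than propagated, so it cannot kill the loop. A small Python model of the drain step (hypothetical names, illustration only):

```python
from collections import deque

def drain_abort_queue(queue, abort_fn, failures):
    """Drain one batch of ops; return False when the stop sentinel
    (None) is seen. Exceptions from abort_fn are recorded, not raised,
    mirroring node_ops_abort_thread's warn-and-continue behavior."""
    while queue:
        uuid = queue.popleft()
        if uuid is None:
            return False  # stop requested; leave the rest unprocessed
        try:
            abort_fn(uuid)
        except Exception as exc:
            failures.append((uuid, exc))
    return True  # queue drained; caller waits for more work
```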


@@ -63,12 +63,6 @@
#include <seastar/core/rwlock.hh>
#include "sstables/version.hh"
#include "cdc/metadata.hh"
#include <seastar/core/shared_ptr.hh>
#include <seastar/core/lowres_clock.hh>
class node_ops_cmd_request;
class node_ops_cmd_response;
class node_ops_info;
namespace cql_transport { class controller; }
@@ -109,28 +103,6 @@ struct storage_service_config {
size_t available_memory;
};
class node_ops_meta_data {
utils::UUID _ops_uuid;
gms::inet_address _coordinator;
std::function<future<> ()> _abort;
std::function<void ()> _signal;
shared_ptr<node_ops_info> _ops;
seastar::timer<lowres_clock> _watchdog;
std::chrono::seconds _watchdog_interval{30};
bool _aborted = false;
public:
explicit node_ops_meta_data(
utils::UUID ops_uuid,
gms::inet_address coordinator,
shared_ptr<node_ops_info> ops,
std::function<future<> ()> abort_func,
std::function<void ()> signal_func);
shared_ptr<node_ops_info> get_ops_info();
future<> abort();
void update_watchdog();
void cancel_watchdog();
};
/**
* This abstraction contains the token/identifier of this node
* on the identifier space. This token gets gossiped around.
@@ -186,17 +158,6 @@ private:
* and would only slow down tests (by having them wait).
*/
bool _for_testing;
std::unordered_map<utils::UUID, node_ops_meta_data> _node_ops;
std::list<std::optional<utils::UUID>> _node_ops_abort_queue;
seastar::condition_variable _node_ops_abort_cond;
named_semaphore _node_ops_abort_sem{1, named_semaphore_exception_factory{"node_ops_abort_sem"}};
future<> _node_ops_abort_thread;
void node_ops_update_heartbeat(utils::UUID ops_uuid);
void node_ops_done(utils::UUID ops_uuid);
void node_ops_abort(utils::UUID ops_uuid);
void node_ops_singal_abort(std::optional<utils::UUID> ops_uuid);
future<> node_ops_abort_thread();
public:
storage_service(abort_source& as, distributed<database>& db, gms::gossiper& gossiper, sharded<db::system_distributed_keyspace>&, sharded<db::view::view_update_generator>&, gms::feature_service& feature_service, storage_service_config config, sharded<service::migration_notifier>& mn, locator::shared_token_metadata& stm, sharded<netw::messaging_service>& ms, /* only for tests */ bool for_testing = false);
@@ -315,7 +276,7 @@ private:
drain_progress _drain_progress{};
std::vector<endpoint_lifecycle_subscriber*> _lifecycle_subscribers;
atomic_vector<endpoint_lifecycle_subscriber*> _lifecycle_subscribers;
std::unordered_set<token> _bootstrap_tokens;
@@ -336,7 +297,7 @@ public:
void register_subscriber(endpoint_lifecycle_subscriber* subscriber);
void unregister_subscriber(endpoint_lifecycle_subscriber* subscriber);
future<> unregister_subscriber(endpoint_lifecycle_subscriber* subscriber) noexcept;
// should only be called via JMX
future<> stop_gossiping();
@@ -810,8 +771,7 @@ public:
*
* @param hostIdString token for the node
*/
future<> removenode(sstring host_id_string, std::list<gms::inet_address> ignore_nodes);
future<node_ops_cmd_response> node_ops_cmd_handler(gms::inet_address coordinator, node_ops_cmd_request req);
future<> removenode(sstring host_id_string);
future<sstring> get_operation_mode();


@@ -310,6 +310,7 @@ struct compaction_read_monitor_generator final : public read_monitor_generator {
_generated_monitors.emplace_back(std::move(sst), _compaction_manager, _cf);
return _generated_monitors.back();
}
compaction_read_monitor_generator(compaction_manager& cm, column_family& cf)
: _compaction_manager(cm)
, _cf(cf) {}
@@ -570,6 +571,7 @@ private:
// Do not actually compact a sstable that is fully expired and can be safely
// dropped without resurrecting old data.
if (tombstone_expiration_enabled() && fully_expired.contains(sst)) {
on_skipped_expired_sstable(sst);
continue;
}
@@ -676,6 +678,9 @@ private:
virtual void on_end_of_compaction() {};
// Inform about every expired sstable that was skipped during setup phase
virtual void on_skipped_expired_sstable(shared_sstable sstable) {}
// create a writer based on decorated key.
virtual compaction_writer create_compaction_writer(const dht::decorated_key& dk) = 0;
// stop current writer
@@ -919,6 +924,12 @@ public:
}
replace_remaining_exhausted_sstables();
}
virtual void on_skipped_expired_sstable(shared_sstable sstable) override {
// manually register expired sstable into monitor, as it's not actually being compacted;
// this allows the expired sstable to be removed from the tracker once compaction completes
_monitor_generator(std::move(sstable));
}
private:
void backlog_tracker_incrementally_adjust_charges(std::vector<shared_sstable> exhausted_sstables) {
//


@@ -629,10 +629,11 @@ future<> compaction_manager::rewrite_sstables(column_family* cf, sstables::compa
_tasks.push_back(task);
auto sstables = std::make_unique<std::vector<sstables::shared_sstable>>(get_func(*cf));
auto compacting = make_lw_shared<compacting_sstable_registration>(this, *sstables);
auto sstables_ptr = sstables.get();
_stats.pending_tasks += sstables->size();
task->compaction_done = do_until([sstables_ptr] { return sstables_ptr->empty(); }, [this, task, options, sstables_ptr] () mutable {
task->compaction_done = do_until([sstables_ptr] { return sstables_ptr->empty(); }, [this, task, options, sstables_ptr, compacting] () mutable {
// FIXME: lock cf here
if (!can_proceed(task)) {
@@ -642,7 +643,7 @@ future<> compaction_manager::rewrite_sstables(column_family* cf, sstables::compa
auto sst = sstables_ptr->back();
sstables_ptr->pop_back();
return repeat([this, task, options, sst = std::move(sst)] () mutable {
return repeat([this, task, options, sst = std::move(sst), compacting] () mutable {
column_family& cf = *task->compacting_cf;
auto sstable_level = sst->get_sstable_level();
auto run_identifier = sst->run_identifier();
@@ -650,21 +651,22 @@ future<> compaction_manager::rewrite_sstables(column_family* cf, sstables::compa
auto descriptor = sstables::compaction_descriptor({ sst }, cf.get_sstable_set(), service::get_local_compaction_priority(),
sstable_level, sstables::compaction_descriptor::default_max_sstable_bytes, run_identifier, options);
auto compacting = make_lw_shared<compacting_sstable_registration>(this, descriptor.sstables);
// Releases reference to cleaned sstable such that respective used disk space can be freed.
descriptor.release_exhausted = [compacting] (const std::vector<sstables::shared_sstable>& exhausted_sstables) {
compacting->release_compacting(exhausted_sstables);
};
_stats.pending_tasks--;
_stats.active_tasks++;
task->compaction_running = true;
compaction_backlog_tracker user_initiated(std::make_unique<user_initiated_backlog_tracker>(_compaction_controller.backlog_of_shares(200), _available_memory));
return do_with(std::move(user_initiated), [this, &cf, descriptor = std::move(descriptor)] (compaction_backlog_tracker& bt) mutable {
return with_scheduling_group(_scheduling_group, [this, &cf, descriptor = std::move(descriptor)] () mutable {
return cf.run_compaction(std::move(descriptor));
return with_semaphore(_rewrite_sstables_sem, 1, [this, task, &cf, descriptor = std::move(descriptor)] () mutable {
_stats.pending_tasks--;
_stats.active_tasks++;
task->compaction_running = true;
compaction_backlog_tracker user_initiated(std::make_unique<user_initiated_backlog_tracker>(_compaction_controller.backlog_of_shares(200), _available_memory));
return do_with(std::move(user_initiated), [this, &cf, descriptor = std::move(descriptor)] (compaction_backlog_tracker& bt) mutable {
return with_scheduling_group(_scheduling_group, [this, &cf, descriptor = std::move(descriptor)]() mutable {
return cf.run_compaction(std::move(descriptor));
});
});
}).then_wrapped([this, task, compacting = std::move(compacting)] (future<> f) mutable {
}).then_wrapped([this, task, compacting] (future<> f) mutable {
task->compaction_running = false;
_stats.active_tasks--;
if (!can_proceed(task)) {
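
The hunk above moves the pending-to-active bookkeeping under `with_semaphore(_rewrite_sstables_sem, 1, ...)`, so at most one sstable rewrite runs at a time. A threaded Python sketch of that shape (hypothetical names; the real code is Seastar futures, not threads):

```python
import threading

class RewriteStats:
    def __init__(self, pending):
        self.pending = pending
        self.active = 0
        self.max_active = 0  # high-water mark, to observe the limit

def rewrite_one(sem, stats, compact):
    # As in the hunk: the pending -> active transition happens only
    # once the semaphore unit is held, so rewrites are serialized.
    with sem:
        stats.pending -= 1
        stats.active += 1
        stats.max_active = max(stats.max_active, stats.active)
        try:
            compact()
        finally:
            stats.active -= 1
```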


@@ -111,6 +111,7 @@ private:
std::unordered_map<column_family*, rwlock> _compaction_locks;
semaphore _custom_job_sem{1};
seastar::named_semaphore _rewrite_sstables_sem = {1, named_semaphore_exception_factory{"rewrite sstables"}};
std::function<void()> compaction_submission_callback();
// all registered column families are submitted for compaction at a constant interval.


@@ -178,7 +178,7 @@ leveled_compaction_strategy::get_reshaping_job(std::vector<shared_sstable> input
unsigned max_filled_level = 0;
size_t offstrategy_threshold = std::max(schema->min_compaction_threshold(), 4);
size_t offstrategy_threshold = (mode == reshape_mode::strict) ? std::max(schema->min_compaction_threshold(), 4) : std::max(schema->max_compaction_threshold(), 32);
size_t max_sstables = std::max(schema->max_compaction_threshold(), int(offstrategy_threshold));
auto tolerance = [mode] (unsigned level) -> unsigned {
if (mode == reshape_mode::strict) {
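
The threshold change above reads as a small function: strict reshape jobs keep the old, eager threshold, while the relaxed mode waits for far more sstables before triggering an off-strategy reshape. A sketch with hypothetical names:

```python
def offstrategy_threshold(min_compaction_threshold, max_compaction_threshold, strict):
    # Mirrors the leveled-strategy hunk above: strict mode keeps the
    # low threshold; relaxed mode batches many more sstables per job.
    if strict:
        return max(min_compaction_threshold, 4)
    return max(max_compaction_threshold, 32)
```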


@@ -162,7 +162,7 @@ time_window_compaction_strategy::get_reshaping_job(std::vector<shared_sstable> i
for (auto& pair : all_buckets.first) {
auto ssts = std::move(pair.second);
if (ssts.size() > offstrategy_threshold) {
ssts.resize(std::min(multi_window.size(), max_sstables));
ssts.resize(std::min(ssts.size(), max_sstables));
compaction_descriptor desc(std::move(ssts), std::optional<sstables::sstable_set>(), iop);
desc.options = compaction_options::make_reshape();
return desc;


@@ -380,7 +380,7 @@ future<prepare_message> stream_session::prepare(std::vector<stream_request> requ
try {
db.find_column_family(ks, cf);
} catch (no_such_column_family&) {
auto err = format("[Stream #{}] prepare requested ks={} cf={} does not exist", ks, cf);
auto err = format("[Stream #{}] prepare requested ks={} cf={} does not exist", plan_id, ks, cf);
sslog.warn(err.c_str());
throw std::runtime_error(err);
}


@@ -1677,7 +1677,8 @@ write_memtable_to_sstable(flat_mutation_reader reader,
const io_priority_class& pc) {
cfg.replay_position = mt.replay_position();
cfg.monitor = &monitor;
return sst->write_components(std::move(reader), mt.partition_count(), mt.schema(), cfg, mt.get_encoding_stats(), pc);
schema_ptr s = reader.schema();
return sst->write_components(std::move(reader), mt.partition_count(), s, cfg, mt.get_encoding_stats(), pc);
}
future<>


@@ -80,7 +80,7 @@ def dynamodb(request):
verify = not request.config.getoption('https')
return boto3.resource('dynamodb', endpoint_url=local_url, verify=verify,
region_name='us-east-1', aws_access_key_id='alternator', aws_secret_access_key='secret_pass',
config=botocore.client.Config(retries={"max_attempts": 3}))
config=botocore.client.Config(retries={"max_attempts": 0}, read_timeout=300))
@pytest.fixture(scope="session")
def dynamodbstreams(request):
@@ -230,3 +230,19 @@ def filled_test_table(dynamodb):
def scylla_only(dynamodb):
if dynamodb.meta.client._endpoint.host.endswith('.amazonaws.com'):
pytest.skip('Scylla-only feature not supported by AWS')
# The "test_table_s_forbid_rmw" fixture is the same as test_table_s, except
# with the "forbid_rmw" write isolation mode. This is useful for verifying
# that writes that we think should not need a read-before-write in fact do
# not need it.
# Because forbid_rmw is a Scylla-only feature, this test is skipped when not
# running against Scylla.
@pytest.fixture(scope="session")
def test_table_s_forbid_rmw(dynamodb, scylla_only):
table = create_test_table(dynamodb,
KeySchema=[ { 'AttributeName': 'p', 'KeyType': 'HASH' }, ],
AttributeDefinitions=[ { 'AttributeName': 'p', 'AttributeType': 'S' } ])
arn = table.meta.client.describe_table(TableName=table.name)['Table']['TableArn']
table.meta.client.tag_resource(ResourceArn=arn, Tags=[{'Key': 'system:write_isolation', 'Value': 'forbid_rmw'}])
yield table
table.delete()


@@ -136,7 +136,7 @@ def test_update_condition_eq_different(test_table_s):
ConditionExpression='a = :val2',
ExpressionAttributeValues={':val1': val1, ':val2': val2})
# Also check an actual case of same time, but inequality.
# Also check an actual case of same type, but inequality.
def test_update_condition_eq_unequal(test_table_s):
p = random_string()
test_table_s.update_item(Key={'p': p},
@@ -146,6 +146,13 @@ def test_update_condition_eq_unequal(test_table_s):
UpdateExpression='SET a = :val1',
ConditionExpression='a = :oldval',
ExpressionAttributeValues={':val1': 3, ':oldval': 2})
# If the attribute being compared doesn't exist, it's considered a failed
# condition, not an error:
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET a = :val1',
ConditionExpression='q = :oldval',
ExpressionAttributeValues={':val1': 3, ':oldval': 2})
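
The behavior these tests pin down — comparing an attribute that does not exist yields a failed condition (`ConditionalCheckFailedException`), never an error — can be modeled with a tiny evaluator (an illustrative sketch, not Alternator's implementation):

```python
def eval_eq(item, attr_name, expected):
    """DynamoDB-style '=' semantics: a missing attribute simply makes
    the condition false; only then does value equality matter."""
    if attr_name not in item:
        return False
    return item[attr_name] == expected
```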
# Check that set equality is checked correctly. Unlike string equality (for
# example), it cannot be done with just naive string comparison of the JSON
@@ -269,15 +276,44 @@ def test_update_condition_lt(test_table_s):
UpdateExpression='SET z = :newval',
ConditionExpression='a < :oldval',
ExpressionAttributeValues={':newval': 2, ':oldval': '17'})
# Trying to compare an unsupported type - e.g., in the following test
# a boolean, is unfortunately caught by boto3 and cannot be tested here...
#test_table_s.update_item(Key={'p': p},
# AttributeUpdates={'d': {'Value': False, 'Action': 'PUT'}})
#with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
# test_table_s.update_item(Key={'p': p},
# UpdateExpression='SET z = :newval',
# ConditionExpression='d < :oldval',
# ExpressionAttributeValues={':newval': 2, ':oldval': True})
# If the attribute being compared doesn't even exist, this is also
# considered as a false condition - not an error.
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression='q < :oldval',
ExpressionAttributeValues={':newval': 2, ':oldval': '17'})
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression=':oldval < q',
ExpressionAttributeValues={':newval': 2, ':oldval': '17'})
# If a comparison parameter comes from a constant specified in the query,
# and it has a type not supported by the comparison (e.g., a list), it's
# not just a failed comparison - it is considered a ValidationException
with pytest.raises(ClientError, match='ValidationException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression='a < :oldval',
ExpressionAttributeValues={':newval': 2, ':oldval': [1,2]})
with pytest.raises(ClientError, match='ValidationException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression=':oldval < a',
ExpressionAttributeValues={':newval': 2, ':oldval': [1,2]})
# However, when the wrong type comes from an item attribute, not the
# query, the comparison is simply false - not a ValidationException.
test_table_s.update_item(Key={'p': p}, AttributeUpdates={'x': {'Value': [1,2,3], 'Action': 'PUT'}})
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression='x < :oldval',
ExpressionAttributeValues={':newval': 2, ':oldval': 1})
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression=':oldval < x',
ExpressionAttributeValues={':newval': 2, ':oldval': 1})
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item']['z'] == 4
# Test for ConditionExpression with operator "<="
@@ -341,6 +377,44 @@ def test_update_condition_le(test_table_s):
UpdateExpression='SET z = :newval',
ConditionExpression='a <= :oldval',
ExpressionAttributeValues={':newval': 2, ':oldval': '17'})
# If the attribute being compared doesn't even exist, this is also
# considered as a false condition - not an error.
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression='q <= :oldval',
ExpressionAttributeValues={':newval': 2, ':oldval': '17'})
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression=':oldval <= q',
ExpressionAttributeValues={':newval': 2, ':oldval': '17'})
# If a comparison parameter comes from a constant specified in the query,
# and it has a type not supported by the comparison (e.g., a list), it's
# not just a failed comparison - it is considered a ValidationException
with pytest.raises(ClientError, match='ValidationException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression='a <= :oldval',
ExpressionAttributeValues={':newval': 2, ':oldval': [1,2]})
with pytest.raises(ClientError, match='ValidationException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression=':oldval <= a',
ExpressionAttributeValues={':newval': 2, ':oldval': [1,2]})
# However, when the wrong type comes from an item attribute, not the
# query, the comparison is simply false - not a ValidationException.
test_table_s.update_item(Key={'p': p}, AttributeUpdates={'x': {'Value': [1,2,3], 'Action': 'PUT'}})
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression='x <= :oldval',
ExpressionAttributeValues={':newval': 2, ':oldval': 1})
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression=':oldval <= x',
ExpressionAttributeValues={':newval': 2, ':oldval': 1})
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item']['z'] == 7
# Test for ConditionExpression with operator ">"
@@ -404,6 +478,44 @@ def test_update_condition_gt(test_table_s):
UpdateExpression='SET z = :newval',
ConditionExpression='a > :oldval',
ExpressionAttributeValues={':newval': 2, ':oldval': '17'})
# If the attribute being compared doesn't even exist, this is also
# considered as a false condition - not an error.
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression='q > :oldval',
ExpressionAttributeValues={':newval': 2, ':oldval': '17'})
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression=':oldval > q',
ExpressionAttributeValues={':newval': 2, ':oldval': '17'})
# If a comparison parameter comes from a constant specified in the query,
# and it has a type not supported by the comparison (e.g., a list), it's
# not just a failed comparison - it is considered a ValidationException
with pytest.raises(ClientError, match='ValidationException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression='a > :oldval',
ExpressionAttributeValues={':newval': 2, ':oldval': [1,2]})
with pytest.raises(ClientError, match='ValidationException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression=':oldval > a',
ExpressionAttributeValues={':newval': 2, ':oldval': [1,2]})
# However, when the wrong type comes from an item attribute, not the
# query, the comparison is simply false - not a ValidationException.
test_table_s.update_item(Key={'p': p}, AttributeUpdates={'x': {'Value': [1,2,3], 'Action': 'PUT'}})
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression='x > :oldval',
ExpressionAttributeValues={':newval': 2, ':oldval': 1})
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression=':oldval > x',
ExpressionAttributeValues={':newval': 2, ':oldval': 1})
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item']['z'] == 4
# Test for ConditionExpression with operator ">="
@@ -467,6 +579,44 @@ def test_update_condition_ge(test_table_s):
UpdateExpression='SET z = :newval',
ConditionExpression='a >= :oldval',
ExpressionAttributeValues={':newval': 2, ':oldval': '0'})
# If the attribute being compared doesn't even exist, this is also
# considered as a false condition - not an error.
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression='q >= :oldval',
ExpressionAttributeValues={':newval': 2, ':oldval': '17'})
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression=':oldval >= q',
ExpressionAttributeValues={':newval': 2, ':oldval': '17'})
# If a comparison parameter comes from a constant specified in the query,
# and it has a type not supported by the comparison (e.g., a list), it's
# not just a failed comparison - it is considered a ValidationException
with pytest.raises(ClientError, match='ValidationException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression='a >= :oldval',
ExpressionAttributeValues={':newval': 2, ':oldval': [1,2]})
with pytest.raises(ClientError, match='ValidationException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression=':oldval >= a',
ExpressionAttributeValues={':newval': 2, ':oldval': [1,2]})
# However, when the wrong type comes from an item attribute, not the
# query, the comparison is simply false - not a ValidationException.
test_table_s.update_item(Key={'p': p}, AttributeUpdates={'x': {'Value': [1,2,3], 'Action': 'PUT'}})
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression='x >= :oldval',
ExpressionAttributeValues={':newval': 2, ':oldval': 1})
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression=':oldval >= x',
ExpressionAttributeValues={':newval': 2, ':oldval': 1})
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item']['z'] == 7
# Test for ConditionExpression with ternary operator "BETWEEN" (checking
@@ -548,6 +698,60 @@ def test_update_condition_between(test_table_s):
UpdateExpression='SET z = :newval',
ConditionExpression='a BETWEEN :oldval1 AND :oldval2',
ExpressionAttributeValues={':newval': 2, ':oldval1': '0', ':oldval2': '2'})
# If the attribute being compared doesn't even exist, this is also
# considered as a false condition - not an error.
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression='q BETWEEN :oldval1 AND :oldval2',
ExpressionAttributeValues={':newval': 2, ':oldval1': b'dog', ':oldval2': b'zebra'})
# If an operand comes from the query and has a type not supported by the
# comparison (e.g., a list), it's not just a failed condition - it is
# considered a ValidationException
with pytest.raises(ClientError, match='ValidationException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression='a BETWEEN :oldval1 AND :oldval2',
ExpressionAttributeValues={':newval': 2, ':oldval1': [1,2], ':oldval2': [2,3]})
# However, when the wrong type comes from an item attribute, not the
# query, the comparison is simply false - not a ValidationException.
test_table_s.update_item(Key={'p': p}, AttributeUpdates={'x': {'Value': [1,2,3], 'Action': 'PUT'},
'y': {'Value': [2,3,4], 'Action': 'PUT'}})
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression='a BETWEEN x and y',
ExpressionAttributeValues={':newval': 2})
# If the two operands come from the query (":val" references) then if they
# have different types or the wrong order, this is a ValidationException.
# But if one or more of the operands come from the item, this only causes
# a false condition - not a ValidationException.
with pytest.raises(ClientError, match='ValidationException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression='a BETWEEN :oldval1 AND :oldval2',
ExpressionAttributeValues={':newval': 2, ':oldval1': 2, ':oldval2': 1})
with pytest.raises(ClientError, match='ValidationException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression='a BETWEEN :oldval1 AND :oldval2',
ExpressionAttributeValues={':newval': 2, ':oldval1': 2, ':oldval2': 'dog'})
test_table_s.update_item(Key={'p': p}, AttributeUpdates={'two': {'Value': 2, 'Action': 'PUT'}})
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression='a BETWEEN two AND :oldval',
ExpressionAttributeValues={':newval': 2, ':oldval': 1})
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression='a BETWEEN :oldval AND two',
ExpressionAttributeValues={':newval': 2, ':oldval': 3})
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression='a BETWEEN two AND :oldval',
ExpressionAttributeValues={':newval': 2, ':oldval': 'dog'})
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item']['z'] == 9
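
The BETWEEN rules exercised above distill to: a bad type or a reversed range in *query-supplied* operands is a ValidationException, while a wrong type coming from the *item* merely fails the condition. A simplified Python model (assumption: only numbers, strings, and bytes compare; the real type rules are richer):

```python
COMPARABLE = (int, float, str, bytes)

def eval_between(item_value, low, high, bounds_from_query):
    if bounds_from_query:
        # Constants in the query are validated eagerly.
        for v in (low, high):
            if not isinstance(v, COMPARABLE):
                raise ValueError("ValidationException")
        try:
            if low > high:  # reversed range is also a validation error
                raise ValueError("ValidationException")
        except TypeError:
            raise ValueError("ValidationException")  # mismatched bound types
    # A missing or non-comparable item value is just a failed condition.
    if item_value is None or not isinstance(item_value, COMPARABLE):
        return False
    try:
        return low <= item_value <= high
    except TypeError:
        return False  # wrong type from the item: false, not an error
```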
# Test for ConditionExpression with multi-operand operator "IN", checking
@@ -605,6 +809,13 @@ def test_update_condition_in(test_table_s):
UpdateExpression='SET c = :val37',
ConditionExpression='a IN ()',
ExpressionAttributeValues=values)
# If the attribute being compared doesn't even exist, this is also
# considered as a false condition - not an error.
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET c = :val37',
ConditionExpression='q IN ({})'.format(','.join(values.keys())),
ExpressionAttributeValues=values)
# Beyond the above operators, there are also test functions supported -
# attribute_exists, attribute_not_exists, attribute_type, begins_with,
@@ -1051,7 +1262,6 @@ def test_update_condition_other_funcs(test_table_s):
# ConditionExpressions also allows reading nested attributes, and we should
# support that too. This test just checks a few random operators - we don't
# test all the different operators here.
@pytest.mark.xfail(reason="nested attributes not yet implemented in ConditionExpression")
def test_update_condition_nested_attributes(test_table_s):
p = random_string()
test_table_s.update_item(Key={'p': p},
@@ -1066,6 +1276,21 @@ def test_update_condition_nested_attributes(test_table_s):
ConditionExpression='b.x < b.y[1]',
ExpressionAttributeValues={':val': 2})
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item']['c'] == 2
# Also check the case of a failing condition
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET c = :val',
ConditionExpression='b.x < b.y[0]',
ExpressionAttributeValues={':val': 3})
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item']['c'] == 2
# A condition involving an attribute which doesn't exist results in
# failed condition - not an error.
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET c = :val',
ConditionExpression='b.z < b.y[100]',
ExpressionAttributeValues={':val': 4})
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item']['c'] == 2
# All the previous tests referred to attributes using their name directly.
# But the DynamoDB API also allows referring to attributes using a #reference.
@@ -1082,16 +1307,15 @@ def test_update_condition_attribute_reference(test_table_s):
ExpressionAttributeValues={':val': 1})
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item']['c'] == 1
@pytest.mark.xfail(reason="nested attributes not yet implemented in ConditionExpression")
def test_update_condition_nested_attribute_reference(test_table_s):
p = random_string()
test_table_s.update_item(Key={'p': p},
AttributeUpdates={'and': {'Value': {'or': 1}, 'Action': 'PUT'}})
AttributeUpdates={'and': {'Value': {'or': 2}, 'Action': 'PUT'}})
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET c = :val',
ConditionExpression='attribute_exists (#name1.#name2)',
ConditionExpression='#name1.#name2 = :two',
ExpressionAttributeNames={'#name1': 'and', '#name2': 'or'},
ExpressionAttributeValues={':val': 1})
ExpressionAttributeValues={':val': 1, ':two': 2})
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item']['c'] == 1
# All the previous tests involved a single condition. The following tests


@@ -237,6 +237,30 @@ def test_update_expected_1_le(test_table_s):
'AttributeValueList': [2, 3]}}
)
# Comparison operators like le work only on numbers, strings or bytes.
# As noted in issue #8043, if any other type is included in *the query*,
# the result should be a ValidationException, but if the wrong type appears
# in the item, not the query, the result is a failed condition.
def test_update_expected_1_le_validation(test_table_s):
p = random_string()
test_table_s.update_item(Key={'p': p},
AttributeUpdates={'a': {'Value': 1, 'Action': 'PUT'},
'b': {'Value': [1,2], 'Action': 'PUT'}})
# Bad type (a list) in the query. Result is ValidationException.
with pytest.raises(ClientError, match='ValidationException'):
test_table_s.update_item(Key={'p': p},
AttributeUpdates={'z': {'Value': 17, 'Action': 'PUT'}},
Expected={'a': {'ComparisonOperator': 'LE',
'AttributeValueList': [[1,2,3]]}}
)
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
AttributeUpdates={'z': {'Value': 17, 'Action': 'PUT'}},
Expected={'b': {'ComparisonOperator': 'LE',
'AttributeValueList': [3]}}
)
assert not 'z' in test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item']
# Tests for Expected with ComparisonOperator = "LT":
def test_update_expected_1_lt(test_table_s):
p = random_string()
@@ -894,6 +918,34 @@ def test_update_expected_1_between(test_table_s):
AttributeUpdates={'z': {'Value': 2, 'Action': 'PUT'}},
Expected={'d': {'ComparisonOperator': 'BETWEEN', 'AttributeValueList': [set([1]), set([2])]}})
# BETWEEN works only on numbers, strings or bytes. As noted in issue #8043,
# if any other type is included in *the query*, the result should be a
# ValidationException, but if the wrong type appears in the item, not the
# query, the result is a failed condition.
# BETWEEN should also generate ValidationException if the two ends of the
# range are not of the same type or not in the correct order, but this
# already is tested in the test above (test_update_expected_1_between).
def test_update_expected_1_between_validation(test_table_s):
p = random_string()
test_table_s.update_item(Key={'p': p},
AttributeUpdates={'a': {'Value': 1, 'Action': 'PUT'},
'b': {'Value': [1,2], 'Action': 'PUT'}})
# Bad type (a list) in the query. Result is ValidationException.
with pytest.raises(ClientError, match='ValidationException'):
test_table_s.update_item(Key={'p': p},
AttributeUpdates={'z': {'Value': 17, 'Action': 'PUT'}},
Expected={'a': {'ComparisonOperator': 'BETWEEN',
'AttributeValueList': [[1,2,3], [2,3,4]]}}
)
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
AttributeUpdates={'z': {'Value': 17, 'Action': 'PUT'}},
Expected={'b': {'ComparisonOperator': 'BETWEEN',
'AttributeValueList': [1,2]}}
)
assert not 'z' in test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item']
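The query-vs-item distinction exercised by the two tests above can be sketched with a minimal, hypothetical helper (`ValidationException`, `COMPARABLE` and `evaluate_le` are illustration names, not Alternator's actual code): a wrong operand type in the query raises, while a wrong type in the item merely fails the condition.

```python
# Hypothetical sketch of the rule tested above - not Alternator code.
class ValidationException(Exception):
    pass

COMPARABLE = {'N', 'S', 'B'}  # numbers, strings, bytes

def evaluate_le(item_type, item_val, query_type, query_val):
    # A wrong type in the *query* is an error, checked up front...
    if query_type not in COMPARABLE:
        raise ValidationException('LE operand must be a number, string or bytes')
    # ...but a wrong (or mismatched) type in the *item* just fails the
    # condition instead of raising.
    if item_type != query_type:
        return False
    return item_val <= query_val
```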
##############################################################################
# Instead of ComparisonOperator and AttributeValueList, one can specify either
# Value or Exists:


@@ -235,6 +235,30 @@ def test_filter_expression_ge(test_table_sn_with_data):
expected_items = [item for item in items if item[xn] >= xv]
assert(got_items == expected_items)
# Comparison operators such as >= or BETWEEN only work on numbers, strings or
# bytes. When an expression's operands come from the item and have a wrong type
# (e.g., a list), the result is that the item is skipped - aborting the scan
# with a ValidationException is a bug (this was issue #8043).
def test_filter_expression_le_bad_type(test_table_sn_with_data):
table, p, items = test_table_sn_with_data
got_items = full_query(table, KeyConditionExpression='p=:p', FilterExpression='l <= :xv',
ExpressionAttributeValues={':p': p, ':xv': 3})
assert got_items == []
got_items = full_query(table, KeyConditionExpression='p=:p', FilterExpression=':xv <= l',
ExpressionAttributeValues={':p': p, ':xv': 3})
assert got_items == []
def test_filter_expression_between_bad_type(test_table_sn_with_data):
table, p, items = test_table_sn_with_data
got_items = full_query(table, KeyConditionExpression='p=:p', FilterExpression='s between :xv and l',
ExpressionAttributeValues={':p': p, ':xv': 'cat'})
assert got_items == []
got_items = full_query(table, KeyConditionExpression='p=:p', FilterExpression='s between l and :xv',
ExpressionAttributeValues={':p': p, ':xv': 'cat'})
assert got_items == []
got_items = full_query(table, KeyConditionExpression='p=:p', FilterExpression='s between i and :xv',
ExpressionAttributeValues={':p': p, ':xv': 'cat'})
assert got_items == []
# Test the "BETWEEN/AND" ternary operator on a numeric, string and bytes
# attribute. These keywords are case-insensitive.
def test_filter_expression_between(test_table_sn_with_data):
@@ -719,7 +743,6 @@ def test_filter_expression_and_attributes_to_get(test_table):
# support that too. This test just checks one operator - we don't
# test all the different operators again here, we will assume the same
# code is used internally so if one operator worked, all will work.
@pytest.mark.xfail(reason="nested attributes not yet implemented in FilterExpression")
def test_filter_expression_nested_attribute(test_table):
p = random_string()
test_table.put_item(Item={'p': p, 'c': 'hi', 'x': {'a': 'dog', 'b': 3}})
@@ -730,7 +753,6 @@ def test_filter_expression_nested_attribute(test_table):
ExpressionAttributeValues={':p': p, ':a': 'mouse'})
assert(got_items == [{'p': p, 'c': 'yo', 'x': {'a': 'mouse', 'b': 4}}])
@pytest.mark.xfail(reason="nested attributes not yet implemented in FilterExpression")
# This test is a version of test_filter_expression_and_projection_expression
# involving nested attributes. In that test, we had a filter and projection
# involving different attributes. Nested attributes open new corner cases:


@@ -334,6 +334,33 @@ def test_gsi_wrong_type_attribute_update(test_table_gsi_2):
test_table_gsi_2.update_item(Key={'p': p}, AttributeUpdates={'x': {'Value': 3, 'Action': 'PUT'}})
assert test_table_gsi_2.get_item(Key={'p': p}, ConsistentRead=True)['Item'] == {'p': p, 'x': x}
# Since a GSI key x cannot be a map or an array, in particular updates to
# nested attributes like x.y or x[1] are not legal. The error that DynamoDB
# reports is "Key attributes must be scalars; list random access '[]' and map
# lookup '.' are not allowed: IndexKey: x".
def test_gsi_wrong_type_attribute_update_nested(test_table_gsi_2):
p = random_string()
x = random_string()
test_table_gsi_2.put_item(Item={'p': p, 'x': x})
# We can't write a map into a GSI key column, which in this case can only
# be a string and in any case can never be a map. DynamoDB and Alternator
# report different errors here: DynamoDB reports a type mismatch (exactly
# like in test test_gsi_wrong_type_attribute_update), but Alternator
# reports the obscure message "Malformed value object for key column x".
# Alternator's error message should probably be improved here, but let's
# not test it in this test.
with pytest.raises(ClientError, match='ValidationException'):
test_table_gsi_2.update_item(Key={'p': p}, UpdateExpression='SET x = :val1',
ExpressionAttributeValues={':val1': {'a': 3, 'b': 4}})
# Here we try to set x.y for the GSI key column x. Again DynamoDB and
# Alternator produce different error messages - but both make sense.
# DynamoDB says "Key attributes must be scalars; list random access '[]'
# and map lookup '.' are not allowed: IndexKey: x", while Alternator
# complains that "document paths not valid for this item: x.y".
with pytest.raises(ClientError, match='ValidationException'):
test_table_gsi_2.update_item(Key={'p': p}, UpdateExpression='SET x.y = :val1',
ExpressionAttributeValues={':val1': 3})
def test_gsi_wrong_type_attribute_batch(test_table_gsi_2):
# In a BatchWriteItem, if any update is forbidden, the entire batch is
# rejected, and none of the updates happen at all.


@@ -367,6 +367,13 @@ def test_getitem_attributes_to_get(dynamodb, test_table):
expected_item = {k: item[k] for k in wanted if k in item}
assert expected_item == got_item
# Verify that it is forbidden to ask for the same attribute multiple times
def test_getitem_attributes_to_get_duplicate(dynamodb, test_table):
p = random_string()
c = random_string()
with pytest.raises(ClientError, match='ValidationException.*Duplicate'):
test_table.get_item(Key={'p': p, 'c': c}, AttributesToGet=['a', 'a'], ConsistentRead=True)
# Basic test for DeleteItem, with hash key only
def test_delete_item_hash(test_table_s):
p = random_string()


@@ -116,7 +116,6 @@ def test_projection_expression_query(test_table):
# but the previous test checked that the alternative syntax works correctly.
# The following test checks fetching more elaborate attribute paths from
# nested documents.
@pytest.mark.xfail(reason="ProjectionExpression does not yet support attribute paths")
def test_projection_expression_path(test_table_s):
p = random_string()
test_table_s.put_item(Item={
@@ -143,11 +142,21 @@ def test_projection_expression_path(test_table_s):
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True, ProjectionExpression='a.x')['Item'] == {}
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True, ProjectionExpression='a.x.y')['Item'] == {}
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True, ProjectionExpression='a.b[3].x')['Item'] == {}
# Similarly, indexing a dictionary as an array, or array as dictionary, or
# integer as either, yields an empty item.
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True, ProjectionExpression='a.b.x')['Item'] == {}
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True, ProjectionExpression='a[0]')['Item'] == {}
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True, ProjectionExpression='a.b[0].x')['Item'] == {}
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True, ProjectionExpression='a.b[0][0]')['Item'] == {}
# We can read multiple paths - the result are merged into one object
# structured the same way as in the original item:
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True, ProjectionExpression='a.b[0],a.b[1]')['Item'] == {'a': {'b': [2, 4]}}
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True, ProjectionExpression='a.b[0],a.c')['Item'] == {'a': {'b': [2], 'c': 5}}
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True, ProjectionExpression='a.c,b')['Item'] == {'a': {'c': 5}, 'b': 'hello'}
# If some of the paths are not available, they are silently ignored (just
# like they returned an empty item when used alone earlier)
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True, ProjectionExpression='a.x, a.b[0], x, a.b[3].x')['Item'] == {'a': {'b': [2]}}
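The merge behavior these asserts demonstrate can be sketched with a small hypothetical helper (`project` and `_components` are illustration names, not the server's code): extract each path's value, silently skip missing paths, then rebuild one object shaped like the original item, compacting list indexes.

```python
# Hypothetical sketch of ProjectionExpression path merging - not
# Alternator code.
import re

def _components(path):
    # Split "a.b[0].c" into ['a', 'b', 0, 'c'].
    return [int(t[0]) if t[0] else t[1]
            for t in re.findall(r'\[(\d+)\]|\.?([^.\[\]]+)', path)]

def project(item, expression):
    # Extract every path's value; missing paths are silently ignored.
    picked = []
    for path in expression.split(','):
        comps = _components(path.strip())
        node = item
        try:
            for c in comps:
                node = node[c]
        except (KeyError, IndexError, TypeError):
            continue
        picked.append((comps, node))
    # Merge into one tree keyed by components, then turn levels whose
    # keys are list indexes back into (compacted) lists.
    tree = {}
    for comps, value in picked:
        d = tree
        for c in comps[:-1]:
            d = d.setdefault(c, {})
        d[comps[-1]] = value
    def compact(node):
        if not isinstance(node, dict):
            return node
        if node and all(isinstance(k, int) for k in node):
            return [compact(node[k]) for k in sorted(node)]
        return {k: compact(v) for k, v in node.items()}
    return compact(tree)
```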
# It is not allowed to read the same path multiple times. The error from
# DynamoDB looks like: "Invalid ProjectionExpression: Two document paths
# overlap with each other; must remove or rewrite one of these paths;
@@ -160,6 +169,14 @@ def test_projection_expression_path(test_table_s):
with pytest.raises(ClientError, match='ValidationException'):
test_table_s.get_item(Key={'p': p}, ConsistentRead=True, ProjectionExpression='a,a.b[0]')['Item']
# Above we noted that asking to project a non-existent attribute in an
# existing item yields an empty Item object. However, if the item does not
# exist at all, the Item object will be missing entirely:
p = random_string()
assert not 'Item' in test_table_s.get_item(Key={'p': p}, ConsistentRead=True, ProjectionExpression='x')
assert not 'Item' in test_table_s.get_item(Key={'p': p}, ConsistentRead=True, ProjectionExpression='a.x')
assert not 'Item' in test_table_s.get_item(Key={'p': p}, ConsistentRead=True, ProjectionExpression='a[0]')
# Above in test_projection_expression_toplevel_syntax() we tested how
# name references (#name) work in top-level attributes. In the following
# two tests we test how they work in more elaborate paths:
@@ -167,7 +184,6 @@ def test_projection_expression_path(test_table_s):
# 2. Conversely, a single reference, e.g., "#a", is always a single path
# component. Even if "#a" is "a.b", this refers to the literal attribute
# "a.b" - with a dot in its name - and not to the b element in a.
@pytest.mark.xfail(reason="ProjectionExpression does not yet support attribute paths")
def test_projection_expression_path_references(test_table_s):
p = random_string()
test_table_s.put_item(Item={'p': p, 'a': {'b': 1, 'c': 2}, 'b': 'hi'})
@@ -181,10 +197,9 @@ def test_projection_expression_path_references(test_table_s):
with pytest.raises(ClientError, match='ValidationException'):
test_table_s.get_item(Key={'p': p}, ConsistentRead=True, ProjectionExpression='a.#n2', ExpressionAttributeNames={'#n2': 'b', '#unused': 'x'})
@pytest.mark.xfail(reason="ProjectionExpression does not yet support attribute paths")
def test_projection_expression_path_dot(test_table_s):
p = random_string()
test_table_s.put_item(Item={'p': p, 'a.b': 'hi', 'a': {'b': 'yo'}})
test_table_s.put_item(Item={'p': p, 'a.b': 'hi', 'a': {'b': 'yo', 'c': 'jo'}})
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True, ProjectionExpression='a.b')['Item'] == {'a': {'b': 'yo'}}
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True, ProjectionExpression='#name', ExpressionAttributeNames={'#name': 'a.b'})['Item'] == {'a.b': 'hi'}
@@ -192,7 +207,6 @@ def test_projection_expression_path_dot(test_table_s):
# ProjectionExpression. This includes both identical paths, and paths where
# one is a sub-path of the other - e.g. "a.b" and "a.b.c". As we already saw
# above, paths with just a common *prefix* - e.g., "a.b, a.c" - are fine.
@pytest.mark.xfail(reason="ProjectionExpression does not yet support attribute paths")
def test_projection_expression_path_overlap(test_table_s):
# The overlap is tested symbolically, on the given paths, without any
# relation to what the item contains, or whether it even exists. So we
@@ -207,13 +221,57 @@ def test_projection_expression_path_overlap(test_table_s):
'a.b, a.b[2]',
'a.b, a.b.c',
'a, a.b[2].c',
'a.b.d, a.b',
'a.b.d.e, a.b',
'a.b, a.b.d',
'a.b, a.b.d.e',
]:
with pytest.raises(ClientError, match='ValidationException.* overlap'):
print(expr)
test_table_s.get_item(Key={'p': p}, ConsistentRead=True, ProjectionExpression=expr)
# The checks above can be easily passed by an over-zealous "overlap" check
# which declares everything an overlap :-) Let's also check some non-
# overlap cases - which shouldn't be declared an overlap.
for expr in ['a, b',
'a.b, a.c',
'a.b.d, a.b.e',
'a[1], a[2]',
'a.b, a.c[2]',
]:
print(expr)
test_table_s.get_item(Key={'p': p}, ConsistentRead=True, ProjectionExpression=expr)
# In addition to not allowing "overlapping" paths, DynamoDB also does not
# allow "conflicting" paths: It does not allow giving both a.b and a[1] in a
# single ProjectionExpression. It gives the error:
# "Invalid ProjectionExpression: Two document paths conflict with each other;
# must remove or rewrite one of these paths; path one: [a, b], path two:
# [a, [1]]".
# The reasoning is that asking for both in one request makes no sense because
# no item will ever be able to fulfill both.
def test_projection_expression_path_conflict(test_table_s):
# The conflict is tested symbolically, on the given paths, without any
# relation to what the item contains, or whether it even exists. So we
# don't even need to create an item for this test. We still need a
# key for the GetItem call :-)
p = random_string()
for expr in ['a.b, a[1]',
'a[1], a.b',
'a.b[1], a.b.c',
'a.b.c, a.b[1]',
]:
with pytest.raises(ClientError, match='ValidationException.* conflict'):
test_table_s.get_item(Key={'p': p}, ConsistentRead=True, ProjectionExpression=expr)
# The checks above can be easily passed by an over-zealous "conflict" check
# which declares everything a conflict :-) Let's also check some non-
# conflict cases - which shouldn't be declared a conflict.
for expr in ['a.b, a.c',
'a.b, a.c[1]',
]:
test_table_s.get_item(Key={'p': p}, ConsistentRead=True, ProjectionExpression=expr)
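The symbolic overlap/conflict classification exercised by the two tests above can be sketched as follows - a hypothetical helper (not Alternator's actual code), assuming paths are split into map-key and list-index components: a shared prefix is an overlap, a divergence in component *kind* is a conflict, and any other divergence is fine.

```python
# Hypothetical sketch of the symbolic path check - not Alternator code.
import re

def _components(path):
    # ('key', name) for map lookup, ('index', n) for list access.
    return [('index', int(t[0])) if t[0] else ('key', t[1])
            for t in re.findall(r'\[(\d+)\]|\.?([^.\[\]]+)', path)]

def classify(p1, p2):
    for c1, c2 in zip(_components(p1), _components(p2)):
        if c1 == c2:
            continue
        # First difference: same kind (two keys, or two indexes) means
        # the paths simply diverge; different kind is a conflict.
        return 'conflict' if c1[0] != c2[0] else 'ok'
    # All shared components equal: one path is a prefix of the other
    # (or they are identical) - that's an overlap.
    return 'overlap'
```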
# Above we nested paths in ProjectionExpression, but just for the GetItem
# request. Let's verify they also work in Query and Scan requests:
@pytest.mark.xfail(reason="ProjectionExpression does not yet support attribute paths")
def test_query_projection_expression_path(test_table):
p = random_string()
items = [{'p': p, 'c': str(i), 'a': {'x': str(i*10), 'y': 'hi'}, 'b': 'hello' } for i in range(10)]
@@ -224,7 +282,6 @@ def test_query_projection_expression_path(test_table):
expected_items = [{'a': {'x': x['a']['x']}} for x in items]
assert multiset(expected_items) == multiset(got_items)
@pytest.mark.xfail(reason="ProjectionExpression does not yet support attribute paths")
def test_scan_projection_expression_path(test_table):
# This test is similar to test_query_projection_expression_path above,
# but uses a scan instead of a query. The scan will generate unrelated
@@ -241,9 +298,8 @@ def test_scan_projection_expression_path(test_table):
# BatchGetItem also supports ProjectionExpression, let's test that it
# applies to all items, and that it correctly supports document paths as well.
@pytest.mark.xfail(reason="ProjectionExpression does not yet support attribute paths")
def test_batch_get_item_projection_expression_path(test_table_s):
items = [{'p': random_string(), 'a': {'b': random_string()}, 'c': random_string()} for i in range(3)]
items = [{'p': random_string(), 'a': {'b': random_string(), 'x': 'hi'}, 'c': random_string()} for i in range(3)]
with test_table_s.batch_writer() as batch:
for item in items:
batch.put_item(item)
@@ -285,3 +341,20 @@ def test_projection_expression_and_key_condition_expression(test_table_s):
ExpressionAttributeNames={'#name1': 'p', '#name2': 'a'},
ExpressionAttributeValues={':val1': p});
assert got_items == [{'a': 'hello'}]
# Test whether the nesting depth of a path in a projection expression
# is limited. If the implementation is done using recursion, it is good
# practice to limit it and not crash the server. According to the DynamoDB
# documentation, DynamoDB supports nested attributes up to 32 levels deep:
# https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Limits.html#limits-attributes-nested-depth
# There is no reason why Alternator should not use exactly the same limit
# as is officially documented by DynamoDB.
def test_projection_expression_path_nesting_levels(test_table_s):
p = random_string()
# 32 nesting levels (including the top-level attribute) work
test_table_s.get_item(Key={'p': p}, ConsistentRead=True, ProjectionExpression='a'+('.b'*31))
# 33 nesting levels do not. DynamoDB gives an error: "Invalid
# ProjectionExpression: The document path has too many nesting levels;
# nesting levels: 33".
with pytest.raises(ClientError, match='ValidationException.*nesting levels'):
test_table_s.get_item(Key={'p': p}, ConsistentRead=True, ProjectionExpression='a'+('.b'*32))
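Such a depth check can be done without any recursion at all - a hypothetical sketch that simply counts path components, assuming the documented limit of 32:

```python
# Hypothetical sketch of a nesting-depth check - not Alternator code.
import re

MAX_NESTING = 32  # documented DynamoDB limit

def nesting_levels(path):
    # Count components: the top-level attribute plus every '.name'
    # map lookup and every '[n]' list dereference.
    return len(re.findall(r'\[\d+\]|\.?[^.\[\]]+', path))
```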


@@ -346,7 +346,6 @@ def test_update_item_returnvalues_updated_new(test_table_s):
# Test the ReturnValues from an UpdateItem directly modifying a *nested*
# attribute, in the relevant ReturnValue modes:
@pytest.mark.xfail(reason="nested updates not yet implemented")
def test_update_item_returnvalues_nested(test_table_s):
p = random_string()
test_table_s.put_item(Item={'p': p, 'a': {'b': 'dog', 'c': [1, 2, 3]}, 'd': 'cat'})
@@ -398,3 +397,37 @@ def test_update_item_returnvalues_nested(test_table_s):
UpdateExpression='SET a.c[1] = :val, a.b = :val2',
ExpressionAttributeValues={':val': 3, ':val2': 'dog'})
assert ret['Attributes'] == {'p': p, 'a': {'b': 'dog', 'c': [1, 3, 3]}, 'd': 'cat' }
# Test with REMOVE, and one of them doing nothing (so shouldn't be in UPDATED_OLD)
ret=test_table_s.update_item(Key={'p': p}, ReturnValues='UPDATED_OLD',
UpdateExpression='REMOVE a.c[1], a.c[3]')
assert ret['Attributes'] == {'a': {'c': [3]}} # a.c[3] did not exist
# When adding a new sub-attribute, UPDATED_OLD does not return anything,
# but UPDATED_NEW does:
ret=test_table_s.update_item(Key={'p': p}, ReturnValues='UPDATED_OLD',
UpdateExpression='SET a.x1 = :val',
ExpressionAttributeValues={':val': 8})
assert not 'Attributes' in ret
ret=test_table_s.update_item(Key={'p': p}, ReturnValues='UPDATED_NEW',
UpdateExpression='SET a.x2 = :val',
ExpressionAttributeValues={':val': 8})
assert ret['Attributes'] == {'a': {'x2': 8}}
# Nevertheless, there is one strange exception - although setting an array
# element *beyond* its end (e.g., a.c[100]) does add a new array item, it
# is *not* returned by UPDATED_NEW. I am not sure DynamoDB did this
# deliberately (is it a bug or a feature?), but it simplifies our
# implementation as well: after setting a.c[100], a.c[100] is not actually
# set (instead, maybe a.c[4] was set), so asking UPDATED_NEW for a.c[100]
# returns nothing.
ret=test_table_s.update_item(Key={'p': p}, ReturnValues='UPDATED_NEW',
UpdateExpression='SET a.c[100] = :val',
ExpressionAttributeValues={':val': 70})
assert not 'Attributes' in ret
# When removing a list item, it shouldn't appear in UPDATED_NEW. Again there
# is a strange exception - which I'm not sure whether to consider a
# DynamoDB bug or a feature - but it simplifies our own implementation as
# well: if we remove a.c[1], the new item *will* have a new a.c[1] (the
# previous a.c[2] is shifted back), so this value is returned. This is
# very odd, but this is what DynamoDB does, so we should too...
ret=test_table_s.update_item(Key={'p': p}, ReturnValues='UPDATED_NEW',
UpdateExpression='REMOVE a.c[1]')
assert ret['Attributes'] == {'a': {'c': [70]}}


@@ -265,16 +265,64 @@ def test_update_expression_multi_overlap(test_table_s):
# The problem isn't just with identical paths - we can't modify two paths that
# "overlap" in the sense that one is the ancestor of the other.
@pytest.mark.xfail(reason="nested updates not yet implemented")
def test_update_expression_multi_overlap_nested(test_table_s):
p = random_string()
with pytest.raises(ClientError, match='ValidationException.*overlap'):
test_table_s.update_item(Key={'p': p}, UpdateExpression='SET a = :val1, a.b = :val2',
ExpressionAttributeValues={':val1': {'b': 7}, ':val2': 'there'})
test_table_s.put_item(Item={'p': p, 'a': {'b': {'c': 2}}})
with pytest.raises(ClientError, match='ValidationException.*overlap'):
test_table_s.update_item(Key={'p': p}, UpdateExpression='SET a.b = :val1, a.b.c = :val2',
ExpressionAttributeValues={':val1': 'hi', ':val2': 'there'})
# Note that the overlap checks happen before checking the actual content
# of the item, so it doesn't matter that the item we want to modify
# doesn't even exist or have the structure referenced by the paths below.
for expr in ['SET a = :val1, a.b = :val2',
'SET a.b = :val1, a = :val2',
'SET a.b = :val1, a.b = :val2',
'SET a.b = :val1, a.b.c = :val2',
'SET a.b.c = :val1, a.b = :val2',
'SET a.b.c = :val1, a.b.c = :val2',
'SET a.b = :val1, a.b.c.d.e = :val2',
'SET a.b.c.d.e = :val1, a.b = :val2',
'SET a = :val1, a[1] = :val2',
'SET a[1] = :val1, a = :val2',
'SET a[1] = :val1, a[1] = :val2',
'SET a[1][1] = :val1, a[1] = :val2',
'SET a[1] = :val1, a[1][1] = :val2',
'SET a[1][1] = :val1, a[1][1] = :val2',
'SET a[1][1][1][1] = :val1, a[1][1] = :val2',
'SET a[1][1] = :val1, a[1][1][1][1] = :val2',
'SET a[1][1][1][1] = :val1, a[1][1][1][1] = :val2',
]:
print(expr)
with pytest.raises(ClientError, match='ValidationException.*overlap'):
test_table_s.update_item(Key={'p': p}, UpdateExpression=expr,
ExpressionAttributeValues={':val1': 2, ':val2': 'there'})
# Obviously this test can trivially pass if the overlap check wrongly labels
# everything as an overlap. So the test test_update_expression_multi_nested
# below is important - it confirms that we can do multiple modifications
# to the same item when they do not overlap.
# Besides the concept of "overlapping" paths tested above, DynamoDB also has
# the concept of "conflicting" paths - e.g., attempting to set both a.b and
# a[1] together doesn't make sense.
def test_update_expression_multi_conflict_nested(test_table_s):
p = random_string()
for expr in ['SET a.b = :val1, a[1] = :val2',
'SET a.b.c = :val1, a.b[2] = :val2',
]:
print(expr)
with pytest.raises(ClientError, match='ValidationException.*conflict'):
test_table_s.update_item(Key={'p': p}, UpdateExpression=expr,
ExpressionAttributeValues={':val1': 2, ':val2': 'there'})
# We can do several non-overlapping modifications to the same top-level
# attribute and to different top-level attributes in the same update
# expression.
def test_update_expression_multi_nested(test_table_s):
p = random_string()
test_table_s.put_item(Item={'p': p, 'a': {'x': 3, 'y': 4, 'c': {'y': 3}}, 'b': {'x': 1, 'y': 2}})
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET a.b = :val1, a.c.d = :val2, b.x = :val3 REMOVE a.x, b.y',
ExpressionAttributeValues={':val1': 10, ':val2': 'dog', ':val3': 17})
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item'] == {
'p': p,
'a': {'y': 4, 'b': 10, 'c': {'y': 3, 'd': 'dog'}},
'b': {'x': 17}}
# In the previous test we saw that *modifying* the same item twice in the same
# update is forbidden; but it is allowed to *read* an item in the same update
@@ -776,10 +824,19 @@ def test_update_expression_dot_in_name(test_table_s):
ExpressionAttributeNames={'#a': 'a.b'})
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item'] == {'p': p, 'a.b': 3}
# Below we have several tests of what happens when a nested attribute is
# on the left-hand side of an assignment, but an even simpler case of
# nested attributes is having one on the right hand side of an assignment:
def test_update_expression_nested_attribute_rhs(test_table_s):
p = random_string()
test_table_s.put_item(Item={'p': p, 'a': {'b': 3, 'c': {'x': 7, 'y': 8}}, 'd': 5})
test_table_s.update_item(Key={'p': p}, UpdateExpression='SET z = a.c.x')
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item']['z'] == 7
# A basic test for direct update of a nested attribute: One of the top-level
# attributes is itself a document, and we update only one of that document's
# nested attributes.
@pytest.mark.xfail(reason="nested updates not yet implemented")
def test_update_expression_nested_attribute_dot(test_table_s):
p = random_string()
test_table_s.put_item(Item={'p': p, 'a': {'b': 3, 'c': 4}, 'd': 5})
@@ -795,7 +852,6 @@ def test_update_expression_nested_attribute_dot(test_table_s):
# Similar test, for a list: one of the top-level attributes is a list, we
# can update one of its items.
@pytest.mark.xfail(reason="nested updates not yet implemented")
def test_update_expression_nested_attribute_index(test_table_s):
p = random_string()
test_table_s.put_item(Item={'p': p, 'a': ['one', 'two', 'three']})
@@ -817,7 +873,6 @@ def test_update_expression_nested_attribute_index_reference(test_table_s):
# Test that just like happens in top-level attributes, also in nested
# attributes, setting them replaces the old value - potentially an entire
# nested document, by the whole value (which may have a different type)
@pytest.mark.xfail(reason="nested updates not yet implemented")
def test_update_expression_nested_different_type(test_table_s):
p = random_string()
test_table_s.put_item(Item={'p': p, 'a': {'b': 3, 'c': {'one': 1, 'two': 2}}})
@@ -827,7 +882,6 @@ def test_update_expression_nested_different_type(test_table_s):
# Yet another test of a nested attribute update. This one uses deeper
# level of nesting (dots and indexes), adds #name references to the mix.
@pytest.mark.xfail(reason="nested updates not yet implemented")
def test_update_expression_nested_deep(test_table_s):
p = random_string()
test_table_s.put_item(Item={'p': p, 'a': {'b': 3, 'c': ['hi', {'x': {'y': [3, 5, 7]}}]}})
@@ -841,20 +895,55 @@ def test_update_expression_nested_deep(test_table_s):
# A REMOVE operation can be used to remove nested attributes, and also
# individual list items.
@pytest.mark.xfail(reason="nested updates not yet implemented")
def test_update_expression_nested_remove(test_table_s):
p = random_string()
test_table_s.put_item(Item={'p': p, 'a': {'b': 3, 'c': ['hi', {'x': {'y': [3, 5, 7]}, 'q': 2}]}})
test_table_s.update_item(Key={'p': p}, UpdateExpression='REMOVE a.c[1].x.y[1], a.c[1].q')
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item']['a'] == {'b': 3, 'c': ['hi', {'x': {'y': [3, 7]}}]}
# Removing a list item beyond the end of the list (e.g., REMOVE a[17] when
# the list only has three items) is silently ignored.
def test_update_expression_nested_remove_list_item_after_end(test_table_s):
p = random_string()
test_table_s.put_item(Item={'p': p, 'a': [4, 5, 6]})
test_table_s.update_item(Key={'p': p}, UpdateExpression='REMOVE a[17]')
# If we remove a[1] and then change a[3], the index "3" refers to the position
# *before* the first removal.
def test_update_expression_nested_remove_list_item_original_number(test_table_s):
p = random_string()
test_table_s.put_item(Item={'p': p, 'a': [2, 3, 4, 5, 6, 7]})
test_table_s.update_item(Key={'p': p}, UpdateExpression='REMOVE a[1] SET a[3] = :val',
ExpressionAttributeValues={':val': 17})
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item']['a'] == [2, 4, 17, 6, 7]
# The order of the operations doesn't matter
test_table_s.put_item(Item={'p': p, 'a': [2, 3, 4, 5, 6, 7]})
test_table_s.update_item(Key={'p': p}, UpdateExpression='SET a[3] = :val REMOVE a[1]',
ExpressionAttributeValues={':val': 17})
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item']['a'] == [2, 4, 17, 6, 7]
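The original-position semantics shown above can be sketched as a hypothetical helper (not Alternator's implementation): apply all SETs at the original indexes, then drop the removed positions, so expression order is irrelevant.

```python
# Hypothetical sketch of REMOVE/SET on list elements - not Alternator
# code. All indexes refer to positions in the *original* list.
def apply_list_ops(lst, sets, removes):
    out = list(lst)
    for i, v in sets.items():
        out[i] = v  # SET at the original position
    # REMOVE also uses original positions; drop them afterwards.
    return [v for i, v in enumerate(out) if i not in removes]
```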
# DynamoDB allows an empty map. So removing the only member from a map leaves
# behind an empty map.
def test_update_expression_nested_remove_singleton_map(test_table_s):
p = random_string()
test_table_s.put_item(Item={'p': p, 'a': {'b': 1}})
test_table_s.update_item(Key={'p': p}, UpdateExpression='REMOVE a.b')
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item']['a'] == {}
# DynamoDB allows an empty list. So removing the only member from a list leaves
# behind an empty list.
def test_update_expression_nested_remove_singleton_list(test_table_s):
p = random_string()
test_table_s.put_item(Item={'p': p, 'a': [1]})
test_table_s.update_item(Key={'p': p}, UpdateExpression='REMOVE a[0]')
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item']['a'] == []
# The DynamoDB documentation specifies: "When you use SET to update a list
# element, the contents of that element are replaced with the new data that
# you specify. If the element does not already exist, SET will append the
# new element at the end of the list."
# So if we take a three-element list a[7], and set a[7], the new element
# will be put at the end of the list, not position 7 specifically.
@pytest.mark.xfail(reason="nested updates not yet implemented")
def test_nested_attribute_update_array_out_of_bounds(test_table_s):
p = random_string()
test_table_s.put_item(Item={'p': p, 'a': ['one', 'two', 'three']})
@@ -864,9 +953,9 @@ def test_nested_attribute_update_array_out_of_bounds(test_table_s):
# The DynamoDB documentation also says: "If you add multiple elements
# in a single SET operation, the elements are sorted in order by element
# number."
test_table_s.update_item(Key={'p': p}, UpdateExpression='SET a[84] = :val1, a[37] = :val2',
ExpressionAttributeValues={':val1': 'a1', ':val2': 'a2'})
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item'] == {'p': p, 'a': ['one', 'two', 'three', 'hello', 'a2', 'a1']}
test_table_s.update_item(Key={'p': p}, UpdateExpression='SET a[84] = :val1, a[37] = :val2, a[17] = :val3, a[50] = :val4',
ExpressionAttributeValues={':val1': 'a1', ':val2': 'a2', ':val3': 'a3', ':val4': 'a4'})
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item'] == {'p': p, 'a': ['one', 'two', 'three', 'hello', 'a3', 'a2', 'a4', 'a1']}
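The sorted-append behavior can be sketched as follows (hypothetical helper names, assuming the documented semantics): indexes within the list replace in place, while indexes beyond the end are appended in increasing index order.

```python
# Hypothetical sketch of out-of-range SET on a list - not Alternator
# code.
def apply_out_of_range_sets(lst, sets):
    out = list(lst)
    tail = []
    for i in sorted(sets):
        if i < len(out):
            out[i] = sets[i]       # in range: replace in place
        else:
            tail.append(sets[i])   # beyond the end: append, sorted by index
    return out + tail
```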
# Test what happens if we try to write to a.b, which would only make sense if
# a were a nested document, but a doesn't exist, or exists and is NOT a nested
@@ -875,9 +964,6 @@ def test_nested_attribute_update_array_out_of_bounds(test_table_s):
# ClientError: An error occurred (ValidationException) when calling the
# UpdateItem operation: The document path provided in the update expression
# is invalid for update
# Because Scylla doesn't read before write, it cannot detect this as an error,
# so we'll probably want to allow for that possibility as well.
@pytest.mark.xfail(reason="nested updates not yet implemented")
def test_nested_attribute_update_bad_path_dot(test_table_s):
p = random_string()
test_table_s.put_item(Item={'p': p, 'a': 'hello', 'b': ['hi']})
@@ -890,17 +976,56 @@ def test_nested_attribute_update_bad_path_dot(test_table_s):
with pytest.raises(ClientError, match='ValidationException.*path'):
test_table_s.update_item(Key={'p': p}, UpdateExpression='SET c.c = :val1',
ExpressionAttributeValues={':val1': 7})
# Same errors for "remove" operation.
with pytest.raises(ClientError, match='ValidationException.*path'):
test_table_s.update_item(Key={'p': p}, UpdateExpression='REMOVE a.c')
with pytest.raises(ClientError, match='ValidationException.*path'):
test_table_s.update_item(Key={'p': p}, UpdateExpression='REMOVE b.c')
with pytest.raises(ClientError, match='ValidationException.*path'):
test_table_s.update_item(Key={'p': p}, UpdateExpression='REMOVE c.c')
# Same error when the item doesn't exist at all:
p = random_string()
with pytest.raises(ClientError, match='ValidationException.*path'):
test_table_s.update_item(Key={'p': p}, UpdateExpression='SET c.c = :val1',
ExpressionAttributeValues={':val1': 7})
# HOWEVER, it *is* allowed to remove a random path from a non-
# existent item. I don't know why. See the next test -
# test_nested_attribute_remove_from_missing_item
# Though in the above test (test_nested_attribute_update_bad_path_dot) we
# showed that DynamoDB does not allow REMOVE x.y if attribute x doesn't
# exist - and generates a ValidationException, it turns out that if the
# entire item doesn't exist, then a REMOVE x.y is silently ignored.
# I don't understand why they did this.
@pytest.mark.xfail(reason="for unknown reason, DynamoDB allows REMOVE x.y when item doesn't exist")
def test_nested_attribute_remove_from_missing_item(test_table_s):
p = random_string()
test_table_s.update_item(Key={'p': p}, UpdateExpression='REMOVE x.y')
test_table_s.update_item(Key={'p': p}, UpdateExpression='REMOVE x[0]')
# Similarly for other types of bad paths - using [0] on something which
# doesn't exist or isn't an array.
@pytest.mark.xfail(reason="nested updates not yet implemented")
def test_nested_attribute_update_bad_path_array(test_table_s):
p = random_string()
test_table_s.put_item(Item={'p': p, 'a': 'hello'})
with pytest.raises(ClientError, match='ValidationException.*path'):
test_table_s.update_item(Key={'p': p}, UpdateExpression='SET a[0] = :val1',
ExpressionAttributeValues={':val1': 7})
with pytest.raises(ClientError, match='ValidationException.*path'):
test_table_s.update_item(Key={'p': p}, UpdateExpression='SET b[0] = :val1',
ExpressionAttributeValues={':val1': 7})
# Same errors for "remove" operation.
with pytest.raises(ClientError, match='ValidationException.*path'):
test_table_s.update_item(Key={'p': p}, UpdateExpression='REMOVE a[0]')
with pytest.raises(ClientError, match='ValidationException.*path'):
test_table_s.update_item(Key={'p': p}, UpdateExpression='REMOVE b[0]')
# Same error when the item doesn't exist at all:
p = random_string()
with pytest.raises(ClientError, match='ValidationException.*path'):
test_table_s.update_item(Key={'p': p}, UpdateExpression='SET b[0] = :val1',
ExpressionAttributeValues={':val1': 7})
# HOWEVER, it *is* allowed to remove a random path from a non-
# existent item. I don't know why... See test_nested_attribute_remove_from_missing_item
# DynamoDB does not allow empty sets.
# Trying to ask UpdateItem to put one of these in an attribute should be
@@ -921,4 +1046,27 @@ def test_update_expression_empty_attribute(test_table_s):
UpdateExpression='SET d = :v1, e = :v2, f = :v3, g = :v4',
ExpressionAttributeValues={':v1': [], ':v2': {}, ':v3': '', ':v4': b''})
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item'] == {'p': p, 'd': [], 'e': {}, 'f': '', 'g': b''}
#
# Verify which kind of update operations require a read-before-write (a.k.a
# read-modify-write, or RMW). We test this by using a table configured with
# "forbid_rmw" isolation mode and checking which writes succeed or pass.
# This is a Scylla-only test (the test_table_s_forbid_rmw implies scylla_only).
def test_update_expression_when_rmw(test_table_s_forbid_rmw):
table = test_table_s_forbid_rmw
p = random_string()
# A write with a RHS (right-hand side) being a constant from the query
# doesn't need RMW:
table.update_item(Key={'p': p},
UpdateExpression='SET a = :val',
ExpressionAttributeValues={':val': 3})
# But if the LHS (left-hand side) of the assignment is a document path,
# it *does* need RMW:
with pytest.raises(ClientError, match='ValidationException.*write isolation policy'):
table.update_item(Key={'p': p},
UpdateExpression='SET a.b = :val',
ExpressionAttributeValues={':val': 3})
assert table.get_item(Key={'p': p}, ConsistentRead=True)['Item']['a'] == 3
# A write with a path in the RHS of a SET also needs RMW
with pytest.raises(ClientError, match='ValidationException.*write isolation policy'):
table.update_item(Key={'p': p},
UpdateExpression='SET b = a')
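The rule the test above checks can be summarized as: an assignment whose right-hand side is a `:value` placeholder needs no read-before-write, while a document path on the left or an attribute reference on the right does. A rough sketch of that classification (a hypothetical helper, not Scylla's actual expression parser):

```python
def needs_rmw(set_clause):
    """set_clause: a single 'lhs = rhs' assignment from a SET expression.
    Returns True if the write requires read-modify-write."""
    lhs, rhs = (s.strip() for s in set_clause.split('=', 1))
    # A path on the left ('.' or '[') updates inside an existing
    # document, so the item must be read first.
    if '.' in lhs or '[' in lhs:
        return True
    # A right-hand side that is not a :value placeholder references an
    # existing attribute, which also requires a read.
    return not rhs.startswith(':')
```

Under forbid_rmw isolation, clauses for which this returns True are the ones rejected with a write-isolation-policy error.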


@@ -0,0 +1,165 @@
/*
* Copyright (C) 2021 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#define BOOST_TEST_MODULE core
#include <boost/test/unit_test.hpp>
#include <vector>
#include "cdc/generation.hh"
#include "test/lib/random_utils.hh"
namespace cdc {
size_t limit_of_streams_in_topology_description();
topology_description limit_number_of_streams_if_needed(topology_description&& desc);
} // namespace cdc
static cdc::topology_description create_description(const std::vector<size_t>& streams_count_per_vnode) {
std::vector<cdc::token_range_description> result;
result.reserve(streams_count_per_vnode.size());
size_t vnode_index = 0;
int64_t token = std::numeric_limits<int64_t>::min() + 100;
for (size_t streams_count : streams_count_per_vnode) {
std::vector<cdc::stream_id> streams(streams_count);
token += 500;
for (size_t idx = 0; idx < streams_count; ++idx) {
streams[idx] = cdc::stream_id{dht::token::from_int64(token), vnode_index};
token += 100;
}
token += 10000;
// sharding_ignore_msb should not matter for limit_number_of_streams_if_needed
// so we're using sharding_ignore_msb equal to 12.
result.push_back(
cdc::token_range_description{dht::token::from_int64(token), std::move(streams), uint8_t{12}});
++vnode_index;
}
return cdc::topology_description(std::move(result));
}
static void assert_streams_count(const cdc::topology_description& desc, const std::vector<size_t>& expected_count) {
BOOST_REQUIRE_EQUAL(expected_count.size(), desc.entries().size());
for (size_t idx = 0; idx < expected_count.size(); ++idx) {
BOOST_REQUIRE_EQUAL(expected_count[idx], desc.entries()[idx].streams.size());
}
}
static void assert_stream_ids_in_right_token_ranges(const cdc::topology_description& desc) {
dht::token start = desc.entries().back().token_range_end;
dht::token end = desc.entries().front().token_range_end;
for (auto& stream : desc.entries().front().streams) {
dht::token t = stream.token();
if (t > end) {
BOOST_REQUIRE(start < t);
} else {
BOOST_REQUIRE(t <= end);
}
}
for (size_t idx = 1; idx < desc.entries().size(); ++idx) {
for (auto& stream : desc.entries()[idx].streams) {
BOOST_REQUIRE(desc.entries()[idx - 1].token_range_end < stream.token());
BOOST_REQUIRE(stream.token() <= desc.entries()[idx].token_range_end);
}
}
}
cdc::stream_id get_stream(const std::vector<cdc::token_range_description>& entries, dht::token tok);
static void assert_random_tokens_mapped_to_streams_with_tokens_in_the_same_token_range(const cdc::topology_description& desc) {
for (size_t count = 0; count < 100; ++count) {
int64_t token_value = tests::random::get_int(std::numeric_limits<int64_t>::min(), std::numeric_limits<int64_t>::max());
dht::token t = dht::token::from_int64(token_value);
auto stream = get_stream(desc.entries(), t);
auto& e = desc.entries().at(stream.index());
BOOST_REQUIRE(std::find(e.streams.begin(), e.streams.end(), stream) != e.streams.end());
if (stream.index() != 0) {
BOOST_REQUIRE(t <= e.token_range_end);
BOOST_REQUIRE(t > desc.entries().at(stream.index() - 1).token_range_end);
}
}
}
BOOST_AUTO_TEST_CASE(test_cdc_generation_limitting_single_vnode_should_not_limit) {
cdc::topology_description given = create_description({cdc::limit_of_streams_in_topology_description()});
cdc::topology_description result = cdc::limit_number_of_streams_if_needed(std::move(given));
assert_streams_count(result, {cdc::limit_of_streams_in_topology_description()});
assert_stream_ids_in_right_token_ranges(result);
assert_random_tokens_mapped_to_streams_with_tokens_in_the_same_token_range(result);
}
BOOST_AUTO_TEST_CASE(test_cdc_generation_limitting_single_vnode_should_limit) {
cdc::topology_description given = create_description({cdc::limit_of_streams_in_topology_description() + 1});
cdc::topology_description result = cdc::limit_number_of_streams_if_needed(std::move(given));
assert_streams_count(result, {cdc::limit_of_streams_in_topology_description()});
assert_stream_ids_in_right_token_ranges(result);
assert_random_tokens_mapped_to_streams_with_tokens_in_the_same_token_range(result);
}
BOOST_AUTO_TEST_CASE(test_cdc_generation_limitting_multiple_vnodes_should_not_limit) {
size_t total = 0;
std::vector<size_t> streams_count_per_vnode;
size_t count_for_next_vnode = 1;
while (total + count_for_next_vnode <= cdc::limit_of_streams_in_topology_description()) {
streams_count_per_vnode.push_back(count_for_next_vnode);
total += count_for_next_vnode;
++count_for_next_vnode;
}
cdc::topology_description given = create_description(streams_count_per_vnode);
cdc::topology_description result = cdc::limit_number_of_streams_if_needed(std::move(given));
assert_streams_count(result, streams_count_per_vnode);
assert_stream_ids_in_right_token_ranges(result);
assert_random_tokens_mapped_to_streams_with_tokens_in_the_same_token_range(result);
}
BOOST_AUTO_TEST_CASE(test_cdc_generation_limitting_multiple_vnodes_should_limit) {
size_t total = 0;
std::vector<size_t> streams_count_per_vnode;
size_t count_for_next_vnode = 1;
while (total + count_for_next_vnode <= cdc::limit_of_streams_in_topology_description()) {
streams_count_per_vnode.push_back(count_for_next_vnode);
total += count_for_next_vnode;
++count_for_next_vnode;
}
streams_count_per_vnode.push_back(cdc::limit_of_streams_in_topology_description() - total + 1);
cdc::topology_description given = create_description(streams_count_per_vnode);
cdc::topology_description result = cdc::limit_number_of_streams_if_needed(std::move(given));
assert(streams_count_per_vnode.size() <= cdc::limit_of_streams_in_topology_description());
size_t per_vnode_limit = cdc::limit_of_streams_in_topology_description() / streams_count_per_vnode.size();
for (auto& count : streams_count_per_vnode) {
count = std::min(count, per_vnode_limit);
}
assert_streams_count(result, streams_count_per_vnode);
assert_stream_ids_in_right_token_ranges(result);
assert_random_tokens_mapped_to_streams_with_tokens_in_the_same_token_range(result);
}
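The clamping the last test expects can be stated as a small model (a simplified Python sketch under the assumptions visible in the test: integer division, no redistribution of unused budget): when the total stream count exceeds the global limit, each vnode is capped at `limit // number_of_vnodes`.

```python
def limit_streams(counts, limit):
    """counts: streams per vnode; limit: global cap on total streams."""
    if sum(counts) <= limit:
        return list(counts)           # under the limit: leave untouched
    per_vnode = limit // len(counts)  # equal per-vnode budget
    return [min(c, per_vnode) for c in counts]
```

This mirrors the expected-count computation in test_cdc_generation_limitting_multiple_vnodes_should_limit above.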


@@ -252,29 +252,46 @@ SEASTAR_THREAD_TEST_CASE(test_disallow_cdc_on_materialized_view) {
SEASTAR_THREAD_TEST_CASE(test_permissions_of_cdc_description) {
do_with_cql_env_thread([] (cql_test_env& e) {
auto test_table = [&e] (const sstring& table_name) {
auto assert_unauthorized = [&e] (const sstring& stmt) {
testlog.info("Must throw unauthorized_exception: {}", stmt);
BOOST_REQUIRE_THROW(e.execute_cql(stmt).get(), exceptions::unauthorized_exception);
};
e.require_table_exists("system_distributed", table_name).get();
const sstring full_name = "system_distributed." + table_name;
// Allow MODIFY, SELECT
e.execute_cql(format("INSERT INTO {} (time) VALUES (toTimeStamp(now()))", full_name)).get();
e.execute_cql(format("UPDATE {} SET expired = toTimeStamp(now()) WHERE time = toTimeStamp(now())", full_name)).get();
e.execute_cql(format("DELETE FROM {} WHERE time = toTimeStamp(now())", full_name)).get();
e.execute_cql(format("SELECT * FROM {}", full_name)).get();
// Disallow ALTER, DROP
assert_unauthorized(format("ALTER TABLE {} ALTER time TYPE blob", full_name));
assert_unauthorized(format("DROP TABLE {}", full_name));
auto assert_unauthorized = [&e] (const sstring& stmt) {
testlog.info("Must throw unauthorized_exception: {}", stmt);
BOOST_REQUIRE_THROW(e.execute_cql(stmt).get(), exceptions::unauthorized_exception);
};
test_table("cdc_streams_descriptions");
test_table("cdc_generation_descriptions");
auto full_name = [] (const sstring& table_name) {
return "system_distributed." + table_name;
};
const sstring generations = "cdc_generation_descriptions";
const sstring streams = "cdc_streams_descriptions_v2";
const sstring timestamps = "cdc_generation_timestamps";
for (auto& t : {generations, streams, timestamps}) {
e.require_table_exists("system_distributed", t).get();
// Disallow DROP
assert_unauthorized(format("DROP TABLE {}", full_name(t)));
// Allow SELECT
e.execute_cql(format("SELECT * FROM {}", full_name(t))).get();
}
// Disallow ALTER
for (auto& t : {generations, streams}) {
assert_unauthorized(format("ALTER TABLE {} ALTER time TYPE blob", full_name(t)));
}
assert_unauthorized(format("ALTER TABLE {} ALTER key TYPE blob", full_name(timestamps)));
// Allow DELETE
for (auto& t : {generations, streams}) {
e.execute_cql(format("DELETE FROM {} WHERE time = toTimeStamp(now())", full_name(t))).get();
}
e.execute_cql(format("DELETE FROM {} WHERE key = 'timestamps'", full_name(timestamps))).get();
// Allow UPDATE, INSERT
e.execute_cql(format("UPDATE {} SET expired = toTimeStamp(now()) WHERE time = toTimeStamp(now())", full_name(generations))).get();
e.execute_cql(format("INSERT INTO {} (time) VALUES (toTimeStamp(now()))", full_name(generations))).get();
e.execute_cql(format("INSERT INTO {} (time, range_end) VALUES (toTimeStamp(now()), 0)", full_name(streams))).get();
e.execute_cql(format("UPDATE {} SET expired = toTimeStamp(now()) WHERE key = 'timestamps' AND time = toTimeStamp(now())", full_name(timestamps))).get();
}).get();
}


@@ -166,7 +166,7 @@ SEASTAR_TEST_CASE(test_multishard_writer_producer_aborts) {
namespace {
class bucket_writer {
class test_bucket_writer {
schema_ptr _schema;
classify_by_timestamp _classify;
std::unordered_map<int64_t, std::vector<mutation>>& _buckets;
@@ -175,6 +175,17 @@ class bucket_writer {
mutation_opt _current_mutation;
bool _is_first_mutation = true;
size_t _throw_after;
size_t _mutation_consumed = 0;
public:
class expected_exception : public std::exception {
public:
virtual const char* what() const noexcept override {
return "expected_exception";
}
};
private:
void check_timestamp(api::timestamp_type ts) {
const auto bucket_id = _classify(ts);
@@ -223,40 +234,53 @@ private:
check_timestamp(rt.tomb.timestamp);
}
void maybe_throw() {
if (_mutation_consumed++ >= _throw_after) {
throw(expected_exception());
}
}
public:
bucket_writer(schema_ptr schema, classify_by_timestamp classify, std::unordered_map<int64_t, std::vector<mutation>>& buckets)
test_bucket_writer(schema_ptr schema, classify_by_timestamp classify, std::unordered_map<int64_t, std::vector<mutation>>& buckets, size_t throw_after = std::numeric_limits<size_t>::max())
: _schema(std::move(schema))
, _classify(std::move(classify))
, _buckets(buckets) {
}
, _buckets(buckets)
, _throw_after(throw_after)
{ }
void consume_new_partition(const dht::decorated_key& dk) {
maybe_throw();
BOOST_REQUIRE(!_current_mutation);
_current_mutation = mutation(_schema, dk);
}
void consume(tombstone partition_tombstone) {
maybe_throw();
BOOST_REQUIRE(_current_mutation);
verify_partition_tombstone(partition_tombstone);
_current_mutation->partition().apply(partition_tombstone);
}
stop_iteration consume(static_row&& sr) {
maybe_throw();
BOOST_REQUIRE(_current_mutation);
verify_static_row(sr);
_current_mutation->apply(mutation_fragment(*_schema, tests::make_permit(), std::move(sr)));
return stop_iteration::no;
}
stop_iteration consume(clustering_row&& cr) {
maybe_throw();
BOOST_REQUIRE(_current_mutation);
verify_clustering_row(cr);
_current_mutation->apply(mutation_fragment(*_schema, tests::make_permit(), std::move(cr)));
return stop_iteration::no;
}
stop_iteration consume(range_tombstone&& rt) {
maybe_throw();
BOOST_REQUIRE(_current_mutation);
verify_range_tombstone(rt);
_current_mutation->apply(mutation_fragment(*_schema, tests::make_permit(), std::move(rt)));
return stop_iteration::no;
}
stop_iteration consume_end_of_partition() {
maybe_throw();
BOOST_REQUIRE(_current_mutation);
BOOST_REQUIRE(_bucket_id);
auto& bucket = _buckets[*_bucket_id];
@@ -311,7 +335,7 @@ SEASTAR_THREAD_TEST_CASE(test_timestamp_based_splitting_mutation_writer) {
auto consumer = [&] (flat_mutation_reader bucket_reader) {
return do_with(std::move(bucket_reader), [&] (flat_mutation_reader& rd) {
return rd.consume(bucket_writer(random_schema.schema(), classify_fn, buckets), db::no_timeout);
return rd.consume(test_bucket_writer(random_schema.schema(), classify_fn, buckets), db::no_timeout);
});
};
@@ -342,3 +366,53 @@ SEASTAR_THREAD_TEST_CASE(test_timestamp_based_splitting_mutation_writer) {
}
}
SEASTAR_THREAD_TEST_CASE(test_timestamp_based_splitting_mutation_writer_abort) {
auto random_spec = tests::make_random_schema_specification(
get_name(),
std::uniform_int_distribution<size_t>(1, 4),
std::uniform_int_distribution<size_t>(2, 4),
std::uniform_int_distribution<size_t>(2, 8),
std::uniform_int_distribution<size_t>(2, 8));
auto random_schema = tests::random_schema{tests::random::get_int<uint32_t>(), *random_spec};
testlog.info("Random schema:\n{}", random_schema.cql());
auto ts_gen = [&, underlying = tests::default_timestamp_generator()] (std::mt19937& engine,
tests::timestamp_destination ts_dest, api::timestamp_type min_timestamp) -> api::timestamp_type {
if (ts_dest == tests::timestamp_destination::partition_tombstone ||
ts_dest == tests::timestamp_destination::row_marker ||
ts_dest == tests::timestamp_destination::row_tombstone ||
ts_dest == tests::timestamp_destination::collection_tombstone) {
if (tests::random::get_int<int>(0, 10, engine)) {
return api::missing_timestamp;
}
}
return underlying(engine, ts_dest, min_timestamp);
};
auto muts = tests::generate_random_mutations(random_schema, ts_gen).get0();
auto classify_fn = [] (api::timestamp_type ts) {
return int64_t(ts % 2);
};
std::unordered_map<int64_t, std::vector<mutation>> buckets;
int throw_after = tests::random::get_int(muts.size() - 1);
testlog.info("Will raise exception after {}/{} mutations", throw_after, muts.size());
auto consumer = [&] (flat_mutation_reader bucket_reader) {
return do_with(std::move(bucket_reader), [&] (flat_mutation_reader& rd) {
return rd.consume(test_bucket_writer(random_schema.schema(), classify_fn, buckets, throw_after), db::no_timeout);
});
};
try {
segregate_by_timestamp(flat_mutation_reader_from_mutations(tests::make_permit(), muts), classify_fn, std::move(consumer)).get();
} catch (const test_bucket_writer::expected_exception&) {
BOOST_TEST_PASSPOINT();
} catch (const seastar::broken_promise&) {
// Tolerated until we properly abort readers
BOOST_TEST_PASSPOINT();
}
}


@@ -134,7 +134,7 @@ SEASTAR_TEST_CASE(test_querying_with_consumer) {
auto& db = e.local_db();
auto s = db.find_schema("ks", "cf");
e.local_qp().query("SELECT * from ks.cf", [&counter] (const cql3::untyped_result_set::row& row) mutable {
e.local_qp().query_internal("SELECT * from ks.cf", [&counter] (const cql3::untyped_result_set::row& row) mutable {
counter++;
return make_ready_future<stop_iteration>(stop_iteration::no);
}).get();
@@ -145,7 +145,7 @@ SEASTAR_TEST_CASE(test_querying_with_consumer) {
total += i;
e.local_qp().execute_internal("insert into ks.cf (k , v) values (?, ? );", { to_sstring(i), i}).get();
}
e.local_qp().query("SELECT * from ks.cf", [&counter, &sum] (const cql3::untyped_result_set::row& row) mutable {
e.local_qp().query_internal("SELECT * from ks.cf", [&counter, &sum] (const cql3::untyped_result_set::row& row) mutable {
counter++;
sum += row.get_as<int>("v");
return make_ready_future<stop_iteration>(stop_iteration::no);
@@ -158,7 +158,7 @@ SEASTAR_TEST_CASE(test_querying_with_consumer) {
total += i;
e.local_qp().execute_internal("insert into ks.cf (k , v) values (?, ? );", { to_sstring(i), i}).get();
}
e.local_qp().query("SELECT * from ks.cf", [&counter, &sum] (const cql3::untyped_result_set::row& row) mutable {
e.local_qp().query_internal("SELECT * from ks.cf", [&counter, &sum] (const cql3::untyped_result_set::row& row) mutable {
counter++;
sum += row.get_as<int>("v");
return make_ready_future<stop_iteration>(stop_iteration::no);
@@ -166,7 +166,7 @@ SEASTAR_TEST_CASE(test_querying_with_consumer) {
BOOST_CHECK_EQUAL(counter, 2200);
BOOST_CHECK_EQUAL(total, sum);
counter = 1000;
e.local_qp().query("SELECT * from ks.cf", [&counter] (const cql3::untyped_result_set::row& row) mutable {
e.local_qp().query_internal("SELECT * from ks.cf", [&counter] (const cql3::untyped_result_set::row& row) mutable {
counter++;
if (counter == 1010) {
return make_ready_future<stop_iteration>(stop_iteration::yes);


@@ -911,8 +911,20 @@ SEASTAR_TEST_CASE(test_eviction_from_invalidated) {
std::vector<sstring> tmp;
auto alloc_size = logalloc::segment_size * 10;
while (tracker.region().occupancy().total_space() > alloc_size) {
tmp.push_back(uninitialized_string(alloc_size));
/*
* Now allocate huge chunks on the region until it gives up
* with bad_alloc. At that point the region must not have more
* memory than the chunk size, neither it must contain rows
* or partitions (except for dummy entries)
*/
try {
while (true) {
tmp.push_back(uninitialized_string(alloc_size));
}
} catch (const std::bad_alloc&) {
BOOST_REQUIRE(tracker.region().occupancy().total_space() < alloc_size);
BOOST_REQUIRE(tracker.get_stats().partitions == 0);
BOOST_REQUIRE(tracker.get_stats().rows == 0);
}
});
}


@@ -49,6 +49,16 @@ def testTimeuuid(cql, test_keyspace):
for i in range(4):
uuid = rows[i][1]
datetime = datetime_from_uuid1(uuid)
# Before comparing this datetime to the result of dateOf(), we
# must truncate the resolution of datetime to milliseconds.
# The problem is that the dateOf(timeuuid) CQL function converts a
# timeuuid to CQL's "timestamp" type, which has millisecond
# resolution, but datetime *may* have finer resolution. It will
# usually be whole milliseconds, because this is what the now()
# implementation usually does, but when now() is called more than
# once per millisecond, it *may* start incrementing the sub-
# millisecond part.
datetime = datetime.replace(microsecond=datetime.microsecond//1000*1000)
timestamp = round(datetime.replace(tzinfo=timezone.utc).timestamp() * 1000)
assert_rows(execute(cql, table, "SELECT dateOf(t), unixTimestampOf(t) FROM %s WHERE k = 0 AND t = ?", rows[i][1]),
[datetime, timestamp])
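The truncation used above, shown standalone: a datetime derived from a timeuuid may carry sub-millisecond microseconds, while CQL's "timestamp" type only has millisecond resolution, so the microsecond field is rounded down to a whole millisecond before comparing.

```python
from datetime import datetime

def truncate_to_millis(dt):
    # Drop the sub-millisecond part of the microsecond field.
    return dt.replace(microsecond=dt.microsecond // 1000 * 1000)
```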


@@ -0,0 +1,46 @@
# Copyright 2021 ScyllaDB
#
# This file is part of Scylla.
#
# Scylla is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# Scylla is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU Affero General Public License
# along with Scylla. If not, see <http://www.gnu.org/licenses/>.
from cassandra.cluster import ConsistencyLevel
from cassandra.query import SimpleStatement
from util import new_test_table
def test_cdc_log_entries_use_cdc_streams(cql, test_keyspace):
'''Test that the stream IDs chosen for CDC log entries come from the CDC generation
whose streams are listed in the streams description table. Since this test is executed
on a single-node cluster, there is only one generation.'''
schema = "pk int primary key"
extra = " with cdc = {'enabled': true}"
with new_test_table(cql, test_keyspace, schema, extra) as table:
stmt = cql.prepare(f"insert into {table} (pk) values (?) using timeout 5m")
for i in range(100):
cql.execute(stmt, [i])
log_stream_ids = set(r[0] for r in cql.execute(f'select "cdc$stream_id" from {table}_scylla_cdc_log'))
# There should be exactly one generation, so we just select the streams
streams_desc = cql.execute(SimpleStatement(
'select streams from system_distributed.cdc_streams_descriptions_v2',
consistency_level=ConsistencyLevel.ONE))
stream_ids = set()
for entry in streams_desc:
stream_ids.update(entry.streams)
assert(log_stream_ids.issubset(stream_ids))


@@ -63,3 +63,20 @@ def test_insert_null_key(cql, table1):
cql.execute(stmt, [s, None])
with pytest.raises(InvalidRequest, match='null value'):
cql.execute(stmt, [None, s])
def test_primary_key_in_null(cql, table1):
'''Tests handling of "key_column in ?" where ? is bound to null.'''
with pytest.raises(InvalidRequest, match='null value'):
cql.execute(cql.prepare(f"SELECT p FROM {table1} WHERE p IN ?"), [None])
with pytest.raises(InvalidRequest, match='null value'):
cql.execute(cql.prepare(f"SELECT p FROM {table1} WHERE p='' AND c IN ?"), [None])
with pytest.raises(InvalidRequest, match='Invalid null value for IN restriction'):
cql.execute(cql.prepare(f"SELECT p FROM {table1} WHERE p='' AND (c) IN ?"), [None])
# Cassandra says "IN predicates on non-primary-key columns (v) is not yet supported".
def test_regular_column_in_null(scylla_only, cql, table1):
'''Tests handling of "regular_column in ?" where ? is bound to null.'''
# Without any rows in the table, SELECT will shortcircuit before evaluating the WHERE clause.
cql.execute(f"INSERT INTO {table1} (p,c) VALUES ('p', 'c')")
with pytest.raises(InvalidRequest, match='null value'):
cql.execute(cql.prepare(f"SELECT v FROM {table1} WHERE v IN ? ALLOW FILTERING"), [None])


@@ -136,7 +136,7 @@ def test_mix_per_query_timeout_with_other_params(scylla_only, cql, table1):
cql.execute(f"INSERT INTO {table} (p,c,v) VALUES ({key},1,1) USING TIMEOUT 60m AND TTL 1000000 AND TIMESTAMP 321")
cql.execute(f"INSERT INTO {table} (p,c,v) VALUES ({key},2,1) USING TIMESTAMP 42 AND TIMEOUT 30m")
res = list(cql.execute(f"SELECT ttl(v), writetime(v) FROM {table} WHERE p = {key} and c = 1"))
assert len(res) == 1 and res[0].ttl_v == 1000000 and res[0].writetime_v == 321
assert len(res) == 1 and res[0].ttl_v > 0 and res[0].writetime_v == 321
res = list(cql.execute(f"SELECT ttl(v), writetime(v) FROM {table} WHERE p = {key} and c = 2"))
assert len(res) == 1 and not res[0].ttl_v and res[0].writetime_v == 42
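The relaxed assertion above reflects that ttl(v) returns the *remaining* time to live, which counts down from the value set at write time, so an exact match against the original TTL is racy. A sketch of that relationship (hypothetical helper):

```python
def remaining_ttl(original_ttl, elapsed_seconds):
    # ttl(v) reports time remaining, decreasing as wall-clock time passes;
    # it never goes below zero (the cell simply expires).
    return max(original_ttl - elapsed_seconds, 0)
```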


@@ -568,7 +568,7 @@ public:
db::system_keyspace::init_local_cache().get();
auto stop_local_cache = defer([] { db::system_keyspace::deinit_local_cache().get(); });
sys_dist_ks.start(std::ref(qp), std::ref(mm)).get();
sys_dist_ks.start(std::ref(qp), std::ref(mm), std::ref(proxy)).get();
service::get_local_storage_service().init_server(service::bind_messaging_port(false)).get();
service::get_local_storage_service().join_cluster().get();


@@ -37,13 +37,14 @@ cql_server::event_notifier::event_notifier(service::migration_notifier& mn) : _m
cql_server::event_notifier::~event_notifier()
{
service::get_local_storage_service().unregister_subscriber(this);
assert(_stopped);
}
future<> cql_server::event_notifier::stop() {
return _mnotifier.unregister_listener(this).then([this]{
return service::get_local_storage_service().unregister_subscriber(this).finally([this] {
_stopped = true;
});
});
}