Compare commits

...

104 Commits

Author SHA1 Message Date
Hagit Segev
f9dd8608eb release: prepare for 4.0.4 2020-07-14 14:10:39 +03:00
Avi Kivity
24a80cbf47 Update seastar submodule
* seastar a73b92ff2e...4ee384e15f (2):
  > futures: Add a test for a broken promise in a parallel_for_each
  > future: Call set_to_broken_promise earlier

Fixes #6749 (probably)
2020-07-13 20:32:27 +03:00
Dmitry Kropachev
6e4edc97ad dist/common/scripts/scylla-housekeeping: wrap urllib.request with try ... except
We could hit "cannot serialize '_io.BufferedReader' object" when a request gets a 404 error from the server.
	Now a legitimate error message is reported in that case.
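
A minimal sketch of the pattern, not the script's actual code: `fetch_text` and its `opener` parameter are hypothetical names. The urllib error is caught and re-raised as a plain exception that carries only a string, so no open stream travels with it.

```python
import urllib.error
import urllib.request


def fetch_text(url, opener=urllib.request.urlopen):
    """Return the response body, or raise RuntimeError with a readable message."""
    try:
        with opener(url) as response:
            return response.read().decode("utf-8")
    except urllib.error.HTTPError as e:
        # HTTPError is file-like and wraps the response stream; keep only text.
        raise RuntimeError(f"request to {url} failed: HTTP {e.code} {e.reason}") from None
    except urllib.error.URLError as e:
        # Connection-level failures (DNS, refused connection, ...).
        raise RuntimeError(f"request to {url} failed: {e.reason}") from None
```

HTTPError is caught first because it is a subclass of URLError.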

	Fixes #6690

(cherry picked from commit de82b3efae)
2020-07-09 18:25:35 +03:00
Dejan Mircevski
81df28b6f3 cql/restrictions: Handle WHERE a>0 AND a<0
WHERE clauses with start point above the end point were handled
incorrectly.  When the slice bounds are transformed to interval
bounds, the resulting interval is interpreted as wrap-around (because
start > end), so it contains all values above 0 and all values below
0.  This is clearly incorrect, as the user's intent was to filter out
all possible values of a.

Fix it by explicitly short-circuiting to false when start > end.  Add
a test case.
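
A small sketch of the two readings, assuming plain integer bounds (the real code compares bound positions, so this uses a>5 AND a<1 as the example). Both function names are hypothetical.

```python
def wraparound_contains(value, start, end):
    """Buggy reading: start > end means 'above start OR below end'."""
    if start > end:
        return value > start or value < end
    return start < value < end


def slice_contains(value, start, end):
    """Fixed reading: an inverted slice matches nothing."""
    if start > end:
        # Short-circuit to false instead of treating it as wrap-around.
        return False
    return start < value < end
```

With `start=5, end=1`, the first function accepts 7 while the second rejects every value.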

Fixes #5799.

Tests: unit (dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
(cherry picked from commit 921dbd0978)
2020-07-08 13:25:06 +03:00
Juliusz Stasiewicz
ea6620e9eb counters: Read the state under timeout
Counter update is a RMW operation. Until now the "Read" part was
not guarded by a timeout, which is changed in this patch.

Fixes #5069

(cherry picked from commit e04fd9f774)
2020-07-07 20:45:26 +03:00
Takuya ASADA
19be84dafd scylla_setup: don't add same disk device twice
We shouldn't accept adding same disk twice for RAID prompt.

Fixes #6711

(cherry picked from commit 835e76fdfc)
2020-07-07 13:08:36 +03:00
Pavel Emelyanov
2ff897d351 main: Keep feature_service for storage_proxy
Fixes #6250

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200423165608.32419-1-xemul@scylladb.com>
(cherry picked from commit 98635b74a6)
2020-07-07 12:42:52 +03:00
Botond Dénes
8fc3300739 sstables: sstable_reader: fix read range upper bound calculation for reverse slices
The single-key sstable reader uses the clustering ranges from the slice
to determine the upper bound of the disk read-range using the index.
For this it simply uses the end bound of the last clustering range. For
reverse reads however the clustering ranges in the slice are in reverse
order, so this will in fact be the upper bound of the smallest range.
Depending on whether the distance between the clustering range is big
enough for the sstable reader to use the index to skip between them,
this will lead to either reading too little data or an assert failure.

This patch fixes the problematic function `get_slice_upper_bound()` to
consider reverse reads as well.
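
The idea can be sketched as follows. This is a Python analogue under simplifying assumptions (ranges as ordered, non-overlapping `(start, end)` tuples), not the actual `get_slice_upper_bound()`; `slice_upper_bound` is a hypothetical name.

```python
def slice_upper_bound(clustering_ranges, is_reversed):
    """Return the largest end bound of the slice, or None for an empty slice.

    clustering_ranges is in slice order: ascending for forward reads,
    descending for reverse reads.
    """
    if not clustering_ranges:
        return None
    # In a reversed slice the range with the largest bounds comes first.
    highest = clustering_ranges[0] if is_reversed else clustering_ranges[-1]
    return highest[1]
```

Taking the last range unconditionally would return 3 instead of 12 for the reversed slice `[(10, 12), (1, 3)]`.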

Initially I thought there would be more mishandling of reverse slices,
but actually `mutation_fragment_filter`, the component doing the actual
slicing of rows, is already reverse-slice aware.

A unit test which reproduces the assert failure is also added.

Fixes: #6171

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200507114956.271799-1-bdenes@scylladb.com>
(cherry picked from commit 791acc7f38)
2020-07-05 16:02:15 +03:00
Raphael S. Carvalho
d2ac7d4b18 compaction: Fix partition estimation with TWCS interposer
Max and min windows are microsecond timestamps, which should be divided
by window size in microseconds to properly estimate window count
based on provided mutation_source_metadata.
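
The arithmetic can be sketched like this. The function name is hypothetical and the inclusive `+ 1` is an assumption; the commit only states that both timestamps must be divided by the window size in microseconds.

```python
def estimated_window_count(min_timestamp_us, max_timestamp_us, window_size_us):
    """Estimate how many time windows a microsecond timestamp range spans."""
    min_window = min_timestamp_us // window_size_us
    max_window = max_timestamp_us // window_size_us
    return max_window - min_window + 1
```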

Found this problem after properly setting mutation_source_metadata with
min and max metadata on behalf of regular compaction.

Fixes #6214.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200409194235.6004-2-raphaelsc@scylladb.com>
(cherry picked from commit 3edff36cd2)
2020-07-05 15:27:40 +03:00
Avi Kivity
61706a6789 Update seastar submodule
* seastar 0dc0fec831...a73b92ff2e (1):
  > rpc::compressor: Fix static init fiasco with names

Fixes #5963
2020-07-02 18:08:52 +03:00
Piotr Sarna
65aa531010 db: set gc grace period to 0 for local system tables
Local system tables from `system` namespace use LocalStrategy
replication, so they do not need to be concerned about gc grace
period. Some system tables already set gc grace period to 0,
but other ones, including system.large_partitions, did not.
That may result in millions of tombstones being needlessly
kept for these tables, which can cause read timeouts.

Fixes #6325
Tests: unit(dev), local(running cqlsh and playing with system tables)

(cherry picked from commit bf5f247bc5)
2020-07-01 13:13:57 +03:00
Benny Halevy
4bffd0f522 api: storage_service: serialize true_snapshot_size
Following up on 91b71a0b1a
We also need to serialize storage_service::true_snapshots_size
with snapshot-modifying operations.

It seems like it was assumed that get_snapshot_details
is done under run_snapshot_list_operation, but the one called
here is the table method, not the api::storage_service::get_snapshot_details.

Fixes #5603

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200506115732.483966-1-bhalevy@scylladb.com>
(cherry picked from commit 682fb3acfd)
2020-07-01 13:09:43 +03:00
Rafael Ávila de Espíndola
9409fc7290 gms: Don't keep references to reallocated vector entries
These callbacks can block a seastar thread and the underlying vector
can be reallocated concurrently.

This is no different than if it was a plain std::vector and the
solution is similar: use values instead of references.

Fixes #6230

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200422182304.120906-1-espindola@scylladb.com>
(cherry picked from commit d8555513a9)
2020-07-01 12:58:56 +03:00
Pavel Solodovnikov
86faf1b3ca cql3: avoid using shared_ptr's in unrecognized_entity_exception
Using shared_ptr's in `unrecognized_entity_exception` can lead
to cross-cpu deletion of a pointer which will trigger an assert
`_cpu == std::this_thread::get_id()' when shared_ptr is disposed.

Copy `column_identifier` to the exception object and avoid using
an instance of `cql3::relation`: just get a string representation
from it since nothing more is used in associated exception
handling code.

Fixes: #6287
Tests: unit(dev, debug), dtest(lwt_destructive_ddl_test.py:LwtDestructiveDDLTest.test_rename_column)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200506155714.150497-1-pa.solodovnikov@scylladb.com>
(cherry picked from commit 1d3f9174c5)
2020-07-01 12:54:09 +03:00
Raphael S. Carvalho
426295bda9 compaction: Fix the 2x disk space requirement in SSTable upgrade
SSTable upgrade is requiring 2x the space of input SSTables because
we aren't releasing references of the SSTables that were already
upgraded. So if we're upgrading 1TB, it means that up to 2TB may be
required for the upgrade operation to succeed.

That can be fixed by moving all input SSTables when rewrite_sstables()
asks for the set of SSTables to be compacted, so allowing their space
to be released as soon as there is no longer any ref to them.

Spotted while auditing code.

Fixes #6682.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200619205701.92891-1-raphaelsc@scylladb.com>
(cherry picked from commit 52180f91d4)
2020-07-01 12:37:38 +03:00
Raphael S. Carvalho
c6fde0e562 cql3: don't reset default TTL when not explicitly specified in alter table statement
Any alter table statement that doesn't explicitly set the default time
to live will reset it to 0.

That can be very dangerous for time series use cases, which rely on
all data being eventually expired, and a default TTL of 0 means
data never being expired.
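
The intended behavior, sketched with table properties as a dictionary. `apply_alter` is a hypothetical helper, not code from cql3.

```python
def apply_alter(current_props, alter_props):
    """Return table properties after ALTER TABLE ... WITH <alter_props>."""
    updated = dict(current_props)
    # Only properties named in the statement change; the rest keep their value.
    updated.update(alter_props)
    return updated
```

A statement that only sets a comment leaves `default_time_to_live` untouched.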

Fixes #5048.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200402211653.25603-1-raphaelsc@scylladb.com>
(cherry picked from commit 044f80b1b5)
2020-06-30 19:28:50 +03:00
Avi Kivity
d9f9e7455b Merge "Fix handling of decimals with negative scales" from Rafael
"
Before this series scylla would effectively infinite loop when, for
example, casting a decimal with a negative scale to float.

Fixes #6720
"

* 'espindola/fix-decimal-issue' of https://github.com/espindola/scylla:
  big_decimal: Add a test for a corner case
  big_decimal: Correctly handle negative scales
  big_decimal: Add a as_rational member function
  big_decimal: Move constructors out of line
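
A sketch of the negative-scale case. It borrows the `as_rational` name from the series but is a Python analogue, and it assumes a decimal is stored as `unscaled_value * 10**(-scale)`.

```python
from fractions import Fraction


def as_rational(unscaled_value, scale):
    """Exact value of a decimal stored as unscaled_value * 10**(-scale)."""
    if scale >= 0:
        return Fraction(unscaled_value, 10 ** scale)
    # Negative scale: the value is an integer multiple of a power of ten.
    return Fraction(unscaled_value * 10 ** (-scale))
```

Converting through an exact rational avoids any loop over the scale: `float(as_rational(3, -2))` is a single division.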

(cherry picked from commit 3e2eeec83a)
2020-06-29 12:26:06 +03:00
Piotr Sarna
e95bcd0f8f alternator: fix propagating tags
Updating tags was erroneously done locally, which means that
the schema change was not propagated to other nodes.
The new code announces new schema globally.

Fixes #6513
Branches: 4.0,4.1
Tests: unit(dev)
       dtest(alternator_tests.AlternatorTest.test_update_condition_expression_and_write_isolation)
Message-Id: <3a816c4ecc33c03af4f36e51b11f195c231e7ce1.1592935039.git.sarna@scylladb.com>

(cherry picked from commit f4e8cfe03b)
2020-06-24 14:10:36 +03:00
Asias He
2ff6e2e122 streaming: Do not send end of stream in case of error
Current sender sends stream_mutation_fragments_cmd::end_of_stream to
receiver when an error is received from a peer node. To be safe, send
stream_mutation_fragments_cmd::error instead of
stream_mutation_fragments_cmd::end_of_stream to prevent end_of_stream from
being written into the sstable when a partition is not closed yet.

In addition, use mutation_fragment_stream_validator to validate the
mutation fragments emitted from the reader, e.g., check if
partition_start and partition_end are paired when the reader is done. If
not, fail the stream session and send
stream_mutation_fragments_cmd::error instead of
stream_mutation_fragments_cmd::end_of_stream to isolate the problematic
sstables on the sender node.
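
The pairing check can be sketched as below. This is an illustration with fragments as plain strings, not mutation_fragment_stream_validator itself; the function name is hypothetical.

```python
def partitions_are_balanced(fragments):
    """True when every partition_start is closed by a partition_end."""
    open_partition = False
    for fragment in fragments:
        if fragment == "partition_start":
            if open_partition:
                return False  # previous partition was never closed
            open_partition = True
        elif fragment == "partition_end":
            if not open_partition:
                return False  # end without a matching start
            open_partition = False
        elif not open_partition:
            return False  # row outside any partition
    # A partition still open at the end means the stream must fail.
    return not open_partition
```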

Refs: #6478
(cherry picked from commit a521c429e1)
2020-06-23 12:48:01 +03:00
Hagit Segev
1fcf38abd9 release: prepare for 4.0.3 2020-06-21 21:46:49 +03:00
Alejo Sanchez
3375b8b86c lwt: validate before constructing metadata
LWT batch conditions can't span multiple tables.
This was detected in batch_statement::validate() called in ::prepare().
But ::cas_result_set_metadata() was built in the constructor,
causing a bitset assert/crash in a reported scenario.
This patch moves validate() to the constructor before building metadata.

Closes #6332

Tested with https://github.com/scylladb/scylla-dtest/pull/1465

[avi: adjust spelling of exception message to 4.0 spelling]

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
(cherry picked from commit d1521e6721)
2020-06-21 18:22:08 +03:00
Gleb Natapov
586546ab32 cql transport: do not log broken pipe error when a client closes its side of a connection abruptly
Fixes #5661

Message-Id: <20200615075958.GL335449@scylladb.com>
(cherry picked from commit 7ca937778d)
2020-06-21 13:09:10 +03:00
Amnon Heiman
e1d558cb01 api/storage_service.cc: stream result of token_range
The result of the get token range API can become big, which can cause
large allocations and stalls.

This patch replaces the implementation so that it streams the results
using the http stream capabilities instead of serializing and sending
one big buffer.
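
A generic sketch of streaming a JSON array in pieces, assuming the caller writes each chunk to the HTTP response as it arrives. `stream_json_array` is a hypothetical name and this is not the Seastar http API.

```python
import json


def stream_json_array(items):
    """Yield a JSON array piece by piece instead of building one big string."""
    yield "["
    for index, item in enumerate(items):
        if index:
            yield ","
        # Only one element is serialized and held in memory at a time.
        yield json.dumps(item)
    yield "]"
```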

Fixes #6297

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
(cherry picked from commit 7c4562d532)
2020-06-21 12:57:34 +03:00
Avi Kivity
b0a8f396b4 Update seastar submodule
* seastar 447aad8d78...0dc0fec831 (1):
  > membarrier: fix madvise(MADV_DONTNEED) failure and crash with --lock-memory

Fixes #6346.
2020-06-21 12:35:39 +03:00
Rafael Ávila de Espíndola
48e7ee374a configure: Reduce the dynamic linker path size
gdb has a SO_NAME_MAX_PATH_SIZE of 512, so we use that as the path
size.

Fixes: #6494

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200528202741.398695-2-espindola@scylladb.com>
(cherry picked from commit aa778ec152)
2020-06-21 12:27:19 +03:00
Piotr Sarna
3e85ecd1bd alternator: fix the return type of PutItem
Even if there are no attributes to return from PutItem requests,
we should return a valid JSON object, not an empty string.

Fixes #6568
Tests: unit(dev)

(cherry picked from commit 8fc3ca855e)
2020-06-21 12:21:30 +03:00
Piotr Sarna
930a4af8b3 alternator: fix returning UnprocessedKeys unconditionally
Client libraries (e.g. PynamoDB) expect the UnprocessedKeys
and UnprocessedItems attributes to appear in the response
unconditionally - it's hereby added, along with a simple test case.

Fixes #6569
Tests: unit(dev)

(cherry picked from commit 3aff52f56e)
2020-06-21 12:19:34 +03:00
Tomasz Grabiec
6a6d36058a row_cache: Fix undefined behavior on key linearization
This is relevant only when using partition or clustering keys which
have a representation in memory which is larger than 12.8 KB (10% of
LSA segment size).

There are several places in code (cache, background garbage
collection) which may need to linearize keys because of performing key
comparison, but it's not done safely:

 1) the code does not run with the LSA region locked, so pointers may
get invalidated on linearization if it needs to reclaim memory. This
is fixed by running the code inside an allocating section.

 2) LSA region is locked, but the scope of
with_linearized_managed_bytes() encloses the allocating section. If
allocating section needs to reclaim, linearization context will
contain invalidated pointers. The fix is to reorder the scopes so
that linearization context lives within an allocating section.

Example of 1 can be found in
range_populating_reader::handle_end_of_stream() where it performs a
lookup:

  auto prev = std::prev(it);
  if (prev->key().equal(*_cache._schema, *_last_key->_key)) {
     it->set_continuous(true);

but handle_end_of_stream() is not invoked under allocating section.

Example of 2 can be found in mutation_cleaner_impl::merge_some() where
it does:

  return with_linearized_managed_bytes([&] {
  ...
    return _worker_state->alloc_section(region, [&] {

Fixes #6637.
Refs #6108.

Tests:

  - unit (all)

Message-Id: <1592218544-9435-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit e81fc1f095)
2020-06-21 11:58:39 +03:00
Yaron Kaikov
ce57d0174d release: prepare for 4.0.2 2020-06-15 20:52:58 +03:00
Avi Kivity
cd11f210ad tools: toolchain: regenerate for gnutls 3.6.14
CVE-2020-13777.

Fixes #6627.

Toolchain source image registry disambiguated due to tighter podman defaults.
2020-06-15 07:58:31 +03:00
Calle Wilund
1e2e203cf0 gms::inet_address: Fix sign extension error in custom address formatting
Fixes #5808

Seems some gcc:s will generate the code as sign extending. Mine does not,
but this should be more correct anyhow.

Added small stringify test to serialization_test for inet_address

(cherry picked from commit a14a28cdf4)
2020-06-09 20:16:37 +03:00
Takuya ASADA
1a98c93a25 aws: update enhanced networking supported instance list
Sync enhanced networking supported instance list to latest one.

Reference: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/enhanced-networking.html

Fixes #6540

(cherry picked from commit 969c4258cf)
2020-06-09 16:02:27 +03:00
Calle Wilund
4f4845c94c commitlog_test: Ensure "when_over_disk_limit" reads segment list only once
Fixes #6195

test_commitlog_delete_when_over_disk_limit reads current segment list
in flush handler, to compare with result after allowing deletion of
segment. However, it might be called more than once in rare cases,
because of timing and us using rather small sizes.

Reading the list the second time however is not a good idea, because
it might just very well be exactly the same as what we read in the
test check code, and we actually overwrite the list we want to
check against. Because callback is on timer. And test is not.

Message-Id: <20200414114322.13268-1-calle@scylladb.com>
[ penberg: backported fix random failures in commitlog_test ]
(cherry picked from commit a62d75fed5)
2020-06-01 18:41:18 +03:00
Nadav Har'El
ef745e1ce7 alternator: fix support for bytes type in Query's KeyConditions
Our parsing of values in a KeyConditions parameter of Query was done naively.
As a result, we got bizarre error messages "condition not met: false" when
these values had incorrect type (this is issue #6490). Worse - the naive
conversion did not decode base64-encoded bytes value as needed, so
KeyConditions on bytes-typed keys did not work at all.

This patch fixes these bugs by using our existing utility function
get_key_from_typed_value(), which takes care of throwing sensible errors
when types don't match, and decoding base64 as needed.
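
What such a helper has to do can be sketched as follows. This is an illustrative Python analogue, not `get_key_from_typed_value()`; the name `key_from_typed_value` and the integer-only number handling are simplifying assumptions.

```python
import base64


def key_from_typed_value(typed_value, expected_type):
    """Decode a DynamoDB-style typed value such as {"B": "aGVsbG8="}."""
    if len(typed_value) != 1:
        raise ValueError("expected exactly one type tag")
    type_tag, raw = next(iter(typed_value.items()))
    if type_tag != expected_type:
        # A sensible error instead of "condition not met: false".
        raise ValueError(f"type mismatch: expected {expected_type}, got {type_tag}")
    if type_tag == "B":
        return base64.b64decode(raw, validate=True)  # bytes arrive base64-encoded
    if type_tag == "N":
        return int(raw)  # simplification: integers only
    return raw  # "S": plain string
```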

Unfortunately, we didn't have test coverage for many of the KeyConditions
features including bytes keys, which is why this issue escaped detection.
A patch will follow with much more comprehensive tests for KeyConditions,
which also reproduce this issue and verify that it is fixed.

Refs #6490
Fixes #6495

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200524141800.104950-1-nyh@scylladb.com>
(cherry picked from commit 6b38126a8f)
2020-05-31 14:02:18 +03:00
Calle Wilund
ae32aa970a commitlog::read_log_file: Preserve subscription across reading
Fixes #6265

Return type for read_log_file was previously changed from
subscription to future<>, returning the previously returned
subscriptions result of done(). But it did not preserve the
subscription itself, which in turn will cause us to (in
work::stream), call back into a deleted object.

Message-Id: <20200422090856.5218-1-calle@scylladb.com>
(cherry picked from commit 525b283326)
2020-05-25 13:07:33 +03:00
Eliran Sinvani
a3eb12c5f1 Auth: return correct error code when role is not found
Scylla returns the wrong error code (0000 - server internal error)
in response to trying to do authentication/authorization operations
that involve a non-existing role.
This commit changes those cases to return error code 2200 (invalid
query) which is the correct one and also the one that Cassandra
returns.
Tests:
    Unit tests (Dev)
    All auth and auth_role dtests

(cherry picked from commit ce8cebe34801f0ef0e327a32f37442b513ffc214)

Fixes #6363.
2020-05-25 12:58:09 +03:00
Amnon Heiman
b5cedfc177 storage_service: get_range_to_address_map prevent use after free
The implementation of get_range_to_address_map has a default behaviour:
when getting an empty keyspace, it uses the first non-system keyspace
(first here is basically just a keyspace).

The current implementation has two issues. First, it uses a reference to
a string that is held on the stack of another function. In other words,
there's a use after free, and it is not clear why we never hit it.

Second, it calls get_non_system_keyspaces twice. Though this is not
a bug, it's redundant (get_non_system_keyspaces uses a loop, so calling
that function does have a cost).

This patch solves both issues. First, it changes the implementation to
hold a string instead of a reference to a string.

Second, it stores the results from get_non_system_keyspaces and reuses
them, which is more efficient and holds the returned values on the local
stack.

Fixes #6465

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
(cherry picked from commit 69a46d4179)
2020-05-25 12:48:26 +03:00
Hagit Segev
8d9bc57aca release: prepare for 4.0.1 2020-05-24 21:39:44 +03:00
Tomasz Grabiec
1cbda629a2 sstables: index_reader: Fix overflow when calculating promoted index end
When index file is larger than 4GB, offset calculation will overflow
uint32_t and _promoted_index_end will be too small.

As a result, promoted_index_size calculation will underflow and the
rest of the page will be interpreted as a promoted index.

The partitions which are in the remainder of the index page will not
be found by single-partition queries.

Data is not lost.
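
The wrap-around can be illustrated by masking to 32 bits. The variable names and the sample offset are hypothetical; only the uint32_t truncation comes from the description above.

```python
UINT32_MASK = 0xFFFFFFFF
GIB = 1 << 30


def as_uint32(value):
    """Keep only the low 32 bits, as unsigned 32-bit arithmetic would."""
    return value & UINT32_MASK


entry_offset = 5 * GIB  # an offset inside an index file larger than 4 GB
promoted_index_end = as_uint32(entry_offset + 100)
# The truncated end lands at 1 GiB + 100, far below the real offset.
```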

Introduced in 6c5f8e0eda.

Fixes #6040
Message-Id: <20200521174822.8350-1-tgrabiec@scylladb.com>

(cherry picked from commit a6c87a7b9e)
2020-05-24 09:45:55 +03:00
Rafael Ávila de Espíndola
baf0201a6e repair: Make sure sinks are always closed
In a recent next failure I got the following backtrace

    function=function@entry=0x270360 "seastar::rpc::sink_impl<Serializer, Out>::~sink_impl() [with Serializer = netw::serializer; Out = {repair_row_on_wire_with_cmd}]") at assert.c:101
    at ./seastar/include/seastar/core/shared_ptr.hh:463
    at repair/row_level.cc:2059

This patch changes a few functions to use finally to make sure the sink
is always closed.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200515202803.60020-1-espindola@scylladb.com>
(cherry picked from commit 311fbe2f0a)

Ref #6414
2020-05-20 09:00:44 +03:00
Asias He
7dcffb963c repair: Fix race between write_end_of_stream and apply_rows
Consider: n1, n2, n1 is the repair master, n2 is the repair follower.

=== Case 1 ===
1) n1 sends missing rows {r1, r2} to n2
2) n2 runs apply_rows_on_follower to apply rows, e.g., {r1, r2}, r1
   is written to sstable, r2 is not written yet, r1 belongs to
   partition 1, r2 belongs to partition 2. It yields after row r1 is
   written.
   data: partition_start, r1
3) n1 sends repair_row_level_stop to n2 because error has happened on n1
4) n2 calls wait_for_writer_done() which in turn calls write_end_of_stream()
   data: partition_start, r1, partition_end
5) Step 2 resumes to apply the rows.
   data: partition_start, r1, partition_end, partition_end, partition_start, r2

=== Case 2 ===
1) n1 sends missing rows {r1, r2} to n2
2) n2 runs apply_rows_on_follower to apply rows, e.g., {r1, r2}, r1
   is written to sstable, r2 is not written yet, r1 belongs to partition
   1, r2 belongs to partition 2. It yields after partition_start for r2
   is written but before _partition_opened is set to true.
   data: partition_start, r1, partition_end, partition_start
3) n1 sends repair_row_level_stop to n2 because error has happened on n1
4) n2 calls wait_for_writer_done() which in turn calls write_end_of_stream().
   Since _partition_opened[node_idx] is false, partition_end is skipped,
   end_of_stream is written.
   data: partition_start, r1, partition_end, partition_start, end_of_stream

This causes unbalanced partition_start and partition_end in the stream
written to sstables.

To fix, serialize the write_end_of_stream and apply_rows with a semaphore.
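
A sketch of the serialization, with `asyncio.Lock` standing in for the Seastar semaphore. `SstableWriter` and `run_race` are hypothetical, and each row opens its own partition for simplicity.

```python
import asyncio


class SstableWriter:
    def __init__(self):
        self.data = []
        self.partition_opened = False
        self.lock = asyncio.Lock()

    async def apply_rows(self, rows):
        async with self.lock:  # end-of-stream cannot interleave with this loop
            for row in rows:
                if self.partition_opened:
                    self.data.append("partition_end")
                    self.partition_opened = False
                self.data.append("partition_start")
                await asyncio.sleep(0)  # yield point, as in Case 2
                self.partition_opened = True
                self.data.append(row)
                await asyncio.sleep(0)  # yield point, as in Case 1

    async def write_end_of_stream(self):
        async with self.lock:
            if self.partition_opened:
                self.data.append("partition_end")
                self.partition_opened = False
            self.data.append("end_of_stream")


async def run_race():
    writer = SstableWriter()
    await asyncio.gather(writer.apply_rows(["r1", "r2"]),
                         writer.write_end_of_stream())
    return writer.data
```

Without the lock, `write_end_of_stream()` could run at either yield point and produce the unbalanced streams shown in the two cases.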

Fixes: #6394
Fixes: #6296
Fixes: #6414
(cherry picked from commit b2c4d9fdbc)
2020-05-20 08:08:11 +03:00
Piotr Dulikowski
dcfaf4d035 hinted handoff: don't keep positions of old hints in rps_set
When sending hints from one file, rps_set field in send_one_file_ctx
keeps track of commitlog positions of hints that are being currently
sent, or have failed to be sent. At the end of the operation, if sending
of some hints failed, we will choose position of the earliest hint that
failed to be sent, and will retry sending that file later, starting from
that position. This position is stored in _last_not_complete_rp.

Usually, this set has a bounded size, because we impose a limit of at
most 128 hints being sent concurrently. Because we do not attempt to
send any more hints after a failure is detected, rps_set should not have
more than 128 elements at a time.

Due to a bug, commitlog positions of old hints (older than
gc_grace_seconds of the destination table) were inserted into rps_set
but not removed after checking their age. This could cause rps_set to
grow very large when replaying a file with old hints.

Moreover, if the file mixed expired and non-expired hints (which could
happen if it had hints to two tables with different gc_grace_seconds),
and sending of some non-expired hints failed, then positions of expired
hints could influence the calculation of _last_not_complete_rp, and more hints
than necessary would be resent on the next retry.

This simple patch removes commitlog position of a hint from rps_set when
it is detected to be too old.
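
The bookkeeping can be sketched as follows. This is a simplified model with hypothetical names: it sends hints one at a time and does not stop after the first failure, unlike the real sender.

```python
def send_hints(hints, now, gc_grace_seconds, send):
    """Return replay positions still pending after one pass over a hints file.

    hints is a list of (position, written_at) pairs; send(position) returns
    True on success.
    """
    rps_set = set()
    for position, written_at in hints:
        rps_set.add(position)
        if now - written_at > gc_grace_seconds:
            rps_set.discard(position)  # too old: drop the hint and forget it
            continue
        if send(position):
            rps_set.discard(position)
    return rps_set
```

Only positions of hints that really failed remain, so the retry starts from the earliest of those.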

Fixes #6422

(cherry picked from commit 85d5c3d5ee)
2020-05-20 08:06:04 +03:00
Piotr Dulikowski
f974a54cbd hinted handoff: remove discarded hint positions from rps_set
Related commit: 85d5c3d

When attempting to send a hint, an exception might occur that results in
that hint being discarded (e.g. keyspace or table of the hint was
removed).

When such an exception is thrown, position of the hint will already be
stored in rps_set. We are only allowed to retain positions of hints that
failed to be sent and needed to be retried later. Dropping a hint is not
an error, therefore its position should be removed from rps_set - but
current logic does not do that.

Because of that bug, hint files with many discardable hints might cause
rps_set to grow large when the file is replayed. Furthermore, leaving
positions of such hints in rps_set might cause more hints than necessary
to be re-sent if some non-discarded hints fail to be sent.

This commit fixes the problem by removing positions of discarded hints
from rps_set.

Fixes #6433

(cherry picked from commit 0c5ac0da98)
2020-05-20 08:03:44 +03:00
Piotr Sarna
30a96cc592 db, view: remove duplicate entries from pending endpoints
When generating view updates, an endpoint can appear both
as a primary paired endpoint for the view update, and as a pending
endpoint (due to range movements). In order not to generate
the same update twice for the same endpoint, the paired endpoint
is removed from the list of pending endpoints if present.
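
In sketch form, with endpoints as plain strings and a hypothetical helper name:

```python
def pending_without_paired(paired_endpoint, pending_endpoints):
    """Drop the paired endpoint from the pending list, keeping the order."""
    return [ep for ep in pending_endpoints if ep != paired_endpoint]
```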

Fixes #5459
Tests: unit(dev),
       dtest(TestMaterializedViews.add_dc_during_mv_insert_test)

(cherry picked from commit 86b0dd81e3)
2020-05-17 19:09:58 +02:00
Avi Kivity
faf300382a Update seastar submodule
* seastar 8bc24f486a...447aad8d78 (1):
  > timer: add scheduling_group awareness

Fixes #6170.
2020-05-10 18:12:32 +03:00
Gleb Natapov
55400598ff storage_proxy: limit read repair only to replicas that answered during speculative reads
Speculative reader has more targets than needed for CL. In case there is
a digest mismatch the repair runs between all of them, but that violates
the provided CL. The patch makes it so that repair runs only between
replicas that answered (there will be CL of them).

Fixes #6123

Reviewed-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20200402132245.GA21956@scylladb.com>
(cherry picked from commit 36a24bbb70)
2020-05-07 19:48:24 +03:00
Mike Goltsov
c177295bce fix error in fstrim service (scylla_util.py)
On Centos 7 machine:

fstrim.timer is not enabled, only unmasked, due to scylla_fstrim_setup on installation.
When trying to run the scylla-fstrim service manually you get an error:

Traceback (most recent call last):
File "/opt/scylladb/scripts/libexec/scylla_fstrim", line 60, in <module>
main()
File "/opt/scylladb/scripts/libexec/scylla_fstrim", line 44, in main
cfg = parse_scylla_dirs_with_default(conf=args.config)
File "/opt/scylladb/scripts/scylla_util.py", line 484, in parse_scylla_dirs_with_default
if key not in y or not y[k]:
NameError: name 'k' is not defined

It is caused by an error in scylla_util.py.
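
A sketch of the corrected condition. Only the faulty line comes from the traceback; the surrounding function, its name and its arguments are hypothetical.

```python
def fill_defaults(y, defaults):
    """Fill in missing or empty config entries from a table of defaults."""
    for key, value in defaults.items():
        # The faulty line read `y[k]`; `k` was never defined, hence the NameError.
        if key not in y or not y[key]:
            y[key] = value
    return y
```

The NameError only fires when `key` is present in `y`, because `or` short-circuits otherwise; that may be why the bug went unnoticed.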

Fixes #6294.

(cherry picked from commit 068bb3a5bf)
2020-05-07 19:45:35 +03:00
Hagit Segev
d95aa77b62 release: prepare for 4.0.0 2020-05-05 18:58:39 +03:00
Pekka Enberg
fe54009855 scripts/jobs: Keep memory reserve when calculating parallelism
The "jobs" script is used to determine the amount of compilation
parallelism on a machine. It attempts to ensure each GCC process has at
least 4 GB of memory per core. However, in the worst case scenario, we
could end up having the GCC processes take up all the system memory,
forcing swapping or OOM killer to kick in. For example, on a 4 core
machine with 16 GB of memory, this worst case scenario seems easy to
trigger in practice.

Fix up the problem by keeping a 1 GB memory reserve for other
processes and calculating parallelism based on that.
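
The arithmetic, rendered in Python as a sketch. The function name, parameters and the floor of one job are assumptions; the script's exact formula is not shown here.

```python
def compile_jobs(cpu_count, memory_gb, reserve_gb=1, per_job_gb=4):
    """Parallelism limited by core count and by memory left after a reserve."""
    by_memory = (memory_gb - reserve_gb) // per_job_gb
    # Never go below one job, even on very small machines.
    return max(1, min(cpu_count, by_memory))
```

On the 4 core, 16 GB machine from the example this gives 3 jobs rather than 4.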

Message-Id: <20200423082753.31162-1-penberg@scylladb.com>
(cherry picked from commit 7304a795e5)
2020-05-04 19:01:14 +03:00
Piotr Sarna
bbe82236be clocks-impl: switch to thread-safe time conversion
std::gmtime() has a sad property of using a global static buffer
for returning its value. This is not thread-safe, so its usage
is replaced with gmtime_r, which can accept a local buffer.
While no regressions were observed in this particular area of code,
a similar bug caused failures in alternator, so it's better to simply
replace all std::gmtime calls with their thread-safe counterpart.

Message-Id: <39e91c74de95f8313e6bb0b12114bf12c0e79519.1588589151.git.sarna@scylladb.com>
(cherry picked from commit 05ec95134a)
2020-05-04 17:14:28 +03:00
Piotr Sarna
abd73cab78 alternator: fix signature timestamps
Generating timestamps for auth signatures used a non-thread-safe
::gmtime function instead of thread-safe ::gmtime_r.

Tests: unit(dev)
Fixes #6345

(cherry picked from commit fb7fa7f442)
2020-05-04 17:05:39 +03:00
Nadav Har'El
8fd7cf5cd1 alternator test: drastically reduce time to boot Scylla
The alternator test, test/alternator/run, runs Scylla and runs the
various tests against it. Before this patch, just booting Scylla took
about 26 seconds (for a dev build, on my laptop). This patch reduces
this delay to less than one second!

It turns out that almost the entire delay was artificial, two periods
of 12 seconds "waiting for the gossip to settle", which are completely
unnecessary in the one-node cluster used in the Alternator test.
So a simple "--skip-wait-for-gossip-to-settle 0" parameter eliminates
these long delays completely.

Amusingly, the Scylla boot is now so fast, that I had to change a "sleep 2"
in the test script to "sleep 1", because 2 seconds is now much more than
it takes to boot Scylla :-)

Fixes #6310.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200428145035.22894-1-nyh@scylladb.com>
(cherry picked from commit ff5615d59d)
2020-05-04 16:10:27 +03:00
Alejo Sanchez
dd88b2dd18 utils: error injection allocate string for remote invoke
Allocate string before sending to other shards.

Reported by Pavel Solodovnikov.

Refs #3295 (closed)

Tests: unit ({dev})

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Message-Id: <20200328204454.1326514-2-alejo.sanchez@scylladb.com>
(cherry picked from commit e5a2ba32b9)

Ref #6342.
2020-05-03 19:33:34 +03:00
Hagit Segev
eee4c00e29 release: prepare for 4.0.rc3 2020-05-01 00:46:40 +03:00
Avi Kivity
85071ceeb1 Merge 'Fix hang in multishard_writer' from Asias
"
This series fixes a hang in multishard_writer when an error happens. It contains
- multishard_writer: Abort the queue attached to consumers when producer fails
- repair: Fix hang when the writer is dead

Fixes #6241
Refs: #6248
"

* asias-stream_fix_multishard_writer_hang:
  repair: Fix hang when the writer is dead
  mutation_writer_test: Add test_multishard_writer_producer_aborts
  multishard_writer: Abort the queue attached to consumers when producer fails

(cherry picked from commit 8925e00e96)
2020-04-30 19:32:12 +03:00
Asias He
4cf201fc24 config: Do not enable repair based node operations by default
Give it some more time to mature. Use the old stream plan based node
operations by default.

Fixes: #6305
Backports: 4.0
(cherry picked from commit b8ac10c451)
2020-04-30 17:57:55 +03:00
Raphael S. Carvalho
c6ad5cf556 api/service: fix segfault when taking a snapshot without keyspace specified
If no keyspace is specified when taking snapshot, there will be a segfault
because keynames is unconditionally dereferenced. Let's return an error
because a keyspace must be specified when column families are specified.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200427195634.99940-1-raphaelsc@scylladb.com>
(cherry picked from commit 02e046608f)

Fixes #6336.
2020-04-30 12:49:13 +03:00
Piotr Sarna
51e3e6c655 Update seastar submodule
* seastar 251bc8f2...8bc24f48 (1):
  > http: make headers case-insensitive

Fixes #6319
2020-04-30 08:18:01 +02:00
Nadav Har'El
8ac6579b30 test.py: run Alternator test with the correct Scylla binary
The Alternator test's run script, test/alternator/run, runs Scylla.
By default, it chooses the last built Scylla executable build/*/scylla.

However, test.py has a "mode" option, that should be able to choose which
build mode to run. Before this patch, this mode option wasn't honored by
the Alternator test, so a "test.py alternator/run" would run the same
Scylla binary (the one last built) three times, instead of running each
of the three build modes.

We fix this in this patch: test.py now passes the "SCYLLA" environment
variable to the test/alternator/run script, indicating the location of the
Scylla binary with the appropriate build mode. The script already supported
this environment variable to override its default choice of Scylla binary.

In test.py, we add to the run_test() function an optional "env" parameter
which can be used to pass additional environment variables to the test.

Fixes #6286

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200427131958.28248-1-nyh@scylladb.com>
(cherry picked from commit 858a12755b)
2020-04-28 16:19:07 +03:00
Piotr Sarna
3744e66244 alternator: fix integer overflow warning in token generation
When generating tokens for parallel scan, debug mode undefined behavior
sanitizer complained that integer overflow sometimes happens when
multiplying two big values - delta and segment number.
In order to mitigate this warning, the multiplication is now split
into two smaller ones, and the generated machine code remains
identical (verified on gcc and clang via compiler explorer).

Fixes #6280
Tests: unit(dev)

(cherry picked from commit e17c237feb)
2020-04-28 16:15:31 +03:00
Piotr Sarna
d3bf349484 alternator: allow parallel scan
Parallel scans can be performed by providing Segment and TotalSegments
attributes to Scan request, which can be used to split the work among
many workers.
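
As an illustration of how Segment and TotalSegments can split the work, the sketch below cuts the signed 64-bit token space into contiguous slices, one per worker. The helper names `segment_start` and `segment_range` are hypothetical, and the unsigned-arithmetic approach is an assumption chosen to avoid signed overflow; it is not taken from Alternator's implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <utility>

// Hypothetical helper, not Alternator's actual code: first token of slice
// `segment` when the signed 64-bit token space is cut into `total` slices.
inline std::int64_t segment_start(std::uint64_t segment, std::uint64_t total) {
    assert(total > 0 && segment < total);
    const std::uint64_t delta = std::numeric_limits<std::uint64_t>::max() / total;
    // segment * delta <= (total - 1) * delta < 2^64, so this cannot wrap.
    const std::uint64_t offset = segment * delta;
    const std::uint64_t half = std::uint64_t{1} << 63;
    // Shift the unsigned offset down by 2^63 while staying inside int64_t.
    return offset >= half
        ? static_cast<std::int64_t>(offset - half)
        : std::numeric_limits<std::int64_t>::min() + static_cast<std::int64_t>(offset);
}

// Inclusive [first, last] token range that one worker would scan.
inline std::pair<std::int64_t, std::int64_t> segment_range(std::uint64_t segment,
                                                           std::uint64_t total) {
    const std::int64_t first = segment_start(segment, total);
    const std::int64_t last = (segment + 1 == total)
        ? std::numeric_limits<std::int64_t>::max()
        : segment_start(segment + 1, total) - 1;
    return {first, last};
}
```

With TotalSegments=4, each worker calls `segment_range(i, 4)` for its own Segment i; the four ranges are adjacent and together cover every token.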
This patch makes the parallel scan test succeed, so the xfail is removed.

Fixes #5059

(cherry picked from commit dbb9574aa2)
2020-04-28 16:07:43 +03:00
Nadav Har'El
3e6a8ba5bd test/alternator: increase timeout on Scylla boot
The Alternator test boots Scylla to test against it. We set an arbitrary
timeout for this boot to succeed: 100 seconds. This 100 seconds is
significantly more than the 25 seconds it takes on my laptop, and I thought
we'd never reach it. But it turns out that in some setups - running the
very slow debug build on slow and overcommitted nodes - 100 seconds is
not enough.

So this patch doubles the timeout to 200 seconds.

Note that this "200 seconds" is just a timeout, and doesn't affect normal
runs: Both a successful boot and a failed boot are recognized as soon as
they happen, and we never unnecessarily wait the entire 200 seconds.

Fixes #6271.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200422193920.17079-1-nyh@scylladb.com>
(cherry picked from commit 92e36c5df5)
2020-04-28 16:04:12 +03:00
Nadav Har'El
5f1785b9cf alternator: use RF=3 even if some nodes are temporarily down
Alternator is supposed to use RF=3 for new tables. Only when the cluster is
smaller than 3 nodes do we use RF=1 (and warn about it) - this is useful for
testing.

However, our implementation incorrectly tested the number of *live* nodes in
the cluster instead of the total number of nodes. As a result, if a 3-node
cluster had one node down, and a new table was created, it was created with
RF=1, and immediately could not be written because when RF=1, any node down
means part of the data is unavailable.

This patch fixes this: The total number of nodes in the cluster - not the
number of live nodes - is consulted. The three-node-cluster-with-a-dead-node
setup above creates the table with RF=3, and it can be written because two
living nodes out of three are enough when RF=3 and we do quorum writes and
reads.
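
The reasoning above can be sketched as follows. The function names are hypothetical and the code only models the arithmetic described in this message (RF=3 for clusters of three or more nodes, quorum as a majority of replicas); it is not the Alternator code itself.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of the rule in the message: RF=3 unless the cluster
// has fewer than 3 nodes. The argument must be the TOTAL node count.
inline int choose_replication_factor(std::size_t total_nodes) {
    return total_nodes >= 3 ? 3 : 1;
}

// A quorum is a majority of the replicas.
inline std::size_t quorum(int rf) {
    assert(rf > 0);
    return static_cast<std::size_t>(rf) / 2 + 1;
}

// True if enough replicas are alive to serve a quorum read or write.
inline bool quorum_reachable(int rf, std::size_t live_replicas) {
    return live_replicas >= quorum(rf);
}
```

For a 3-node cluster with one node down, passing the live count (2) yields RF=1, which is the bug; passing the total (3) yields RF=3, and `quorum_reachable(3, 2)` holds.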

We have a dtest to reproduce this bug (and its fix), and it's also easy to
reproduce manually by starting a 3-node cluster, killing one of the nodes,
and then running "pytests". Before this patch, the tests can create tables
but then fail to write to them. After this patch, the tests succeed on the
same cluster with the dead node.

Fixes #6267

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200422182035.15106-2-nyh@scylladb.com>
(cherry picked from commit 1f75efb556)
2020-04-28 15:52:06 +03:00
Nadav Har'El
e1fd6cf989 gossiper: add convenience function for getting number of nodes
The gossiper has the convenience functions get_up_endpoint_count() and
get_down_endpoint_count(), but strangely no function to get the total
number. Even though it's easy to calculate the total by summing up their
results, it is inefficient and also inconvenient because each of these
functions returns a future.

So let's add another function, get_all_endpoint_count(), to get the
total number of nodes. We will use this function in the next patch.

Signed-off-by: Nadav Har'El <n...@scylladb.com>
Message-Id: <20200422182035.15106-1-nyh@scylladb.com>
(cherry picked from commit 08c39bde1a)
2020-04-28 15:51:37 +03:00
Piotr Sarna
b7328ff1e4 alternator: implement ScanIndexForward
The ScanIndexForward parameter is now fully implemented
and can accept ScanIndexForward=false in order to query
the partitions in reverse clustering order.
Note that reading partition slices in reverse order is less
efficient than forward scans and may put a strain on memory
usage, especially for large partitions, since the whole partition
is currently fetched in order to be reversed.

Fixes #5153

(cherry picked from commit 09e4f3b917)
2020-04-28 15:30:01 +03:00
Avi Kivity
602ed43ac7 Update seastar submodule
* seastar 76260705ef...251bc8f25d (1):
  > http server: fix "Date" header format

Fixes #6253.
2020-04-26 19:30:08 +03:00
Tomasz Grabiec
c42c91c5bb Merge "Drop only learnt value on PRUNE" from Gleb
It is unsafe to remove the entire row, so only drop the learnt value from
the system.paxos table.

Fixes: #6154
(cherry picked from commit e648e314e5)
2020-04-21 18:30:12 +03:00
Avi Kivity
cf017b320a test: alternator: configure scylla for test environment in terms of cpu and disk
Currently, the alternator tests configure scylla to use all the
logical cores in the host system, but only 1GB of RAM. This can lead
to a small amount of memory per core.

It also uses the default disk configuration, which is safe, but can be
very slow on mechanical or non-enterprise disks.

Change to use a fixed --smp 2 configuration, and add --overprovisioned
for maximum flexibility (no spinning). Use --unsafe-bypass-fsync
for faster performance on non-enterprise or mechanical disks, assuming
that the test data is not important.

Fixes #6251.
Message-Id: <20200420154112.123386-1-avi@scylladb.com>

(cherry picked from commit 2482e53de9)
2020-04-21 18:25:28 +03:00
Hagit Segev
89e79023ae release: prepare for 4.0.rc2 2020-04-21 16:26:09 +03:00
Nadav Har'El
bc67da1a21 alternator-test: comment out an error-path test that doesn't work on newer boto3
Unfortunately, the boto3 library doesn't allow us to check some of the
input error cases because it unnecessarily tests its input instead of
just passing it to Alternator and allowing Alternator to report the error.
In this patch we comment out a test case which used to work fine - i.e.,
the error was reported by Alternator - until recent changes to boto3
made it catch the problem without passing it to Alternator :-(

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200330190521.19526-2-nyh@scylladb.com>
(cherry picked from commit fe6cecb26d)
2020-04-21 07:19:54 +02:00
Botond Dénes
0c7643f1fe schema: schema(): use std::stable_sort() to sort key columns
When multiple key columns (clustering or partition) are passed to
the schema constructor, all having the same column id, the expectation
is that these columns will retain the order in which they were passed to
`schema_builder::with_column()`. Currently however this is not
guaranteed as the schema constructor sorts key columns by column id with
`std::sort()`, which doesn't guarantee that equally comparing elements
retain their order. This can be an issue for indexes, the schemas of
which are built independently on each node. If there is any room for
variance in the key column order, this can result in different
nodes having incompatible schemas for the same index.
The fix is to use `std::stable_sort()` which guarantees that the order
of equally comparing elements won't change.
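
A minimal sketch of the property relied on here, using a hypothetical `column` struct rather than Scylla's real column definition: when every key column has the same id, `std::stable_sort()` must return them in the order they were added.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a key column definition.
struct column {
    int id;            // column id, the only thing the sort compares
    std::string name;  // identifies the column; not part of the ordering
};

// Sort by id only. Columns whose ids compare equal keep their relative
// order, which std::sort() does not promise.
inline std::vector<column> sort_key_columns(std::vector<column> cols) {
    auto by_id = [](const column& a, const column& b) { return a.id < b.id; };
    std::stable_sort(cols.begin(), cols.end(), by_id);
    assert(std::is_sorted(cols.begin(), cols.end(), by_id));
    return cols;
}
```

Because the comparison ignores `name`, two nodes that add the same columns in the same order end up with the same key column order after sorting.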

This is a suspected cause of #5856, although we don't have hard proof.

Fixes: #5856
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
[avi: upgraded "Refs" to "Fixes", since we saw that std::sort() becomes
      unstable at 17 elements, and the failing schema had a
      clustering key with 23 elements]
Message-Id: <20200417121848.1456817-1-bdenes@scylladb.com>
(cherry picked from commit a4aa753f0f)
2020-04-19 18:18:45 +03:00
Rafael Ávila de Espíndola
c563234f40 dht: Use get_random_number<uint64_t> instead of int64_t in token::get_random_token
I bisected the opposite change in
9c202b52da as the cause of issue 6193. I
don't know why. Maybe get_random_number<signed_type> is buggy?

In any case, reverting to uint64_t solves the issue.

Fixes #6193

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200418001611.440733-1-espindola@scylladb.com>
(cherry picked from commit f3fd466156)
2020-04-19 16:20:40 +03:00
Nadav Har'El
77b7a48a02 alternator: remove mentions of experimental status of LWT
Since commit 9948f548a5, the LWT no longer
requires an "experimental" flag, so Alternator documents and scripts
which referred to the need for enabling experimental LWT, are fixed here
to no longer do that.

Fixes #6118.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200405143237.12693-1-nyh@scylladb.com>
(cherry picked from commit d9d50362af)
2020-04-19 15:10:32 +03:00
Piotr Sarna
b2b1bfb159 alternator: fix failure on incorrect table name with no indexes
If a table name is not found, it may still exist as a local index,
but the check tried to fetch a local index name regardless if it was
present in the request, which was a nullptr dereference bug.

Fixes #6161
Tests: alternator-test(local, remote)
Message-Id: <428c21e94f6c9e450b1766943677613bd46cbc68.1586347130.git.sarna@scylladb.com>

(cherry picked from commit 123edfc10c)
2020-04-19 15:07:25 +03:00
Nadav Har'El
d72cbe37aa docs/alternator/alternator.md: fix typos
Fix a couple of typos in the Alternator documentation.
Fixes scylladb/scylla-doc-issues#280
Fixes scylladb/scylla-doc-issues#281

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200419091900.23030-1-nyh@scylladb.com>
(cherry picked from commit 7e7c688946)
2020-04-19 15:03:22 +03:00
Nadav Har'El
9f7b560771 docs, alternator: alternator.md cleanup
Clean up the alternator.md document, by:

* Updating out-of-date information that outstayed its welcome.
* When Scylla does have a feature but it's just not supported via the
  DynamoDB API (e.g., CDC and on-demand backups) mention that.
* Remove mention of Alternator being experimental and users should not
  store important data on it :-)
* Miscellaneous cleanups.

Fixes #6179.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200412094641.27186-1-nyh@scylladb.com>
(cherry picked from commit 606ae0744c)
2020-04-19 15:00:53 +03:00
Nadav Har'El
06af9c028c alternator-test: make Alternator tests runnable from test.py
To make the tests in alternator-test runnable by test.py, we need to
move the directory alternator-test/ to test/alternator, because test.py
only looks for tests in subdirectories of test/. Then, we need to create
a test/alternator/suite.yaml saying that this test directory is of type
"Run", i.e., it has a single run script "run" which runs all its tests.

The "run" script had to be slightly modified to be aware of its new
location relative to the source directory.

To run the Alternator tests from test.py, do:

	./test.py --mode dev alternator

Note that in this version, the "--mode" has no effect - test/alternator/run
always runs the latest compiled Scylla, regardless of the chosen mode.

The Alternator tests can still be run manually and individually against
a running Scylla or DynamoDB as before - just go to the test/alternator
directory (instead of alternator-test previously) and run "pytest" with
the desired parameters.

Fixes #6046

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 4e2bf28b84)
2020-04-19 11:19:15 +03:00
Nadav Har'El
c74ab3ae80 test.py: add xunit XML output file for "Run" tests
Assumes that "Run" tests can take the --junit-xml=<path> option, and
pass it to ask the test to generate an XML summary of the run to a file
like testlog/dev/xml/run.1.xunit.xml.

This option is honored by the Alternator tests.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 0cccb5a630)
2020-04-19 11:19:06 +03:00
Nadav Har'El
32cd3a070a test.py: add new test type "Run"
This patch adds a new test type, "Run". A test subdirectory of type "Run"
has a script called "run" which is expected to run all the tests in that
directory.

This will be used, in the next patch, by the Alternator functional tests.
These tests indeed have a "run" script, which runs Scylla and then runs
*all* of Alternator's tests, finishing fairly quickly (in less than a
minute). All of that will become one test.py test.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 0ae3136900)
2020-04-19 11:18:01 +03:00
Nadav Har'El
bb1554f09e test.py: flag for aborting tests with SIGTERM, not SIGKILL
Today, if test.py is interrupted with SIGINT or SIGTERM, the ongoing test
is killed with SIGKILL. Some types of tests - such as Alternator's test -
may depend on being killed politely (e.g., with SIGTERM) to clean up
files.

We cannot yet change the signal to SIGTERM for all tests, because Seastar
tests often don't deal well with signals, but we can at least add a flag
that certain test types - that know they can be killed gently - will use.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 36e44972f1)
2020-04-19 11:17:51 +03:00
Nadav Har'El
2037d7550e alternator-test: change "run" script to pick random IP address
Before this patch, the Alternator tests "run" script ran Scylla on a fixed
listening address, 127.0.0.1. There is a problem that there might be other
concurrent runs of Scylla using the same IP address - e.g., CCM (used by
dtest) uses exactly this IP address for its first node.

Luckily, Linux's loopback device actually allows us to pick any of over
a million addresses in 127.0.0.0/8 to listen on - we don't need to use
127.0.0.1 specifically. So the code in this patch picks an address in
127.1.*.*, so it cannot collide with CCM (which uses 127.0.0.* for up to
255 nodes). Moreover, the last two bytes of the listen address are picked
based on the process ID of the run script; This allows multiple copies
of this script to run concurrently - in case anybody wishes to do that.
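
One way to derive such an address is sketched below. The exact mapping from process ID to the last two bytes is an assumption (low 16 bits of the PID), and `loopback_address_for_pid` is a hypothetical name; the real "run" script may compute it differently.

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch: build an address in 127.1.*.* whose last two bytes
// come from the low 16 bits of the process ID (an assumed mapping).
inline std::string loopback_address_for_pid(unsigned long pid) {
    const unsigned long third = (pid >> 8) & 0xff;  // high byte of the low 16 bits
    const unsigned long fourth = pid & 0xff;        // low byte
    return "127.1." + std::to_string(third) + "." + std::to_string(fourth);
}
```

Two scripts with different PIDs get different addresses unless their PIDs agree in the low 16 bits, and neither can collide with CCM's 127.0.0.* range.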

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 24fcc0c0ff)
2020-04-19 11:17:39 +03:00
Nadav Har'El
c320c3f6da install-dependencies.sh: add dependencies for Alternator tests
To run Alternator tests, only two additional dependencies need to be added to
install-dependencies.sh: pytest, and python3-boto3. We also need
python3-cassandra-driver, but this dependency is already listed.

This patch only updates the dependencies for Fedora, which is what we need
for dbuild and our Jenkins setups.

Tested by building a new dbuild docker image and verifying that the
Alternator tests pass.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
[avi: update toolchain image; note this upgrades gcc to 9.3.1]
Message-Id: <20200330181128.18582-1-nyh@scylladb.com>
(cherry picked from commit 8627ae42a6)
2020-04-19 11:17:07 +03:00
Nadav Har'El
0ed70944aa alternator-test: run: use the Python driver, not cqlsh
The "run" script for the Alternator tests needs to set a system table for
authentication credentials, so we can test this feature.
So far we did this with cqlsh, but cqlsh isn't always installed on build
machines. But install-dependencies.sh already installs the Cassandra driver
for Python, so it makes more sense to use that, so this patch switches to
use it.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200331131522.28056-1-nyh@scylladb.com>
(cherry picked from commit 55f02c00f2)
2020-04-19 11:16:54 +03:00
Nadav Har'El
89f860d409 alternator-test: add "--url" option to choose Alternator's URL
The "--aws" and "--local" test options choose between two useful default
URLs - Amazon's, or http://localhost:8000 for a local installation.
However, sometimes one wants to run Scylla on a different IP address or
port, so in this patch we add a "--url" option to choose a specific URL to
connect to. For example, "--url http://127.1.2.3:1234".

We will later use this option in the alternator-test/run script, to pick
a random IP address on which to run Scylla, and then run the test against
this address.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 1aec4baa51)
2020-04-19 11:13:13 +03:00
Piotr Sarna
0819d221f4 test: add cases for empty paging state for index queries
In order to check regressions related to #6136 and similar issues,
test cases for handling paging state with empty partition/clustering
key pair are added.

(cherry picked from commit 88913e9d44)
2020-04-19 10:35:26 +03:00
Piotr Sarna
53f47d4e67 cql3: fix generating base keys from empty index paging state
An empty partition/clustering key pair is a valid state of the
query paging state. Unfortunately, recent attempts at debugging
a flaky test resulted in introducing an assertion which breaks
when trying to generate a key from such a pair.
In order to keep the assertion (since it still makes sense in its
scope), but at the same time translate empty keys properly,
empty keys are now explicitly processed at the beginning of the
function.
This behaviour was 100% reproducible in a secondary index dtest below.

Fixes #6134
Refs #5856
Tests: unit(dev),
       dtest(TestSecondaryIndexes.test_truncate_base)

(cherry picked from commit 45751ee24f)
2020-04-19 10:35:09 +03:00
Kamil Braun
21ad12669a sstables: freeze types nested in collection types in legacy sstables
Some legacy `mc` SSTables (created in Scylla 3.0) may contain incorrect
serialization headers, which don't wrap frozen UDTs nested inside collections
with the FrozenType<...> tag. When reading such SSTable,
Scylla would detect a mismatch between the schema saved in schema
tables (which correctly wraps UDTs in the FrozenType<...> tag) and the schema
from the serialization header (which doesn't have these tags).

SSTables created in Scylla versions 3.1 and above, in particular in
Scylla versions that contain this commit, create correct serialization
headers (which wrap UDTs in the FrozenType<...> tag).

This commit does two things:
1. for all SSTables created after this commit, include a new feature
   flag, CorrectUDTsInCollections, presence of which implies that frozen
   UDTs inside collections have the FrozenType<...> tag.
2. when reading a Scylla SSTable without the feature flag, we assume that UDTs
   nested inside collections are always frozen, even if they don't have
   the tag. This assumption is safe to be made, because at the time of
   this commit, Scylla does not allow non-frozen (multi-cell) types inside
   collections or UDTs, and because of point 1 above.

There is one edge case not covered: if we don't know whether the SSTable
comes from Scylla or from C*. In that case we won't make the assumption
described in 2. Therefore, if we get a mismatch between schema and
serialization headers of a table which we couldn't confirm to come from
Scylla, we will still reject the table. If any user encounters such an
issue (unlikely), we will have to use another solution, e.g. using a
separate tool to rewrite the SSTable.

Fixes #6130.

(cherry picked from commit 3d811e2f95)
2020-04-17 09:11:53 +03:00
Kamil Braun
c812359383 sstables: move definition of column_translation::state::build to a .cc file
Ref #6130
2020-04-17 09:11:38 +03:00
Piotr Sarna
1bd79705fb alternator: use partition tombstone if there's no clustering key
As @tgrabiec helpfully pointed out, creating a row tombstone
for a table which does not have a clustering key in its schema
creates something that looks like an open-ended range tombstone.
That's problematic for KA/LA sstable formats, which are incapable
of writing such tombstones, so a workaround is provided
in order to allow using KA/LA in alternator.

Fixes #6035

(cherry picked from commit 0a2d7addc0)
2020-04-16 12:01:51 +03:00
Avi Kivity
7e2ef386cc Update seastar submodule
* seastar 92c488706...76260705e (1):
  > rpc: always shutdown socket when stopping a client

Fixes #6060.
2020-04-16 10:56:31 +03:00
Avi Kivity
51bad7e72c Point seastar submodule at scylla-seastar.git branch-4.0
This allows us to backport seastar patches to Scylla 4.0.
2020-04-16 10:10:40 +03:00
Asias He
0379d0c031 repair: Send reason for node operations
Since 956b092012 (Merge "Repair based node
operation" from Asias), repair is used by other node operations like
bootstrap, decommission and so on.

Send the reason for the repair, so that we can handle the materialized
view update correctly according to the reason of the operation. We want
to trigger the view update only if the repair is used by repair
operation. Otherwise, the view table will be handled twice, 1) when the
view table is synced using repair 2) when the base table is synced using
repair and view table update is triggered.

Fixes #5930
Fixes #5998

(cherry picked from commit 066934f7c4)
2020-04-16 10:06:17 +03:00
Gleb Natapov
a8ef820f27 lwt: fix cas_now_pruning counter
Due to a copy-and-paste error, the cas_now_pruning counter is increased instead of
decreased after an operation completes. Fix it.

Fixes #6116

Message-Id: <20200401142859.GA16953@scylladb.com>
(cherry picked from commit 4d9d226596)
2020-04-06 13:06:11 +02:00
Yaron Kaikov
9908f009a4 release: prepare for 4.0.rc1 2020-04-06 10:22:45 +03:00
Pavel Emelyanov
48d8a075b4 main: Do not destroy token_metadata
The storage_proxy instances hold references to token_metadata ones and
leave unwaited futures continuing to its query_partition_key_range_concurrent
method.

The latter is called from do_query so it's not that easy to find
out who is leaking. Keep the tokens not freed for a while.

Fixes: #6093
Test: manual start-stop

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200402183538.9674-1-xemul@scylladb.com>
(cherry picked from commit 86296ba557)
2020-04-05 13:47:57 +03:00
Konstantin Osipov
e3ddd607bc lwt: remove Paxos from experimental list
Always enable lightweight transactions. Remove the check for the command
line switch from the feature service, assuming LWT is always enabled.

Remove the check for LWT from Alternator.

Note that in order for the cluster to work with LWT, all nodes need
to support it.

Rename LWT to UNUSED in db/config.hh, to keep accepting lwt keyword in
--experimental-features command line option, but do nothing with it.

Changes in v2:
* remove enable_lwt feature flag, it's always there

Closes #6102

test: unit (dev, debug)
Message-Id: <20200401071149.41921-1-kostja@scylladb.com>
(cherry picked from commit 9948f548a5)
2020-04-05 08:56:42 +03:00
Piotr Jastrzebski
511773d466 token: relax the condition of the sanity check
When we switched token representation to int64_t
we added some sanity checks that byte representation
is always 8 bytes long.

It turns out that for token_kind::before_all_keys and
token_kind::after_all_keys bytes can sometimes be empty
because for those tokens they are just ignored. The check
introduced with the change is too strict and sometimes
throws the exception for tokens before/after all keys
created with empty bytes.

This patch relaxes the condition of the check and always
uses 0 as value of _data for special before/after all keys
tokens.

Fixes #6131

Tests: unit(dev, sct)

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
(cherry picked from commit a15b32c9d9)
2020-04-04 20:19:10 +03:00
Gleb Natapov
121cd383fa lwt: remove entries from system.paxos table after successful learn stage
The learning stage of PAXOS protocol leaves behind an entry in
system.paxos table with the last learned value (which can be large). If
not all participants learned it successfully, the next round on the same
key may complete the learning using this info. But if all nodes learned
the value, the entry no longer serves a useful purpose.

The patch adds another round, "prune", which is executed in background
(limited to 1000 simultaneous instances) and removes the entry in
case all nodes replied successfully to the "learn" round.  It uses the
ballot's timestamp to do the deletion, so not to interfere with the
next round. Since deletion happens very close to previous writes it will
likely happen in memtable and will never reach sstable, so that reduces
memtable flush and compaction overhead.

Fixes #5779

Message-Id: <20200330154853.GA31074@scylladb.com>
(cherry picked from commit 8a408ac5a8)
2020-04-02 15:36:52 +02:00
Gleb Natapov
90639f48e5 lwt: rename "in_progress_ballot" cell to "promise" in system.paxos table
The value that is stored in "in_progress_ballot" cell is the value of
promised ballot, so call the cell accordingly to avoid confusion
especially as we have a notion of "in progress" proposal in the code
which is not the same as in_progress_ballot here.

We can still do it without care about backwards compatibility since LWT
is still marked as experimental.

Fixes #6087.

Message-Id: <20200326095758.GA10219@scylladb.com>
(cherry picked from commit b3db6f5b04)
2020-04-02 15:36:49 +02:00
Calle Wilund
8d029a04aa db::commitlog: Don't write trailing zero block unless needed
Fixes #5899

When terminating (closing) a segment, we write a trailing block
of zero so reader can have an empty region after last used chunk
as end marker. This is due to using recycled, pre-allocated
segments with potentially non-zero data extending over the point
where we are ending the segment (i.e. we are not fully filling
the segment due to a huge mutation or similar).

However, if we reach end of segment writing the final block
(typically many small mutations), the file will end naturally
after the data written, and any trailing zero block would in fact
just extend the file further. While this will only happen once per
segment recycled (independent of how many times it is recycled),
it is still both slightly breaking the disk usage contract and
also potentially causing some disk stalls due to metadata changes
(though of course very infrequent).

We should only write a trailing zero block if we are below the max_size
file size when terminating.
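
The rule can be sketched as a single predicate. The name `needs_trailing_zero_block` and its parameters are hypothetical; the sketch only restates the condition above and is not the commitlog code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical predicate: a recycled segment may hold stale bytes past the
// last chunk, so an end marker is needed only if we stopped short of
// max_size. At max_size the file ends right after the data written.
inline bool needs_trailing_zero_block(std::uint64_t bytes_written,
                                      std::uint64_t max_size) {
    return bytes_written < max_size;
}
```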

Adds a small size check to commitlog test to verify size bounds.
(Which breaks without the patch)

v2:
- Fix test to take into account that files might be deleted
  behind our backs.
v3:
- Fix test better, by doing verification _before_ segments are
  queued for delete.

Message-Id: <20200226121601.15347-2-calle@scylladb.com>
Message-Id: <20200324100235.23982-1-calle@scylladb.com>
(cherry picked from commit 9fee712d62)
2020-03-31 14:22:20 +03:00
Asias He
67995db899 gossip: Add an option to force gossip generation
Consider 3 nodes in the cluster, n1, n2, n3 with gossip generation
number g1, g2, g3.

n1, n2, n3 running scylla version with commit
0a52ecb6df (gossip: Fix max generation
drift measure)

One year later, the user wants to upgrade n1, n2, n3 to a new version.

When n3 does a rolling restart with a new version, n3 will use a
generation number g3'. Because g3' - g2 > MAX_GENERATION_DIFFERENCE and
g3' - g1 > MAX_GENERATION_DIFFERENCE, n1 and n2 will reject n3's
gossip update and mark n3 as down.
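
The rejection rule described above can be sketched like this. The value of the constant (one year in seconds) and the function name are assumptions made for illustration; only the comparison itself comes from this message.

```cpp
#include <cassert>
#include <cstdint>

// Assumed value: one year, in seconds. The message only names the constant.
constexpr std::int64_t max_generation_difference = 86400LL * 365;

// Hypothetical model of the check: a peer's generation is accepted only if
// it is not more than max_generation_difference ahead of the local one.
inline bool accepts_generation(std::int64_t local_generation,
                               std::int64_t remote_generation) {
    return remote_generation - local_generation <= max_generation_difference;
}
```

In the scenario above, g3' is more than a year ahead of g1 and g2, so `accepts_generation(g1, g3')` is false on n1 and n2.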

Such unnecessary marking of node down can cause availability issues.
For example:

DC1: n1, n2
DC2: n3, n4

When n3 and n4 restart, n1 and n2 will mark n3 and n4 as down, which
causes the whole DC2 to be unavailable.

To fix, we can start the node with a gossip generation within
MAX_GENERATION_DIFFERENCE difference for the new node.

Once all the nodes run the version with commit
0a52ecb6df, the option is no longer
needed.

Fixes #5164

(cherry picked from commit 743b529c2b)
2020-03-30 12:36:20 +02:00
Yaron Kaikov
282cd0df7c dist/docker: Update SCYLLA_REPO_URL and VERSION defaults
Update the SCYLLA_REPO_URL and VERSION defaults to point to the latest
unstable 4.0 version. This will be used if someone runs "docker build"
locally. For the releases, the release pipelines will pass the stable
version repository URL and a specific release version.
2020-03-26 09:54:44 +02:00
Nadav Har'El
ce58994d30 sstable: default to LA format instead of KA format
Over the years, Scylla updated the sstable format from the KA format to
the LA format, and most recently to the MC format. On a mixed cluster -
as occurs during a rolling upgrade - we want all the nodes, even new ones,
to write sstables in the format preferred by the old version. The thinking
is that if the upgrade fails, and we want to downgrade all nodes back to
the older version, we don't want to lose data because we already have
too-new sstables.

So the current code starts by selecting the oldest format we ever had - KA,
and only switching this choice to LA and MC after we verify that all the
nodes in the cluster support these newer formats.

But before an agreement is reached on the new format, sstables may already
be created in the antique KA format. This is usually harmless - we can
read this format just fine. However, the KA format has a problem that it is
unable to represent table names or keyspaces with the "-" character in them,
because this character is used to separate the keyspace and table names in
the file name. For CQL, a "-" is not allowed anyway in keyspace or table
names; But for Alternator, this character is allowed - and if a KA table
happens to be created by accident (before the LA or MC formats are chosen),
it cannot be read again during boot, and Scylla cannot reboot.
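
To see why a "-" in a name is a problem, consider a naive parser for KA-style file names. The layout `<keyspace>-<table>-ka-<generation>-<component>` is assumed here for illustration, and `parse_ka_file_name` is a hypothetical helper, not Scylla's parser.

```cpp
#include <cassert>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

// Fields of an (assumed) KA-style sstable file name.
struct ka_file_name {
    std::string keyspace;
    std::string table;
    std::string version;
    std::string generation;
    std::string component;
};

// Hypothetical parser: split on '-' and expect exactly five fields with
// "ka" in the third position. A '-' inside a name adds extra fields.
inline std::optional<ka_file_name> parse_ka_file_name(const std::string& name) {
    std::vector<std::string> parts;
    std::istringstream in(name);
    for (std::string part; std::getline(in, part, '-');) {
        parts.push_back(part);
    }
    if (parts.size() != 5 || parts[2] != "ka") {
        return std::nullopt;
    }
    return ka_file_name{parts[0], parts[1], parts[2], parts[3], parts[4]};
}
```

A name such as "ks-tbl-ka-1-Data.db" parses cleanly, while "ks-my-table-ka-1-Data.db" splits into six fields and is rejected, because the separator cannot be told apart from the "-" in the table name.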

The solution that this patch takes is to change Scylla's default sstable
format to LA (and, as before, if the entire cluster agrees, the newer MC
format will be used). From now on, new KA tables will never be written.
But we still fully support *reading* the KA format - this is important in
case some very old sstables never underwent compaction.

The old code had, confusingly, two places where the default KA format
was chosen. This patch fixes it so the new default (LA) is specified in
only one place.

Fixes #6071.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200324232607.4215-2-nyh@scylladb.com>
(cherry picked from commit 91aba40114)
2020-03-25 13:27:51 +01:00
Yaron Kaikov
78f5afec30 release: prepare for 4.0.rc0 2020-03-24 23:33:23 +02:00
125 changed files with 1444 additions and 489 deletions

.gitmodules

@@ -1,6 +1,6 @@
[submodule "seastar"]
path = seastar
-url = ../seastar
+url = ../scylla-seastar
ignore = dirty
[submodule "swagger-ui"]
path = swagger-ui

View File

@@ -1,7 +1,7 @@
#!/bin/sh
PRODUCT=scylla
-VERSION=666.development
+VERSION=4.0.4
if test -f version
then

View File

@@ -66,8 +66,9 @@ static std::string format_time_point(db_clock::time_point tp) {
time_t time_point_repr = db_clock::to_time_t(tp);
std::string time_point_str;
time_point_str.resize(17);
+::tm time_buf;
// strftime prints the terminating null character as well
-std::strftime(time_point_str.data(), time_point_str.size(), "%Y%m%dT%H%M%SZ", std::gmtime(&time_point_repr));
+std::strftime(time_point_str.data(), time_point_str.size(), "%Y%m%dT%H%M%SZ", ::gmtime_r(&time_point_repr, &time_buf));
time_point_str.resize(16);
return time_point_str;
}
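
For reference, the same formatting can be written as a small standalone function. This is a sketch assuming a POSIX system (for `gmtime_r`); the name `format_utc_compact` is hypothetical. Unlike `std::gmtime`, which returns a pointer to shared static storage, `gmtime_r` fills a caller-provided buffer, so concurrent callers do not overwrite each other's result.

```cpp
#include <cassert>
#include <cstddef>
#include <ctime>
#include <string>
#include <time.h>

// Hypothetical standalone variant of the function in the hunk above.
inline std::string format_utc_compact(std::time_t t) {
    std::tm time_buf{};
    ::gmtime_r(&t, &time_buf);  // POSIX: writes into time_buf, no shared state
    std::string out(17, '\0');  // 16 characters plus the terminating null
    const std::size_t written = std::strftime(out.data(), out.size(),
                                              "%Y%m%dT%H%M%SZ", &time_buf);
    out.resize(written);        // drop the terminating null (and any slack)
    return out;
}
```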

View File

@@ -208,12 +208,11 @@ get_table_or_view(service::storage_proxy& proxy, const rjson::value& request) {
throw api_error("ValidationException",
format("Non-string IndexName '{}'", index_name->GetString()));
}
}
-// If no tables for global indexes were found, the index may be local
-if (!proxy.get_db().local().has_schema(keyspace_name, table_name)) {
-type = table_or_view_type::lsi;
-table_name = lsi_name(orig_table_name, index_name->GetString());
+// If no tables for global indexes were found, the index may be local
+if (!proxy.get_db().local().has_schema(keyspace_name, table_name)) {
+type = table_or_view_type::lsi;
+table_name = lsi_name(orig_table_name, index_name->GetString());
}
}
try {
@@ -566,7 +565,7 @@ static void validate_tags(const std::map<sstring, sstring>& tags) {
// to races during concurrent updates of the same table. Once Scylla schema updates
// are fixed, this issue will automatically get fixed as well.
enum class update_tags_action { add_tags, delete_tags };
static future<> update_tags(const rjson::value& tags, schema_ptr schema, std::map<sstring, sstring>&& tags_map, update_tags_action action) {
static future<> update_tags(service::migration_manager& mm, const rjson::value& tags, schema_ptr schema, std::map<sstring, sstring>&& tags_map, update_tags_action action) {
if (action == update_tags_action::add_tags) {
for (auto it = tags.Begin(); it != tags.End(); ++it) {
const rjson::value& key = (*it)["Key"];
@@ -593,24 +592,12 @@ static future<> update_tags(const rjson::value& tags, schema_ptr schema, std::ma
}
validate_tags(tags_map);
std::stringstream serialized_tags;
serialized_tags << '{';
for (auto& tag_entry : tags_map) {
serialized_tags << format("'{}':'{}',", tag_entry.first, tag_entry.second);
}
std::string serialized_tags_str = serialized_tags.str();
if (!tags_map.empty()) {
serialized_tags_str[serialized_tags_str.size() - 1] = '}'; // trims the last ',' delimiter
} else {
serialized_tags_str.push_back('}');
}
sstring req = format("ALTER TABLE \"{}\".\"{}\" WITH {} = {}",
schema->ks_name(), schema->cf_name(), tags_extension::NAME, serialized_tags_str);
return db::execute_cql(std::move(req)).discard_result();
schema_builder builder(schema);
builder.set_extensions(schema::extensions_map{{sstring(tags_extension::NAME), ::make_shared<tags_extension>(std::move(tags_map))}});
return mm.announce_column_family_update(builder.build(), false, std::vector<view_ptr>(), false);
}
static future<> add_tags(service::storage_proxy& proxy, schema_ptr schema, rjson::value& request_info) {
static future<> add_tags(service::migration_manager& mm, service::storage_proxy& proxy, schema_ptr schema, rjson::value& request_info) {
const rjson::value* tags = rjson::find(request_info, "Tags");
if (!tags || !tags->IsArray()) {
return make_exception_future<>(api_error("ValidationException", format("Cannot parse tags")));
@@ -620,7 +607,7 @@ static future<> add_tags(service::storage_proxy& proxy, schema_ptr schema, rjson
}
std::map<sstring, sstring> tags_map = get_tags_of_table(schema);
return update_tags(rjson::copy(*tags), schema, std::move(tags_map), update_tags_action::add_tags);
return update_tags(mm, rjson::copy(*tags), schema, std::move(tags_map), update_tags_action::add_tags);
}
future<executor::request_return_type> executor::tag_resource(client_state& client_state, service_permit permit, rjson::value request) {
@@ -632,7 +619,7 @@ future<executor::request_return_type> executor::tag_resource(client_state& clien
return api_error("AccessDeniedException", "Incorrect resource identifier");
}
schema_ptr schema = get_table_from_arn(_proxy, std::string_view(arn->GetString(), arn->GetStringLength()));
add_tags(_proxy, schema, request).get();
add_tags(_mm, _proxy, schema, request).get();
return json_string("");
});
}
@@ -653,7 +640,7 @@ future<executor::request_return_type> executor::untag_resource(client_state& cli
schema_ptr schema = get_table_from_arn(_proxy, std::string_view(arn->GetString(), arn->GetStringLength()));
std::map<sstring, sstring> tags_map = get_tags_of_table(schema);
update_tags(*tags, schema, std::move(tags_map), update_tags_action::delete_tags).get();
update_tags(_mm, *tags, schema, std::move(tags_map), update_tags_action::delete_tags).get();
return json_string("");
});
}
@@ -870,7 +857,7 @@ future<executor::request_return_type> executor::create_table(client_state& clien
}).then([this, table_info = std::move(table_info), schema] () mutable {
future<> f = make_ready_future<>();
if (rjson::find(table_info, "Tags")) {
f = add_tags(_proxy, schema, table_info);
f = add_tags(_mm, _proxy, schema, table_info);
}
return f.then([table_info = std::move(table_info), schema] () mutable {
rjson::value status = rjson::empty_object();
@@ -1019,13 +1006,22 @@ put_or_delete_item::put_or_delete_item(const rjson::value& item, schema_ptr sche
mutation put_or_delete_item::build(schema_ptr schema, api::timestamp_type ts) {
mutation m(schema, _pk);
auto& row = m.partition().clustered_row(*schema, _ck);
// If there's no clustering key, a tombstone should be created directly
// on a partition, not on a clustering row - otherwise it will look like
// an open-ended range tombstone, which will crash on KA/LA sstable format.
// Ref: #6035
const bool use_partition_tombstone = schema->clustering_key_size() == 0;
if (!_cells) {
// a DeleteItem operation:
row.apply(tombstone(ts, gc_clock::now()));
if (use_partition_tombstone) {
m.partition().apply(tombstone(ts, gc_clock::now()));
} else {
// a DeleteItem operation:
m.partition().clustered_row(*schema, _ck).apply(tombstone(ts, gc_clock::now()));
}
return m;
}
// else, a PutItem operation:
auto& row = m.partition().clustered_row(*schema, _ck);
attribute_collector attrs_collector;
for (auto& c : *_cells) {
const column_definition* cdef = schema->get_column_definition(c.column_name);
@@ -1048,7 +1044,11 @@ mutation put_or_delete_item::build(schema_ptr schema, api::timestamp_type ts) {
// Scylla proper, to implement the operation to replace an entire
// collection ("UPDATE .. SET x = ..") - see
// cql3::update_parameters::make_tombstone_just_before().
row.apply(tombstone(ts-1, gc_clock::now()));
if (use_partition_tombstone) {
m.partition().apply(tombstone(ts-1, gc_clock::now()));
} else {
row.apply(tombstone(ts-1, gc_clock::now()));
}
return m;
}
@@ -1202,11 +1202,6 @@ std::optional<shard_id> rmw_operation::shard_for_execute(bool needs_read_before_
// PutItem, DeleteItem). All these return nothing by default, but can
// optionally return Attributes if requested via the ReturnValues option.
static future<executor::request_return_type> rmw_operation_return(rjson::value&& attributes) {
// As an optimization, in the simple and common case that nothing is to be
// returned, quickly return an empty result:
if (attributes.IsNull()) {
return make_ready_future<executor::request_return_type>(json_string(""));
}
rjson::value ret = rjson::empty_object();
if (!attributes.IsNull()) {
rjson::set(ret, "Attributes", std::move(attributes));
@@ -2773,6 +2768,7 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
[] (std::vector<std::tuple<std::string, std::optional<rjson::value>>> responses) {
rjson::value response = rjson::empty_object();
rjson::set(response, "Responses", rjson::empty_object());
rjson::set(response, "UnprocessedKeys", rjson::empty_object());
for (auto& t : responses) {
if (!response["Responses"].HasMember(std::get<0>(t).c_str())) {
rjson::set_with_string_name(response["Responses"], std::get<0>(t), rjson::empty_array());
@@ -2889,6 +2885,7 @@ static future<executor::request_return_type> do_query(schema_ptr schema,
uint32_t limit,
db::consistency_level cl,
::shared_ptr<cql3::restrictions::statement_restrictions> filtering_restrictions,
query::partition_slice::option_set custom_opts,
service::client_state& client_state,
cql3::cql_stats& cql_stats,
tracing::trace_state_ptr trace_state,
@@ -2909,7 +2906,9 @@ static future<executor::request_return_type> do_query(schema_ptr schema,
auto regular_columns = boost::copy_range<query::column_id_vector>(
schema->regular_columns() | boost::adaptors::transformed([] (const column_definition& cdef) { return cdef.id; }));
auto selection = cql3::selection::selection::wildcard(schema);
auto partition_slice = query::partition_slice(std::move(ck_bounds), {}, std::move(regular_columns), selection->get_query_options());
query::partition_slice::option_set opts = selection->get_query_options();
opts.add(custom_opts);
auto partition_slice = query::partition_slice(std::move(ck_bounds), {}, std::move(regular_columns), opts);
auto command = ::make_lw_shared<query::read_command>(schema->id(), schema->version(), partition_slice, query::max_partitions);
auto query_state_ptr = std::make_unique<service::query_state>(client_state, trace_state, std::move(permit));
@@ -2939,11 +2938,38 @@ static future<executor::request_return_type> do_query(schema_ptr schema,
});
}
static dht::token token_for_segment(int segment, int total_segments) {
assert(total_segments > 1 && segment >= 0 && segment < total_segments);
uint64_t delta = std::numeric_limits<uint64_t>::max() / total_segments;
return dht::token::from_int64(std::numeric_limits<int64_t>::min() + delta * segment);
}
static dht::partition_range get_range_for_segment(int segment, int total_segments) {
if (total_segments == 1) {
return dht::partition_range::make_open_ended_both_sides();
}
if (segment == 0) {
dht::token ending_token = token_for_segment(1, total_segments);
return dht::partition_range::make_ending_with(
dht::partition_range::bound(dht::ring_position::ending_at(ending_token), false));
} else if (segment == total_segments - 1) {
dht::token starting_token = token_for_segment(segment, total_segments);
return dht::partition_range::make_starting_with(
dht::partition_range::bound(dht::ring_position::starting_at(starting_token)));
} else {
dht::token starting_token = token_for_segment(segment, total_segments);
dht::token ending_token = token_for_segment(segment + 1, total_segments);
return dht::partition_range::make(
dht::partition_range::bound(dht::ring_position::starting_at(starting_token)),
dht::partition_range::bound(dht::ring_position::ending_at(ending_token), false)
);
}
}
// TODO(sarna):
// 1. Paging must have 1MB boundary according to the docs. IIRC we do have a replica-side reply size limit though - verify.
// 2. Filtering - by passing appropriately created restrictions to pager as a last parameter
// 3. Proper timeouts instead of gc_clock::now() and db::no_timeout
// 4. Implement parallel scanning via Segments
future<executor::request_return_type> executor::scan(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request) {
_stats.api_operations.scan++;
elogger.trace("Scanning {}", request);
@@ -2954,10 +2980,21 @@ future<executor::request_return_type> executor::scan(client_state& client_state,
return make_ready_future<request_return_type>(api_error("ValidationException",
"FilterExpression is not yet implemented in alternator"));
}
if (get_int_attribute(request, "Segment") || get_int_attribute(request, "TotalSegments")) {
// FIXME: need to support parallel scan. See issue #5059.
return make_ready_future<request_return_type>(api_error("ValidationException",
"Scan Segment/TotalSegments is not yet implemented in alternator"));
auto segment = get_int_attribute(request, "Segment");
auto total_segments = get_int_attribute(request, "TotalSegments");
if (segment || total_segments) {
if (!segment || !total_segments) {
return make_ready_future<request_return_type>(api_error("ValidationException",
"Both Segment and TotalSegments attributes need to be present for a parallel scan"));
}
if (*segment < 0 || *segment >= *total_segments) {
return make_ready_future<request_return_type>(api_error("ValidationException",
"Segment must be non-negative and less than TotalSegments"));
}
if (*total_segments < 0 || *total_segments > 1000000) {
return make_ready_future<request_return_type>(api_error("ValidationException",
"TotalSegments must be non-negative and less or equal to 1000000"));
}
}
rjson::value* exclusive_start_key = rjson::find(request, "ExclusiveStartKey");
@@ -2976,7 +3013,12 @@ future<executor::request_return_type> executor::scan(client_state& client_state,
auto attrs_to_get = calculate_attrs_to_get(request);
dht::partition_range_vector partition_ranges{dht::partition_range::make_open_ended_both_sides()};
dht::partition_range_vector partition_ranges;
if (segment) {
partition_ranges.push_back(get_range_for_segment(*segment, *total_segments));
} else {
partition_ranges.push_back(dht::partition_range::make_open_ended_both_sides());
}
std::vector<query::clustering_range> ck_bounds{query::clustering_range::make_open_ended_both_sides()};
::shared_ptr<cql3::restrictions::statement_restrictions> filtering_restrictions;
@@ -2986,14 +3028,15 @@ future<executor::request_return_type> executor::scan(client_state& client_state,
partition_ranges = filtering_restrictions->get_partition_key_ranges(query_options);
ck_bounds = filtering_restrictions->get_clustering_bounds(query_options);
}
return do_query(schema, exclusive_start_key, std::move(partition_ranges), std::move(ck_bounds), std::move(attrs_to_get), limit, cl, std::move(filtering_restrictions), client_state, _stats.cql_stats, trace_state, std::move(permit));
return do_query(schema, exclusive_start_key, std::move(partition_ranges), std::move(ck_bounds), std::move(attrs_to_get), limit, cl,
std::move(filtering_restrictions), query::partition_slice::option_set(), client_state, _stats.cql_stats, trace_state, std::move(permit));
}
static dht::partition_range calculate_pk_bound(schema_ptr schema, const column_definition& pk_cdef, comparison_operator_type op, const rjson::value& attrs) {
if (attrs.Size() != 1) {
throw api_error("ValidationException", format("Only a single attribute is allowed for a hash key restriction: {}", attrs));
}
bytes raw_value = pk_cdef.type->from_string(attrs[0][type_to_string(pk_cdef.type)].GetString());
bytes raw_value = get_key_from_typed_value(attrs[0], pk_cdef);
partition_key pk = partition_key::from_singular(*schema, pk_cdef.type->deserialize(raw_value));
auto decorated_key = dht::decorate_key(*schema, pk);
if (op != comparison_operator_type::EQ) {
@@ -3018,7 +3061,7 @@ static query::clustering_range calculate_ck_bound(schema_ptr schema, const colum
if (attrs.Size() != expected_attrs_size) {
throw api_error("ValidationException", format("{} arguments expected for a sort key restriction: {}", expected_attrs_size, attrs));
}
bytes raw_value = ck_cdef.type->from_string(attrs[0][type_to_string(ck_cdef.type)].GetString());
bytes raw_value = get_key_from_typed_value(attrs[0], ck_cdef);
clustering_key ck = clustering_key::from_single_value(*schema, raw_value);
switch (op) {
case comparison_operator_type::EQ:
@@ -3032,7 +3075,7 @@ static query::clustering_range calculate_ck_bound(schema_ptr schema, const colum
case comparison_operator_type::GT:
return query::clustering_range::make_starting_with(query::clustering_range::bound(ck, false));
case comparison_operator_type::BETWEEN: {
bytes raw_upper_limit = ck_cdef.type->from_string(attrs[1][type_to_string(ck_cdef.type)].GetString());
bytes raw_upper_limit = get_key_from_typed_value(attrs[1], ck_cdef);
clustering_key upper_limit = clustering_key::from_single_value(*schema, raw_upper_limit);
return query::clustering_range::make(query::clustering_range::bound(ck), query::clustering_range::bound(upper_limit));
}
@@ -3045,9 +3088,7 @@ static query::clustering_range calculate_ck_bound(schema_ptr schema, const colum
if (!ck_cdef.type->is_compatible_with(*utf8_type)) {
throw api_error("ValidationException", format("BEGINS_WITH operator cannot be applied to type {}", type_to_string(ck_cdef.type)));
}
std::string raw_upper_limit_str = attrs[0][type_to_string(ck_cdef.type)].GetString();
bytes raw_upper_limit = ck_cdef.type->from_string(raw_upper_limit_str);
return get_clustering_range_for_begins_with(std::move(raw_upper_limit), ck, schema, ck_cdef.type);
return get_clustering_range_for_begins_with(std::move(raw_value), ck, schema, ck_cdef.type);
}
default:
throw api_error("ValidationException", format("Unknown primary key bound passed: {}", int(op)));
@@ -3429,11 +3470,7 @@ future<executor::request_return_type> executor::query(client_state& client_state
if (rjson::find(request, "FilterExpression")) {
return make_ready_future<request_return_type>(api_error("ValidationException", "FilterExpression is not yet implemented in alternator"));
}
bool forward = get_bool_attribute(request, "ScanIndexForward", true);
if (!forward) {
// FIXME: need to support the !forward (i.e., reverse sort order) case. See issue #5153.
return make_ready_future<request_return_type>(api_error("ValidationException", "ScanIndexForward=false is not yet implemented in alternator"));
}
const bool forward = get_bool_attribute(request, "ScanIndexForward", true);
rjson::value* key_conditions = rjson::find(request, "KeyConditions");
rjson::value* key_condition_expression = rjson::find(request, "KeyConditionExpression");
@@ -3476,7 +3513,10 @@ future<executor::request_return_type> executor::query(client_state& client_state
}
verify_all_are_used(request, "ExpressionAttributeValues", used_attribute_values, "KeyConditionExpression");
verify_all_are_used(request, "ExpressionAttributeNames", used_attribute_names, "KeyConditionExpression");
return do_query(schema, exclusive_start_key, std::move(partition_ranges), std::move(ck_bounds), std::move(attrs_to_get), limit, cl, std::move(filtering_restrictions), client_state, _stats.cql_stats, std::move(trace_state), std::move(permit));
query::partition_slice::option_set opts;
opts.set_if<query::partition_slice::option::reversed>(!forward);
return do_query(schema, exclusive_start_key, std::move(partition_ranges), std::move(ck_bounds), std::move(attrs_to_get), limit, cl,
std::move(filtering_restrictions), opts, client_state, _stats.cql_stats, std::move(trace_state), std::move(permit));
}
future<executor::request_return_type> executor::list_tables(client_state& client_state, service_permit permit, rjson::value request) {
@@ -3567,12 +3607,12 @@ static std::map<sstring, sstring> get_network_topology_options(int rf) {
// manually create the keyspace to override this predefined behavior.
future<> executor::create_keyspace(std::string_view keyspace_name) {
sstring keyspace_name_str(keyspace_name);
return gms::get_up_endpoint_count().then([this, keyspace_name_str = std::move(keyspace_name_str)] (int up_endpoint_count) {
return gms::get_all_endpoint_count().then([this, keyspace_name_str = std::move(keyspace_name_str)] (int endpoint_count) {
int rf = 3;
if (up_endpoint_count < rf) {
if (endpoint_count < rf) {
rf = 1;
elogger.warn("Creating keyspace '{}' for Alternator with unsafe RF={} because cluster only has {} live nodes.",
keyspace_name_str, rf, up_endpoint_count);
elogger.warn("Creating keyspace '{}' for Alternator with unsafe RF={} because cluster only has {} nodes.",
keyspace_name_str, rf, endpoint_count);
}
auto opts = get_network_topology_options(rf);
auto ksm = keyspace_metadata::new_keyspace(keyspace_name_str, "org.apache.cassandra.locator.NetworkTopologyStrategy", std::move(opts), true);
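
The parallel-scan hunk earlier in this file (`token_for_segment` / `get_range_for_segment`) splits the signed 64-bit token space into `total_segments` equal slices. The arithmetic can be sketched in isolation as below. This is an illustration only: `segment_start` is a hypothetical name, and plain `int64_t` values stand in for `dht::token`.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical stand-in for token_for_segment(): returns the first token
// of the given segment as a raw int64_t.
// The slice width is computed in unsigned arithmetic so that the full
// 2^64-wide token space can be divided without signed overflow.
int64_t segment_start(int segment, int total_segments) {
    assert(total_segments > 1 && segment >= 0 && segment < total_segments);
    const uint64_t delta = std::numeric_limits<uint64_t>::max() / total_segments;
    const uint64_t base = static_cast<uint64_t>(std::numeric_limits<int64_t>::min());
    // Unsigned addition wraps modulo 2^64; the cast maps the result back
    // onto the signed token line.
    return static_cast<int64_t>(base + delta * static_cast<uint64_t>(segment));
}
```

With `total_segments == 4`, segment 0 starts at the minimum token and segment 2 starts at `-2`, one slice-rounding step below the midpoint of the token line. Segment `i` then covers the tokens from `segment_start(i, n)` up to, but not including, `segment_start(i + 1, n)`, with the first and last segments left open-ended as in the patch.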

View File

@@ -54,26 +54,22 @@ static sstring validate_keyspace(http_context& ctx, const parameters& param) {
throw bad_param_exception("Keyspace " + param["keyspace"] + " Does not exist");
}
static std::vector<ss::token_range> describe_ring(const sstring& keyspace) {
std::vector<ss::token_range> res;
for (auto d : service::get_local_storage_service().describe_ring(keyspace)) {
ss::token_range r;
r.start_token = d._start_token;
r.end_token = d._end_token;
r.endpoints = d._endpoints;
r.rpc_endpoints = d._rpc_endpoints;
for (auto det : d._endpoint_details) {
ss::endpoint_detail ed;
ed.host = det._host;
ed.datacenter = det._datacenter;
if (det._rack != "") {
ed.rack = det._rack;
}
r.endpoint_details.push(ed);
static ss::token_range token_range_endpoints_to_json(const dht::token_range_endpoints& d) {
ss::token_range r;
r.start_token = d._start_token;
r.end_token = d._end_token;
r.endpoints = d._endpoints;
r.rpc_endpoints = d._rpc_endpoints;
for (auto det : d._endpoint_details) {
ss::endpoint_detail ed;
ed.host = det._host;
ed.datacenter = det._datacenter;
if (det._rack != "") {
ed.rack = det._rack;
}
res.push_back(r);
r.endpoint_details.push(ed);
}
return res;
return r;
}
using ks_cf_func = std::function<future<json::json_return_type>(http_context&, std::unique_ptr<request>, sstring, std::vector<sstring>)>;
@@ -175,13 +171,13 @@ void set_storage_service(http_context& ctx, routes& r) {
return make_ready_future<json::json_return_type>(res);
});
ss::describe_any_ring.set(r, [&ctx](const_req req) {
return describe_ring("");
ss::describe_any_ring.set(r, [&ctx](std::unique_ptr<request> req) {
return make_ready_future<json::json_return_type>(stream_range_as_array(service::get_local_storage_service().describe_ring(""), token_range_endpoints_to_json));
});
ss::describe_ring.set(r, [&ctx](const_req req) {
auto keyspace = validate_keyspace(ctx, req.param);
return describe_ring(keyspace);
ss::describe_ring.set(r, [&ctx](std::unique_ptr<request> req) {
auto keyspace = validate_keyspace(ctx, req->param);
return make_ready_future<json::json_return_type>(stream_range_as_array(service::get_local_storage_service().describe_ring(keyspace), token_range_endpoints_to_json));
});
ss::get_host_id_map.set(r, [&ctx](const_req req) {
@@ -1000,6 +996,9 @@ void set_snapshot(http_context& ctx, routes& r) {
if (column_family.empty()) {
resp = service::get_local_storage_service().take_snapshot(tag, keynames);
} else {
if (keynames.empty()) {
throw httpd::bad_param_exception("The keyspace of column families must be specified");
}
if (keynames.size() > 1) {
throw httpd::bad_param_exception("Only one keyspace allowed when specifying a column family");
}

View File

@@ -33,6 +33,7 @@
#include "auth/resource.hh"
#include "seastarx.hh"
#include "exceptions/exceptions.hh"
namespace auth {
@@ -52,9 +53,9 @@ struct role_config_update final {
///
/// A logical argument error for a role-management operation.
///
class roles_argument_exception : public std::invalid_argument {
class roles_argument_exception : public exceptions::invalid_request_exception {
public:
using std::invalid_argument::invalid_argument;
using exceptions::invalid_request_exception::invalid_request_exception;
};
class role_already_exists : public roles_argument_exception {

View File

@@ -30,10 +30,12 @@ std::atomic<int64_t> clocks_offset;
std::ostream& operator<<(std::ostream& os, db_clock::time_point tp) {
auto t = db_clock::to_time_t(tp);
return os << std::put_time(std::gmtime(&t), "%Y/%m/%d %T");
::tm t_buf;
return os << std::put_time(::gmtime_r(&t, &t_buf), "%Y/%m/%d %T");
}
std::string format_timestamp(api::timestamp_type ts) {
auto t = std::time_t(std::chrono::duration_cast<std::chrono::seconds>(api::timestamp_clock::duration(ts)).count());
return format("{}", std::put_time(std::gmtime(&t), "%Y/%m/%d %T"));
::tm t_buf;
return format("{}", std::put_time(::gmtime_r(&t, &t_buf), "%Y/%m/%d %T"));
}

View File

@@ -87,17 +87,14 @@ template<typename ToType>
std::function<data_value(data_value)> make_castas_fctn_from_decimal_to_float() {
return [](data_value from) -> data_value {
auto val_from = value_cast<big_decimal>(from);
boost::multiprecision::cpp_int ten(10);
boost::multiprecision::cpp_rational r = val_from.unscaled_value();
r /= boost::multiprecision::pow(ten, val_from.scale());
return static_cast<ToType>(r);
return static_cast<ToType>(val_from.as_rational());
};
}
static utils::multiprecision_int from_decimal_to_cppint(const data_value& from) {
const auto& val_from = value_cast<big_decimal>(from);
boost::multiprecision::cpp_int ten(10);
return boost::multiprecision::cpp_int(val_from.unscaled_value() / boost::multiprecision::pow(ten, val_from.scale()));
auto r = val_from.as_rational();
return utils::multiprecision_int(numerator(r)/denominator(r));
}
template<typename ToType>

View File

@@ -49,7 +49,7 @@ relation::to_column_definition(const schema& schema, const column_identifier::ra
auto id = entity.prepare_column_identifier(schema);
auto def = get_column_definition(schema, *id);
if (!def || def->is_hidden_from_cql()) {
throw exceptions::unrecognized_entity_exception(id, shared_from_this());
throw exceptions::unrecognized_entity_exception(*id, to_string());
}
return *def;
}

View File

@@ -697,6 +697,11 @@ static query::range<bytes_view> to_range(const term_slice& slice, const query_op
extract_bound(statements::bound::END));
}
static bool contains_without_wraparound(
const query::range<bytes_view>& range, bytes_view value, const serialized_tri_compare& cmp) {
return !range.is_wrap_around(cmp) && range.contains(value, cmp);
}
bool single_column_restriction::slice::is_satisfied_by(const schema& schema,
const partition_key& key,
const clustering_key_prefix& ckey,
@@ -711,7 +716,8 @@ bool single_column_restriction::slice::is_satisfied_by(const schema& schema,
return false;
}
return cell_value->with_linearized([&] (bytes_view cell_value_bv) {
return to_range(_slice, options).contains(cell_value_bv, _column_def.type->as_tri_comparator());
return contains_without_wraparound(to_range(_slice, options),
cell_value_bv, _column_def.type->as_tri_comparator());
});
}
@@ -719,7 +725,8 @@ bool single_column_restriction::slice::is_satisfied_by(bytes_view data, const qu
if (_column_def.type->is_counter()) {
fail(unimplemented::cause::COUNTERS);
}
return to_range(_slice, options).contains(data, _column_def.type->underlying_type()->as_tri_comparator());
return contains_without_wraparound(to_range(_slice, options),
data, _column_def.type->underlying_type()->as_tri_comparator());
}
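
The `contains_without_wraparound` helper above guards against a slice whose start lies above its end being read as a wrap-around interval. The effect can be shown with a simplified model. Everything here is an assumption made for illustration: `open_interval` is a hypothetical type with two exclusive integer bounds, not `query::range<bytes_view>`.

```cpp
#include <cassert>

// Hypothetical model of a slice "start < v < end" over integers.
struct open_interval {
    int start; // exclusive lower bound
    int end;   // exclusive upper bound
};

// In this model an interval wraps around when its start is above its end.
bool is_wrap_around(const open_interval& r) {
    return r.start > r.end;
}

// Wrap-around semantics: a wrapping interval matches everything above
// start plus everything below end.
bool contains(const open_interval& r, int v) {
    if (is_wrap_around(r)) {
        return v > r.start || v < r.end;
    }
    return v > r.start && v < r.end;
}

// Restriction semantics: "v > start AND v < end" with start above end
// can never be satisfied, so short-circuit to false.
bool contains_without_wraparound(const open_interval& r, int v) {
    return !is_wrap_around(r) && contains(r, v);
}
```

For a restriction such as `a > 5 AND a < 1`, plain `contains` accepts 7, while `contains_without_wraparound` rejects every value, which is what the user's WHERE clause asks for.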
bool single_column_restriction::contains::is_satisfied_by(const schema& schema,

View File

@@ -68,6 +68,7 @@ batch_statement::batch_statement(int bound_terms, type type_,
, _has_conditions(boost::algorithm::any_of(_statements, [] (auto&& s) { return s.statement->has_conditions(); }))
, _stats(stats)
{
validate();
if (has_conditions()) {
// A batch can be created not only by raw::batch_statement::prepare, but also by
// cql_server::connection::process_batch, which doesn't call any methods of
@@ -448,7 +449,6 @@ batch_statement::prepare(database& db, cql_stats& stats) {
prep_attrs->collect_marker_specification(bound_names);
cql3::statements::batch_statement batch_statement_(bound_names.size(), _type, std::move(statements), std::move(prep_attrs), stats);
batch_statement_.validate();
std::vector<uint16_t> partition_key_bind_indices;
if (!have_multiple_cfs && batch_statement_.get_statements().size() > 0) {

View File

@@ -255,7 +255,9 @@ void cf_prop_defs::apply_to_builder(schema_builder& builder, schema::extensions_
}
}
builder.set_default_time_to_live(gc_clock::duration(get_int(KW_DEFAULT_TIME_TO_LIVE, DEFAULT_DEFAULT_TIME_TO_LIVE)));
if (has_property(KW_DEFAULT_TIME_TO_LIVE)) {
builder.set_default_time_to_live(gc_clock::duration(get_int(KW_DEFAULT_TIME_TO_LIVE, DEFAULT_DEFAULT_TIME_TO_LIVE)));
}
if (has_property(KW_SPECULATIVE_RETRY)) {
builder.set_speculative_retry(get_string(KW_SPECULATIVE_RETRY, builder.get_speculative_retry().to_sstring()));

View File

@@ -434,6 +434,12 @@ GCC6_CONCEPT(
static KeyType
generate_base_key_from_index_pk(const partition_key& index_pk, const std::optional<clustering_key>& index_ck, const schema& base_schema, const schema& view_schema) {
const auto& base_columns = std::is_same_v<KeyType, partition_key> ? base_schema.partition_key_columns() : base_schema.clustering_key_columns();
// An empty key in the index paging state translates to an empty base key
if (index_pk.is_empty() && !index_ck) {
return KeyType::make_empty();
}
std::vector<bytes_view> exploded_base_key;
exploded_base_key.reserve(base_columns.size());
@@ -507,8 +513,7 @@ indexed_table_select_statement::do_execute_base_query(
if (old_paging_state && concurrency == 1) {
auto base_pk = generate_base_key_from_index_pk<partition_key>(old_paging_state->get_partition_key(),
old_paging_state->get_clustering_key(), *_schema, *_view_schema);
if (_schema->clustering_key_size() > 0) {
assert(old_paging_state->get_clustering_key().has_value());
if (old_paging_state->get_clustering_key() && _schema->clustering_key_size() > 0) {
auto base_ck = generate_base_key_from_index_pk<clustering_key>(old_paging_state->get_partition_key(),
old_paging_state->get_clustering_key(), *_schema, *_view_schema);
command->slice.set_range(*_schema, base_pk,
@@ -1362,8 +1367,8 @@ select_statement::prepare_restrictions(database& db,
return ::make_shared<restrictions::statement_restrictions>(db, schema, statement_type::SELECT, std::move(_where_clause), bound_names,
selection->contains_only_static_columns(), selection->contains_a_collection(), for_view, allow_filtering);
} catch (const exceptions::unrecognized_entity_exception& e) {
if (contains_alias(*e.entity)) {
throw exceptions::invalid_request_exception(format("Aliases aren't allowed in the where clause ('{}')", e.relation->to_string()));
if (contains_alias(e.entity)) {
throw exceptions::invalid_request_exception(format("Aliases aren't allowed in the where clause ('{}')", e.relation_str));
}
throw;
}

View File

@@ -1323,7 +1323,7 @@ future<mutation> database::do_apply_counter_update(column_family& cf, const froz
// counter state for each modified cell...
tracing::trace(trace_state, "Reading counter values from the CF");
return counter_write_query(m_schema, cf.as_mutation_source(), m.decorated_key(), slice, trace_state)
return counter_write_query(m_schema, cf.as_mutation_source(), m.decorated_key(), slice, trace_state, timeout)
.then([this, &cf, &m, m_schema, timeout, trace_state] (auto mopt) {
// ...now, that we got existing state of all affected counter
// cells we can look for our shard in each of them, increment

View File

@@ -614,11 +614,17 @@ public:
future<sseg_ptr> terminate() {
assert(_closed);
if (!std::exchange(_terminated, true)) {
clogger.trace("{} is closed but not terminated.", *this);
if (_buffer.empty()) {
new_buffer(0);
// write a terminating zero block iff we are ending (a reused)
// block before actual file end.
// we should only get here when all actual data is
// already flushed (see below, close()).
if (size_on_disk() < _segment_manager->max_size) {
clogger.trace("{} is closed but not terminated.", *this);
if (_buffer.empty()) {
new_buffer(0);
}
return cycle(true, true);
}
return cycle(true, true);
}
return make_ready_future<sseg_ptr>(shared_from_this());
}
@@ -2127,8 +2133,9 @@ db::commitlog::read_log_file(const sstring& filename, const sstring& pfx, seasta
}).handle_exception([w](auto ep) {
w->s.set_exception(ep);
});
return ret.done();
// #6265 - must keep subscription alive.
auto res = ret.done();
return res.finally([ret = std::move(ret)] {});
});
}

View File

@@ -681,7 +681,7 @@ db::config::config(std::shared_ptr<db::extensions> exts)
, replace_address(this, "replace_address", value_status::Used, "", "The listen_address or broadcast_address of the dead node to replace. Same as -Dcassandra.replace_address.")
, replace_address_first_boot(this, "replace_address_first_boot", value_status::Used, "", "Like replace_address option, but if the node has been bootstrapped successfully it will be ignored. Same as -Dcassandra.replace_address_first_boot.")
, override_decommission(this, "override_decommission", value_status::Used, false, "Set true to force a decommissioned node to join the cluster")
, enable_repair_based_node_ops(this, "enable_repair_based_node_ops", liveness::LiveUpdate, value_status::Used, true, "Set true to use enable repair based node operations instead of streaming based")
, enable_repair_based_node_ops(this, "enable_repair_based_node_ops", liveness::LiveUpdate, value_status::Used, false, "Set true to use enable repair based node operations instead of streaming based")
, ring_delay_ms(this, "ring_delay_ms", value_status::Used, 30 * 1000, "Time a node waits to hear from other nodes before joining the ring in milliseconds. Same as -Dcassandra.ring_delay_ms in cassandra.")
, shadow_round_ms(this, "shadow_round_ms", value_status::Used, 300 * 1000, "The maximum gossip shadow round time. Can be used to reduce the gossip feature check time during node boot up.")
, fd_max_interval_ms(this, "fd_max_interval_ms", value_status::Used, 2 * 1000, "The maximum failure_detector interval time in milliseconds. Interval larger than the maximum will be ignored. Larger cluster may need to increase the default.")
@@ -689,6 +689,7 @@ db::config::config(std::shared_ptr<db::extensions> exts)
, shutdown_announce_in_ms(this, "shutdown_announce_in_ms", value_status::Used, 2 * 1000, "Time a node waits after sending gossip shutdown message in milliseconds. Same as -Dcassandra.shutdown_announce_in_ms in cassandra.")
, developer_mode(this, "developer_mode", value_status::Used, false, "Relax environment checks. Setting to true can reduce performance and reliability significantly.")
, skip_wait_for_gossip_to_settle(this, "skip_wait_for_gossip_to_settle", value_status::Used, -1, "An integer to configure the wait for gossip to settle. -1: wait normally, 0: do not wait at all, n: wait for at most n polls. Same as -Dcassandra.skip_wait_for_gossip_to_settle in cassandra.")
, force_gossip_generation(this, "force_gossip_generation", liveness::LiveUpdate, value_status::Used, -1 , "Force gossip to use the generation number provided by user")
, experimental(this, "experimental", value_status::Used, false, "Set to true to unlock all experimental features.")
, experimental_features(this, "experimental_features", value_status::Used, {}, "Unlock experimental features provided as the option arguments (possible values: 'lwt', 'cdc', 'udf'). Can be repeated.")
, lsa_reclamation_step(this, "lsa_reclamation_step", value_status::Used, 1, "Minimum number of segments to reclaim in a single step")
@@ -859,7 +860,7 @@ db::fs::path db::config::get_conf_sub(db::fs::path sub) {
}
bool db::config::check_experimental(experimental_features_t::feature f) const {
if (experimental()) {
if (experimental() && f != experimental_features_t::UNUSED) {
return true;
}
const auto& optval = experimental_features();
@@ -911,11 +912,13 @@ const db::extensions& db::config::extensions() const {
std::unordered_map<sstring, db::experimental_features_t::feature> db::experimental_features_t::map() {
// We decided against using the construct-on-first-use idiom here:
// https://github.com/scylladb/scylla/pull/5369#discussion_r353614807
return {{"lwt", LWT}, {"udf", UDF}, {"cdc", CDC}};
// Lightweight transactions are no longer experimental. Map them
// to UNUSED switch for a while, then remove altogether.
return {{"lwt", UNUSED}, {"udf", UDF}, {"cdc", CDC}};
}
std::vector<enum_option<db::experimental_features_t>> db::experimental_features_t::all() {
return {LWT, UDF, CDC};
return {UDF, CDC};
}
template struct utils::config_file::named_value<seastar::log_level>;
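
The effect of the UNUSED placeholder on check_experimental() can be modelled outside the codebase. The following is a minimal Python sketch, not the Scylla implementation: the Feature enum, FEATURE_NAMES and the membership test at the end are assumptions standing in for the parts of the C++ function that the hunk does not show.

```python
from enum import Enum


class Feature(Enum):
    UNUSED = 0  # placeholder that the retired "lwt" name now maps to
    UDF = 1
    CDC = 2


# Models experimental_features_t::map(): "lwt" is still accepted, but inert.
FEATURE_NAMES = {"lwt": Feature.UNUSED, "udf": Feature.UDF, "cdc": Feature.CDC}


def check_experimental(feature, experimental, experimental_features):
    """Return True if `feature` should be treated as enabled.

    `experimental` models the blanket `experimental` option and
    `experimental_features` the parsed `experimental_features` list.
    """
    if experimental and feature is not Feature.UNUSED:
        return True
    # Assumption: the elided remainder of the C++ function is a membership test.
    return feature in experimental_features


# An old configuration that still lists "lwt" keeps parsing without error.
enabled = [FEATURE_NAMES[name] for name in ("lwt", "cdc")]
```

With this model, the blanket option enables UDF and CDC but never UNUSED, so a leftover "lwt" entry in an old configuration is harmless.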

View File

@@ -81,7 +81,7 @@ namespace db {
/// Enumeration of all valid values for the `experimental` config entry.
struct experimental_features_t {
enum feature { LWT, UDF, CDC };
enum feature { UNUSED, UDF, CDC };
static std::unordered_map<sstring, feature> map(); // See enum_option.
static std::vector<enum_option<experimental_features_t>> all();
};
@@ -278,6 +278,7 @@ public:
named_value<uint32_t> shutdown_announce_in_ms;
named_value<bool> developer_mode;
named_value<int32_t> skip_wait_for_gossip_to_settle;
named_value<int32_t> force_gossip_generation;
named_value<bool> experimental;
named_value<std::vector<enum_option<experimental_features_t>>> experimental_features;
named_value<size_t> lsa_reclamation_step;

View File

@@ -703,6 +703,7 @@ future<> manager::end_point_hints_manager::sender::send_one_hint(lw_shared_ptr<s
// Files are aggregated for at most manager::hints_timer_period therefore the oldest hint there is
// (last_modification - manager::hints_timer_period) old.
if (gc_clock::now().time_since_epoch() - secs_since_file_mod > gc_grace_sec - manager::hints_flush_period) {
ctx_ptr->rps_set.erase(rp);
return make_ready_future<>();
}
@@ -725,6 +726,7 @@ future<> manager::end_point_hints_manager::sender::send_one_hint(lw_shared_ptr<s
manager_logger.debug("send_hints(): {} at {}: {}", fname, rp, e.what());
++this->shard_stats().discarded;
}
ctx_ptr->rps_set.erase(rp);
return make_ready_future<>();
}).finally([units = std::move(units), ctx_ptr] {});
}).handle_exception([this, ctx_ptr] (auto eptr) {

View File

@@ -187,7 +187,7 @@ schema_ptr batchlog() {
{{"cf_id", uuid_type}},
// regular columns
{
{"in_progress_ballot", timeuuid_type},
{"promise", timeuuid_type},
{"most_recent_commit", bytes_type}, // serialization format is defined by frozen_mutation idl
{"most_recent_commit_at", timeuuid_type},
{"proposal", bytes_type}, // serialization format is defined by frozen_mutation idl
@@ -203,6 +203,7 @@ schema_ptr batchlog() {
// operations on resulting CFMetaData:
// .compactionStrategyClass(LeveledCompactionStrategy.class);
)));
builder.set_gc_grace_seconds(0);
builder.with_version(generate_schema_version(builder.uuid()));
builder.set_wait_for_sync_to_commitlog(true);
return builder.build(schema_builder::compact_storage::no);
@@ -226,6 +227,7 @@ schema_ptr built_indexes() {
// comment
"built column indexes"
)));
builder.set_gc_grace_seconds(0);
builder.with_version(generate_schema_version(builder.uuid()));
return builder.build(schema_builder::compact_storage::yes);
}();
@@ -272,6 +274,7 @@ schema_ptr built_indexes() {
// comment
"information about the local node"
)));
builder.set_gc_grace_seconds(0);
builder.with_version(generate_schema_version(builder.uuid()));
builder.remove_column("scylla_cpu_sharding_algorithm");
builder.remove_column("scylla_nr_shards");
@@ -307,6 +310,7 @@ schema_ptr built_indexes() {
// comment
"information about known peers in the cluster"
)));
builder.set_gc_grace_seconds(0);
builder.with_version(generate_schema_version(builder.uuid()));
return builder.build(schema_builder::compact_storage::no);
}();
@@ -331,6 +335,7 @@ schema_ptr built_indexes() {
// comment
"events related to peers"
)));
builder.set_gc_grace_seconds(0);
builder.with_version(generate_schema_version(builder.uuid()));
return builder.build(schema_builder::compact_storage::no);
}();
@@ -353,6 +358,7 @@ schema_ptr built_indexes() {
// comment
"ranges requested for transfer"
)));
builder.set_gc_grace_seconds(0);
builder.with_version(generate_schema_version(builder.uuid()));
return builder.build(schema_builder::compact_storage::no);
}();
@@ -490,6 +496,7 @@ schema_ptr size_estimates() {
// comment
"partitions larger than specified threshold"
)));
builder.set_gc_grace_seconds(0);
builder.with_version(generate_schema_version(builder.uuid()));
return builder.build(schema_builder::compact_storage::no);
}();
@@ -510,6 +517,7 @@ static schema_ptr large_rows() {
.with_column("compaction_time", timestamp_type)
.set_comment("rows larger than specified threshold")
.with_version(generate_schema_version(id))
.set_gc_grace_seconds(0)
.build();
}();
return large_rows;
@@ -530,6 +538,7 @@ static schema_ptr large_cells() {
.with_column("compaction_time", timestamp_type)
.set_comment("cells larger than specified threshold")
.with_version(generate_schema_version(id))
.set_gc_grace_seconds(0)
.build();
}();
return large_cells;
@@ -553,6 +562,7 @@ static schema_ptr large_cells() {
// comment
"Scylla specific information about the local node"
)));
builder.set_gc_grace_seconds(0);
builder.with_version(generate_schema_version(builder.uuid()));
return builder.build(schema_builder::compact_storage::no);
}();
@@ -666,6 +676,7 @@ schema_ptr local() {
// comment
"information about the local node"
)));
builder.set_gc_grace_seconds(0);
builder.with_version(generate_schema_version(builder.uuid()));
return builder.build(schema_builder::compact_storage::no);
}();
@@ -693,6 +704,7 @@ schema_ptr truncated() {
// comment
"information about table truncation"
)));
builder.set_gc_grace_seconds(0);
builder.with_version(generate_schema_version(builder.uuid()));
return builder.build(schema_builder::compact_storage::no);
}();
@@ -755,6 +767,7 @@ schema_ptr available_ranges() {
// comment
"available keyspace/ranges during bootstrap/replace that are ready to be served"
)));
builder.set_gc_grace_seconds(0);
builder.with_version(generate_schema_version(builder.uuid()));
return builder.build();
}();
@@ -777,6 +790,7 @@ schema_ptr views_builds_in_progress() {
// comment
"views builds current progress"
)));
builder.set_gc_grace_seconds(0);
builder.with_version(generate_schema_version(builder.uuid()));
return builder.build();
}();
@@ -799,6 +813,7 @@ schema_ptr built_views() {
// comment
"built views"
)));
builder.set_gc_grace_seconds(0);
builder.with_version(generate_schema_version(builder.uuid()));
return builder.build();
}();
@@ -842,6 +857,7 @@ schema_ptr scylla_views_builds_in_progress() {
// comment
"CDC-specific information that the local node stores"
)));
builder.set_gc_grace_seconds(0);
builder.with_version(generate_schema_version(builder.uuid()));
return builder.build(schema_builder::compact_storage::no);
}();
@@ -2196,13 +2212,13 @@ future<service::paxos::paxos_state> load_paxos_state(const partition_key& key, s
// FIXME: we need execute_cql_with_now()
(void)now;
auto f = execute_cql_with_timeout(cql, timeout, to_legacy(*key.get_compound_type(*s), key.representation()), s->id());
return f.then([s] (shared_ptr<cql3::untyped_result_set> results) mutable {
return f.then([s, key] (shared_ptr<cql3::untyped_result_set> results) mutable {
if (results->empty()) {
return service::paxos::paxos_state();
}
auto& row = results->one();
auto promised = row.has("in_progress_ballot")
? row.get_as<utils::UUID>("in_progress_ballot") : utils::UUID_gen::min_time_UUID(0);
auto promised = row.has("promise")
? row.get_as<utils::UUID>("promise") : utils::UUID_gen::min_time_UUID(0);
std::optional<service::paxos::proposal> accepted;
if (row.has("proposal")) {
@@ -2211,9 +2227,14 @@ future<service::paxos::paxos_state> load_paxos_state(const partition_key& key, s
}
std::optional<service::paxos::proposal> most_recent;
if (row.has("most_recent_commit")) {
if (row.has("most_recent_commit_at")) {
// the value can be missing if it was pruned; supply an empty one since
// it is not going to be used anyway
auto fm = row.has("most_recent_commit") ?
ser::deserialize_from_buffer<>(row.get_blob("most_recent_commit"), boost::type<frozen_mutation>(), 0) :
freeze(mutation(s, key));
most_recent = service::paxos::proposal(row.get_as<utils::UUID>("most_recent_commit_at"),
ser::deserialize_from_buffer<>(row.get_blob("most_recent_commit"), boost::type<frozen_mutation>(), 0));
std::move(fm));
}
return service::paxos::paxos_state(promised, std::move(accepted), std::move(most_recent));
@@ -2228,7 +2249,7 @@ static int32_t paxos_ttl_sec(const schema& s) {
}
future<> save_paxos_promise(const schema& s, const partition_key& key, const utils::UUID& ballot, db::timeout_clock::time_point timeout) {
static auto cql = format("UPDATE system.{} USING TIMESTAMP ? AND TTL ? SET in_progress_ballot = ? WHERE row_key = ? AND cf_id = ?", PAXOS);
static auto cql = format("UPDATE system.{} USING TIMESTAMP ? AND TTL ? SET promise = ? WHERE row_key = ? AND cf_id = ?", PAXOS);
return execute_cql_with_timeout(cql,
timeout,
utils::UUID_gen::micros_timestamp(ballot),
@@ -2240,13 +2261,14 @@ future<> save_paxos_promise(const schema& s, const partition_key& key, const uti
}
future<> save_paxos_proposal(const schema& s, const service::paxos::proposal& proposal, db::timeout_clock::time_point timeout) {
static auto cql = format("UPDATE system.{} USING TIMESTAMP ? AND TTL ? SET proposal_ballot = ?, proposal = ? WHERE row_key = ? AND cf_id = ?", PAXOS);
static auto cql = format("UPDATE system.{} USING TIMESTAMP ? AND TTL ? SET promise = ?, proposal_ballot = ?, proposal = ? WHERE row_key = ? AND cf_id = ?", PAXOS);
partition_key_view key = proposal.update.key(s);
return execute_cql_with_timeout(cql,
timeout,
utils::UUID_gen::micros_timestamp(proposal.ballot),
paxos_ttl_sec(s),
proposal.ballot,
proposal.ballot,
ser::serialize_to_buffer<bytes>(proposal.update),
to_legacy(*key.get_compound_type(s), key.representation()),
s.id()
@@ -2274,6 +2296,20 @@ future<> save_paxos_decision(const schema& s, const service::paxos::proposal& de
).discard_result();
}
future<> delete_paxos_decision(const schema& s, const partition_key& key, const utils::UUID& ballot, db::timeout_clock::time_point timeout) {
// This should be called only if the learn stage succeeded on all replicas.
// In that case we can remove the learned Paxos value using the ballot's timestamp, which
// guarantees that a more recent round, if there is one, will not be affected.
static auto cql = format("DELETE most_recent_commit FROM system.{} USING TIMESTAMP ? WHERE row_key = ? AND cf_id = ?", PAXOS);
return execute_cql_with_timeout(cql,
timeout,
utils::UUID_gen::micros_timestamp(ballot),
to_legacy(*key.get_compound_type(s), key.representation()),
s.id()
).discard_result();
}
} // namespace system_keyspace
sstring system_keyspace_name() {
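
The interaction between delete_paxos_decision() and load_paxos_state() is easier to see in a small model. The sketch below is Python, not the real code: rows are plain dicts holding only the cells that are set, ballots are integers, and MIN_BALLOT, EMPTY_MUTATION and the handling of the accepted proposal (which the hunk elides) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

MIN_BALLOT = 0  # stand-in for utils::UUID_gen::min_time_UUID(0)
EMPTY_MUTATION = b""  # stand-in for freeze(mutation(s, key))


@dataclass
class Proposal:
    ballot: int
    update: bytes


@dataclass
class PaxosState:
    promised: int
    accepted: Optional[Proposal]
    most_recent: Optional[Proposal]


def load_paxos_state(row):
    """Build a PaxosState from a dict holding only the cells that are set."""
    promised = row.get("promise", MIN_BALLOT)
    accepted = None
    if "proposal" in row:
        # Assumption: the elided C++ pairs "proposal" with "proposal_ballot".
        accepted = Proposal(row["proposal_ballot"], row["proposal"])
    most_recent = None
    if "most_recent_commit_at" in row:
        # The commit body may have been pruned; keep the ballot regardless.
        body = row.get("most_recent_commit", EMPTY_MUTATION)
        most_recent = Proposal(row["most_recent_commit_at"], body)
    return PaxosState(promised, accepted, most_recent)


# A row whose commit body was pruned still reports the committed ballot.
pruned = load_paxos_state({"promise": 7, "most_recent_commit_at": 5})
```

The point of the change is that the presence of most_recent_commit_at, not of most_recent_commit, now decides whether a committed round is reported.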

View File

@@ -647,6 +647,7 @@ future<service::paxos::paxos_state> load_paxos_state(const partition_key& key, s
future<> save_paxos_promise(const schema& s, const partition_key& key, const utils::UUID& ballot, db::timeout_clock::time_point timeout);
future<> save_paxos_proposal(const schema& s, const service::paxos::proposal& proposal, db::timeout_clock::time_point timeout);
future<> save_paxos_decision(const schema& s, const service::paxos::proposal& decision, db::timeout_clock::time_point timeout);
future<> delete_paxos_decision(const schema& s, const partition_key& key, const utils::UUID& ballot, db::timeout_clock::time_point timeout);
} // namespace system_keyspace
} // namespace db

View File

@@ -1101,6 +1101,8 @@ future<> mutate_MV(
}
};
if (paired_endpoint) {
// If paired endpoint is present, remove it from the list of pending endpoints to avoid duplicates
pending_endpoints.erase(std::remove(pending_endpoints.begin(), pending_endpoints.end(), *paired_endpoint), pending_endpoints.end());
// When paired endpoint is the local node, we can just apply
// the mutation locally, unless there are pending endpoints, in
// which case we want to do an ordinary write so the view mutation
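
The erase/remove call added above drops every occurrence of the paired endpoint from the pending list. A rough Python equivalent, with a hypothetical function name and string addresses in place of the real endpoint type:

```python
def remove_paired(pending_endpoints, paired_endpoint):
    """Drop every occurrence of the paired endpoint from the pending list."""
    if paired_endpoint is None:
        # No paired endpoint: the pending list is left as it is.
        return list(pending_endpoints)
    return [ep for ep in pending_endpoints if ep != paired_endpoint]


pending = remove_paired(["10.0.0.1", "10.0.0.2", "10.0.0.1"], "10.0.0.1")
```

After this step the view update cannot be sent to the same node twice, once as the paired endpoint and once as a pending one.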

View File

@@ -118,7 +118,7 @@ token token::midpoint(const token& t1, const token& t2) {
}
token token::get_random_token() {
return {kind::key, dht::get_random_number<int64_t>()};
return token(kind::key, dht::get_random_number<uint64_t>());
}
token token::from_sstring(const sstring& t) {

View File

@@ -58,19 +58,27 @@ public:
, _data(normalize(d)) { }
token(kind k, const bytes& b) : _kind(std::move(k)) {
if (b.size() != sizeof(_data)) {
throw std::runtime_error(fmt::format("Wrong token bytes size: expected {} but got {}", sizeof(_data), b.size()));
if (_kind != kind::key) {
_data = 0;
} else {
if (b.size() != sizeof(_data)) {
throw std::runtime_error(fmt::format("Wrong token bytes size: expected {} but got {}", sizeof(_data), b.size()));
}
std::copy_n(b.begin(), sizeof(_data), reinterpret_cast<int8_t *>(&_data));
_data = net::ntoh(_data);
}
std::copy_n(b.begin(), sizeof(_data), reinterpret_cast<int8_t *>(&_data));
_data = net::ntoh(_data);
}
token(kind k, bytes_view b) : _kind(std::move(k)) {
if (b.size() != sizeof(_data)) {
throw std::runtime_error(fmt::format("Wrong token bytes size: expected {} but got {}", sizeof(_data), b.size()));
if (_kind != kind::key) {
_data = 0;
} else {
if (b.size() != sizeof(_data)) {
throw std::runtime_error(fmt::format("Wrong token bytes size: expected {} but got {}", sizeof(_data), b.size()));
}
std::copy_n(b.begin(), sizeof(_data), reinterpret_cast<int8_t *>(&_data));
_data = net::ntoh(_data);
}
std::copy_n(b.begin(), sizeof(_data), reinterpret_cast<int8_t *>(&_data));
_data = net::ntoh(_data);
}
bool is_minimum() const {
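
The reworked constructors only validate and decode the byte payload for kind::key tokens. A Python sketch of that control flow follows; the kind names are plain strings chosen for illustration, and treating _data as a signed 64-bit integer is an assumption, since the field's declaration is not part of this hunk.

```python
KEY = "key"  # stand-in for token::kind::key
TOKEN_SIZE = 8  # assumption: _data is a 64-bit integer


def token_data(kind, raw):
    """Return the integer payload the constructor would store in _data."""
    if kind != KEY:
        # Non-key tokens carry no payload, so the bytes are ignored.
        return 0
    if len(raw) != TOKEN_SIZE:
        raise ValueError(
            f"Wrong token bytes size: expected {TOKEN_SIZE} but got {len(raw)}")
    # Network byte order is big-endian; signed=True assumes an int64_t field.
    return int.from_bytes(raw, "big", signed=True)
```

Before the change, the size check ran for every kind, so a non-key token built from an empty byte string would have thrown.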

View File

@@ -61,7 +61,15 @@ def sh_command(*args):
return out
def get_url(path):
return urllib.request.urlopen(path).read().decode('utf-8')
# If the server returns an error such as 403 or 500, urllib.request raises an exception that is not serializable.
# When the multiprocessing routines fail to serialize it, an ambiguous serialization exception is raised
# from get_json_from_url.
# To surface the real error, we catch it inside the worker process, convert it to a string and
# pass it back as part of the return value.
try:
return 0, urllib.request.urlopen(path).read().decode('utf-8')
except Exception as exc:
return 1, str(exc)
def get_json_from_url(path):
pool = mp.Pool(processes=1)
@@ -71,13 +79,16 @@ def get_json_from_url(path):
# to enforce a wallclock timeout.
result = pool.apply_async(get_url, args=(path,))
try:
retval = result.get(timeout=5)
status, retval = result.get(timeout=5)
except mp.TimeoutError as err:
pool.terminate()
pool.join()
raise
if status == 1:
raise RuntimeError(f'Failed to get "{path}" due to the following error: {retval}')
return json.loads(retval)
def get_api(path):
return get_json_from_url("http://" + api_address + path)
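
The status/value convention used by get_url() can be exercised without a network or a process pool. In this minimal sketch, call_reporting_errors(), unwrap() and failing_fetch() are hypothetical helper names, and a pickle round trip stands in for the transfer that multiprocessing performs between the worker and the parent.

```python
import pickle


def call_reporting_errors(fn, *args):
    """Run fn and return (0, result) on success or (1, error text) on failure."""
    try:
        return 0, fn(*args)
    except Exception as exc:
        # Only the message crosses the process boundary, never the exception.
        return 1, str(exc)


def unwrap(status, value, what):
    """Turn a (status, value) pair back into a result or a readable error."""
    if status == 1:
        raise RuntimeError(f'Failed to get "{what}" due to the following error: {value}')
    return value


def failing_fetch(path):  # hypothetical stand-in for urllib.request.urlopen
    raise OSError(f"HTTP Error 404: Not Found ({path})")


# The pair is plain data, so it survives the pickling that multiprocessing does.
status, value = pickle.loads(pickle.dumps(call_reporting_errors(failing_fetch, "/x")))
```

Because only an integer and a string are returned, the parent process always receives something it can deserialize, and the original error text ends up in the RuntimeError message.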

View File

@@ -31,5 +31,6 @@ if __name__ == '__main__':
sys.exit(1)
if is_systemd():
systemd_unit('scylla-fstrim.timer').unmask()
systemd_unit('scylla-fstrim.timer').enable()
if is_redhat_variant():
systemd_unit('fstrim.timer').disable()

View File

@@ -371,6 +371,9 @@ if __name__ == '__main__':
if not stat.S_ISBLK(os.stat(dsk).st_mode):
print('{} is not block device'.format(dsk))
continue
if dsk in selected:
print(f'{dsk} is already added')
continue
selected.append(dsk)
devices.remove(dsk)
disks = ','.join(selected)

View File

@@ -182,7 +182,7 @@ class aws_instance:
instance_size = self.instance_size()
if instance_class in ['c3', 'c4', 'd2', 'i2', 'r3']:
return 'ixgbevf'
if instance_class in ['c5', 'c5d', 'f1', 'g3', 'h1', 'i3', 'i3en', 'm5', 'm5d', 'p2', 'p3', 'r4', 'x1']:
if instance_class in ['a1', 'c5', 'c5d', 'f1', 'g3', 'g4', 'h1', 'i3', 'i3en', 'inf1', 'm5', 'm5a', 'm5ad', 'm5d', 'm5dn', 'm5n', 'm6g', 'p2', 'p3', 'r4', 'r5', 'r5a', 'r5ad', 'r5d', 'r5dn', 'r5n', 't3', 't3a', 'u-6tb1', 'u-9tb1', 'u-12tb1', 'u-18tb1', 'u-24tb1', 'x1', 'x1e', 'z1d']:
return 'ena'
if instance_class == 'm4':
if instance_size == '16xlarge':
@@ -481,8 +481,8 @@ def parse_scylla_dirs_with_default(conf='/etc/scylla/scylla.yaml'):
y['data_file_directories'] = [os.path.join(y['workdir'], 'data')]
for t in [ "commitlog", "hints", "view_hints", "saved_caches" ]:
key = "%s_directory" % t
if key not in y or not y[k]:
y[k] = os.path.join(y['workdir'], t)
if key not in y or not y[key]:
y[key] = os.path.join(y['workdir'], t)
return y
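
The k/key fix matters because the loop is what fills in the per-directory defaults. The standalone sketch below shows the intended behaviour; apply_directory_defaults() is a hypothetical name, and the data_file_directories condition is an assumption, since the hunk does not show the line that guards it.

```python
import os


def apply_directory_defaults(y):
    """Fill in unset directory options relative to y['workdir']."""
    y = dict(y)  # work on a copy so the caller's dict is left untouched
    # Assumption: the data directories default the same way as the others.
    if not y.get("data_file_directories"):
        y["data_file_directories"] = [os.path.join(y["workdir"], "data")]
    for t in ["commitlog", "hints", "view_hints", "saved_caches"]:
        key = "%s_directory" % t
        if key not in y or not y[key]:
            y[key] = os.path.join(y["workdir"], t)
    return y


conf = apply_directory_defaults({
    "workdir": "/var/lib/scylla",
    "hints_directory": "/mnt/hints",
    "commitlog_directory": None,
})
```

Explicitly configured directories are kept, while missing or empty ones fall back to a subdirectory of workdir.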

View File

@@ -5,8 +5,8 @@ MAINTAINER Avi Kivity <avi@cloudius-systems.com>
ENV container docker
# The SCYLLA_REPO_URL argument specifies the URL to the RPM repository this Docker image uses to install Scylla. The default value is Scylla's unstable RPM repository, which contains the daily build.
ARG SCYLLA_REPO_URL=http://downloads.scylladb.com/rpm/unstable/centos/master/latest/scylla.repo
ARG VERSION=666.development
ARG SCYLLA_REPO_URL=http://downloads.scylladb.com/rpm/unstable/centos/branch-4.0/latest/scylla.repo
ARG VERSION=4.0.*
ADD scylla_bashrc /scylla_bashrc

View File

@@ -21,10 +21,6 @@ DynamoDB API requests.
For example, "`--alternator-port=8000`" on the command line will run
Alternator on port 8000 - the traditional port used by DynamoDB.
Alternator uses Scylla's LWT feature, which is currently considered
experimental and needs to be seperately enabled as well, e.g. with the
"`--experimental=on`" option.
By default, Scylla listens on this port on all network interfaces.
To listen only on a specific interface, pass also an "`alternator-address`"
option.
@@ -55,9 +51,8 @@ Alternator's compatibility with DynamoDB, and will be updated as the work
progresses and compatibility continues to improve.
### API Server
* Transport: HTTP mostly supported, but small features like CRC header and
compression are still missing. HTTPS supported on top of HTTP, so small
features may still be missing.
* Transport: HTTP and HTTPS are mostly supported, but small features like CRC
header and compression are still missing.
* Authorization (verifying the originator of the request): implemented
on top of system\_auth.roles table. The secret key used for authorization
is the salted\_hash column from the roles table, selected with:
@@ -65,20 +60,19 @@ progresses and compatibility continues to improve.
By default, authorization is not enforced at all. It can be turned on
by providing an entry in Scylla configuration:
alternator\_enforce\_authorization: true
* DNS server for load balancing: Not yet supported. Client needs to pick
one of the live Scylla nodes and send a request to it.
* Load balancing: Not a part of Alternator. One should use an external load
balancer or DNS server to balance the requests between the live Scylla
nodes. We plan to publish a reference example soon.
### Table Operations
* CreateTable: Supported. Note our implementation is synchronous.
* CreateTable and DeleteTable: Supported. Note our implementation is synchronous.
* DescribeTable: Partial implementation. Missing creation date and size estimate.
* UpdateTable: Not supported.
* DescribeTable: Partial implementation. Missing creation date and size esitmate.
* DeleteTable: Supported. Note our implementation is synchronous.
* ListTables: Supported.
### Item Operations
* GetItem: Support almost complete except that projection expressions can
only ask for top-level attributes.
* PutItem: Support almost complete except that condition expressions can
only refer to top-level attributes.
pre-put content) not yet supported.
* UpdateItem: Nested documents are supported but updates to nested attributes
are not (e.g., `SET a.b[3].c=val`), and neither are nested attributes in
condition expressions.
@@ -90,15 +84,14 @@ progresses and compatibility continues to improve.
* BatchWriteItem: Supported. Doesn't limit the number of items (DynamoDB
limits to 25) or size of items (400 KB) or total request size (16 MB).
### Scans
* Scan: As usual, projection expressions only support top-level attributes.
Filter expressions (to filter some of the items) partially supported:
The ScanFilter syntax is supported but FilterExpression is not yet, and
only equality operator is supported so far.
The "Select" options which allows to count items instead of returning them
is not yet supported. Parallel scan is not yet supported.
* Query: Same issues as Scan above. Additionally, missing support for
KeyConditionExpression (an alternative syntax replacing the older
KeyConditions parameter which we do support).
Scan and Query are mostly supported, with the following limitations:
* As above, projection expressions only support top-level attributes.
* Filter expressions (to filter some of the items) are only partially
supported: The ScanFilter syntax currently only supports the equality
operator, and the FilterExpression syntax is not yet supported at all.
* The "Select" option, which allows counting items instead of returning them,
is not yet supported.
* Parallel scan is not yet supported.
### Secondary Indexes
Global Secondary Indexes (GSI) and Local Secondary Indexes (LSI) are
implemented, with the following limitations:
@@ -116,24 +109,28 @@ implemented, with the following limitations:
Writes are done in LOCAL_QUORUM and reads in LOCAL_ONE (eventual consistency)
or LOCAL_QUORUM (strong consistency).
### Global Tables
* Not yet supported: CreateGlobalTable, UpdateGlobalTable,
DescribeGlobalTable, ListGlobalTables, UpdateGlobalTableSettings,
DescribeGlobalTableSettings. Implementation will use Scylla's multi-DC
features.
* Currently, *all* Alternator tables are created as "Global Tables", i.e., can
be accessed from all of Scylla's DCs.
* We do not yet support the DynamoDB API calls to make some of the tables
global and others local to a particular DC: CreateGlobalTable,
UpdateGlobalTable, DescribeGlobalTable, ListGlobalTables,
UpdateGlobalTableSettings, DescribeGlobalTableSettings, and UpdateTable.
### Backup and Restore
* On-demand backup: Not yet supported: CreateBackup, DescribeBackup,
DeleteBackup, ListBackups, RestoreTableFromBackup. Implementation will
use Scylla's snapshots
* On-demand backup: the DynamoDB APIs are not yet supported: CreateBackup,
DescribeBackup, DeleteBackup, ListBackups, RestoreTableFromBackup.
Users can use Scylla's [snapshots](https://docs.scylladb.com/operating-scylla/procedures/backup-restore/)
or [Scylla Manager](https://docs.scylladb.com/operating-scylla/manager/2.0/backup/).
* Continuous backup: Not yet supported: UpdateContinuousBackups,
DescribeContinuousBackups, RestoreTableToPointInTime.
### Transations
### Transactions
* Not yet supported: TransactWriteItems, TransactGetItems.
Note that this is a new DynamoDB feature - these are more powerful than
the old conditional updates which were "lightweight transactions".
### Streams (CDC)
* Not yet supported
### Streams
* Scylla has experimental support for [CDC](https://docs.scylladb.com/using-scylla/cdc/)
(change data capture), but the "DynamoDB Streams" API is not yet supported.
### Encryption at rest
* Supported natively by Scylla, but needs to be enabled by default.
* Supported by Scylla Enterprise (not in open-source). Needs to be enabled.
### ARNs and tags
* ARN is generated for every alternator table
* Tagging can be used with the help of the following requests:
@@ -166,7 +163,9 @@ implemented, with the following limitations:
* Not required. Scylla cache is rather advanced and there is no need to place
a cache in front of the database: https://www.scylladb.com/2017/07/31/database-caches-not-good/
### Metrics
* Several metrics are available through the Grafana/Promethues stack: https://docs.scylladb.com/operating-scylla/monitoring/ It is different than the expectations of the current DynamoDB implementation. However, our
* Several metrics are available through the Grafana/Prometheus stack:
https://docs.scylladb.com/operating-scylla/monitoring/
Those are different from the current DynamoDB metrics, but Scylla's
monitoring is rather advanced and provides more insight into the internals.
## Alternator design and implementation
@@ -229,8 +228,3 @@ one DynamoDB feature which we cannot support safely: we cannot modify
a non-top-level attribute (e.g., a.b[3].c) directly without RMW. We plan
to fix this in a future version by rethinking the data model we use for
attributes, or rethinking our implementation of RMW (as explained above).
For reasons explained above, the data model used by Alternator to store
data on disk is still in a state of flux, and may change in future versions.
Therefore, in this early stage it is not recommended to store important
production data using Alternator.

View File

@@ -10,12 +10,10 @@ This section will guide you through the steps for setting up the cluster:
nightly image by running: `docker pull scylladb/scylla-nightly:latest`
2. Follow the steps in the [Scylla official download web page](https://www.scylladb.com/download/open-source/#docker),
and add to every "docker run" command: `-p 8000:8000` before the image name
and `--alternator-port=8000 --experimental 1` at the end. The
"alternator-port" option specifies on which port Scylla will listen for
the (unencrypted) DynamoDB API, and "--experimental 1" is required to
enable the experimental LWT feature which Alternator uses.
and `--alternator-port=8000` at the end. The "alternator-port" option
specifies on which port Scylla will listen for the (unencrypted) DynamoDB API.
For example,
`docker run --name scylla -d -p 8000:8000 scylladb/scylla-nightly:latest --alternator-port=8000 --experimental 1`
`docker run --name scylla -d -p 8000:8000 scylladb/scylla-nightly:latest --alternator-port=8000`
## Testing Scylla's DynamoDB API support:
### Running AWS Tic Tac Toe demo app to test the cluster:

View File

@@ -76,6 +76,9 @@ Scylla with issue #4139 fixed)
bit 4: CorrectEmptyCounters (if set, indicates the sstable was generated by
Scylla with issue #4363 fixed)
bit 5: CorrectUDTsInCollections (if set, indicates that the sstable was generated
by Scylla with issue #6130 fixed)
## extension_attributes subcomponent
extension_attributes = extension_attribute_count extension_attribute*

View File

@@ -56,22 +56,22 @@ public:
/**
* The unrecognized entity.
*/
::shared_ptr<cql3::column_identifier> entity;
cql3::column_identifier entity;
/**
* The entity relation.
* The entity relation in a stringified form.
*/
cql3::relation_ptr relation;
sstring relation_str;
/**
* Creates a new <code>UnrecognizedEntityException</code>.
* @param entity the unrecognized entity
* @param relation the entity relation
* @param relation_str the entity relation string
*/
unrecognized_entity_exception(::shared_ptr<cql3::column_identifier> entity, cql3::relation_ptr relation)
: invalid_request_exception(format("Undefined name {} in where clause ('{}')", *entity, relation->to_string()))
, entity(entity)
, relation(relation)
unrecognized_entity_exception(cql3::column_identifier entity, sstring relation_str)
: invalid_request_exception(format("Undefined name {} in where clause ('{}')", entity, relation_str))
, entity(std::move(entity))
, relation_str(std::move(relation_str))
{ }
};

View File

@@ -110,10 +110,6 @@ feature_config feature_config_from_db_config(db::config& cfg) {
fcfg.enable_cdc = true;
}
if (cfg.check_experimental(db::experimental_features_t::LWT)) {
fcfg.enable_lwt = true;
}
return fcfg;
}
@@ -178,9 +174,7 @@ std::set<std::string_view> feature_service::known_feature_set() {
if (_config.enable_cdc) {
features.insert(gms::features::CDC);
}
if (_config.enable_lwt) {
features.insert(gms::features::LWT);
}
features.insert(gms::features::LWT);
for (const sstring& s : _config.disabled_features) {
features.erase(s);

View File

@@ -41,7 +41,6 @@ struct feature_config {
bool enable_sstables_mc_format = false;
bool enable_user_defined_functions = false;
bool enable_cdc = false;
bool enable_lwt = false;
std::set<sstring> disabled_features;
feature_config();
};

View File

@@ -632,7 +632,7 @@ void gossiper::remove_endpoint(inet_address endpoint) {
// We cannot run on_remove callbacks here because on_remove in
// storage_service might take the gossiper::timer_callback_lock
(void)seastar::async([this, endpoint] {
_subscribers.for_each([endpoint] (auto& subscriber) {
_subscribers.for_each([endpoint] (shared_ptr<i_endpoint_state_change_subscriber> subscriber) {
subscriber->on_remove(endpoint);
});
}).handle_exception([] (auto ep) {
@@ -1464,7 +1464,7 @@ void gossiper::real_mark_alive(inet_address addr, endpoint_state& local_state) {
logger.info("InetAddress {} is now UP, status = {}", addr, status);
}
_subscribers.for_each([addr, local_state] (auto& subscriber) {
_subscribers.for_each([addr, local_state] (shared_ptr<i_endpoint_state_change_subscriber> subscriber) {
subscriber->on_alive(addr, local_state);
logger.trace("Notified {}", subscriber.get());
});
@@ -1478,7 +1478,7 @@ void gossiper::mark_dead(inet_address addr, endpoint_state& local_state) {
_live_endpoints_just_added.remove(addr);
_unreachable_endpoints[addr] = now();
logger.info("InetAddress {} is now DOWN, status = {}", addr, get_gossip_status(local_state));
_subscribers.for_each([addr, local_state] (auto& subscriber) {
_subscribers.for_each([addr, local_state] (shared_ptr<i_endpoint_state_change_subscriber> subscriber) {
subscriber->on_dead(addr, local_state);
logger.trace("Notified {}", subscriber.get());
});
@@ -1510,7 +1510,7 @@ void gossiper::handle_major_state_change(inet_address ep, const endpoint_state&
if (eps_old) {
// the node restarted: it is up to the subscriber to take whatever action is necessary
_subscribers.for_each([ep, eps_old] (auto& subscriber) {
_subscribers.for_each([ep, eps_old] (shared_ptr<i_endpoint_state_change_subscriber> subscriber) {
subscriber->on_restart(ep, *eps_old);
});
}
@@ -1525,7 +1525,7 @@ void gossiper::handle_major_state_change(inet_address ep, const endpoint_state&
auto* eps_new = get_endpoint_state_for_endpoint_ptr(ep);
if (eps_new) {
_subscribers.for_each([ep, eps_new] (auto& subscriber) {
_subscribers.for_each([ep, eps_new] (shared_ptr<i_endpoint_state_change_subscriber> subscriber) {
subscriber->on_join(ep, *eps_new);
});
}
@@ -1618,14 +1618,14 @@ void gossiper::apply_new_states(inet_address addr, endpoint_state& local_state,
// Runs inside seastar::async context
void gossiper::do_before_change_notifications(inet_address addr, const endpoint_state& ep_state, const application_state& ap_state, const versioned_value& new_value) {
_subscribers.for_each([addr, ep_state, ap_state, new_value] (auto& subscriber) {
_subscribers.for_each([addr, ep_state, ap_state, new_value] (shared_ptr<i_endpoint_state_change_subscriber> subscriber) {
subscriber->before_change(addr, ep_state, ap_state, new_value);
});
}
// Runs inside seastar::async context
void gossiper::do_on_change_notifications(inet_address addr, const application_state& state, const versioned_value& value) {
_subscribers.for_each([addr, state, value] (auto& subscriber) {
_subscribers.for_each([addr, state, value] (shared_ptr<i_endpoint_state_change_subscriber> subscriber) {
subscriber->on_change(addr, state, value);
});
}
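
Note on the `for_each` hunks above: the diff does not say why the callbacks switched from `auto&` to a by-value `shared_ptr`. The sketch below only illustrates the general C++ difference between the two parameter forms. `subscriber`, `subscriber_list` and the three helper functions are hypothetical names, not Scylla code. A by-value `shared_ptr` parameter is an extra owner for the duration of the call, so the object stays alive even if the list drops its own reference in the meantime.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical stand-in for i_endpoint_state_change_subscriber.
struct subscriber {
    int notified = 0;
};

using subscriber_list = std::vector<std::shared_ptr<subscriber>>;

// Callback takes a reference: no additional owner is created.
long use_count_seen_by_ref(subscriber_list& subs) {
    auto cb = [] (auto& s) { return s.use_count(); };
    return cb(subs.front());
}

// Callback takes a copy: the parameter is an extra owner during the call.
long use_count_seen_by_value(subscriber_list& subs) {
    auto cb = [] (std::shared_ptr<subscriber> s) { return s.use_count(); };
    return cb(subs.front());
}

// The list is emptied while the callback runs; the by-value copy
// keeps the subscriber alive until the callback returns.
int notify_while_unregistering(subscriber_list& subs) {
    auto cb = [&subs] (std::shared_ptr<subscriber> s) {
        subs.clear();
        return ++s->notified;
    };
    return cb(subs.front());
}
```

With the reference form, the same `subs.clear()` inside the callback would leave `s` dangling.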
@@ -1725,8 +1725,12 @@ future<> gossiper::start_gossiping(int generation_nbr, std::map<application_stat
// message on all cpus and forward them to cpu0 to process.
return get_gossiper().invoke_on_all([do_bind] (gossiper& g) {
g.init_messaging_service_handler(do_bind);
}).then([this, generation_nbr, preload_local_states] {
}).then([this, generation_nbr, preload_local_states] () mutable {
build_seeds_list();
if (_cfg.force_gossip_generation() > 0) {
generation_nbr = _cfg.force_gossip_generation();
logger.warn("Use the generation number provided by user: generation = {}", generation_nbr);
}
endpoint_state& local_state = endpoint_state_map[get_broadcast_address()];
local_state.set_heart_beat_state_and_update_timestamp(heart_beat_state(generation_nbr));
local_state.mark_alive();
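
Note on the `() mutable` change above: a lambda's by-value captures are const inside the body by default, so assigning to the captured `generation_nbr` only compiles once the lambda is marked `mutable`. A minimal sketch, assuming a hypothetical `effective_generation` helper whose `forced_generation` parameter stands in for `_cfg.force_gossip_generation()`:

```cpp
#include <cassert>

// Hypothetical helper: returns the generation that would be used.
// forced_generation <= 0 means "no override", mirroring the `> 0` check in the hunk.
int effective_generation(int generation_nbr, int forced_generation) {
    // Both values are captured by value. Without `mutable`, the
    // assignment to generation_nbr below would not compile.
    auto pick = [generation_nbr, forced_generation] () mutable {
        if (forced_generation > 0) {
            generation_nbr = forced_generation;
        }
        return generation_nbr;
    };
    return pick();
}
```

The assignment changes only the lambda's private copy; the caller's variable is untouched.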


@@ -591,6 +591,7 @@ public:
std::map<sstring, sstring> get_simple_states();
int get_down_endpoint_count();
int get_up_endpoint_count();
int get_all_endpoint_count();
sstring get_endpoint_state(sstring address);
failure_detector& fd() { return _fd; }
};
@@ -637,6 +638,12 @@ inline future<int> get_up_endpoint_count() {
});
}
inline future<int> get_all_endpoint_count() {
return smp::submit_to(0, [] {
return static_cast<int>(get_local_gossiper().get_endpoint_states().size());
});
}
inline future<> set_phi_convict_threshold(double phi) {
return smp::submit_to(0, [phi] {
get_local_gossiper().fd().set_phi_convict_threshold(phi);


@@ -69,7 +69,8 @@ std::ostream& gms::operator<<(std::ostream& os, const inet_address& x) {
auto&& bytes = x.bytes();
auto i = 0u;
auto acc = 0u;
for (auto b : bytes) {
// extra paranoid sign extension evasion - #5808
for (uint8_t b : bytes) {
acc <<= 8;
acc |= b;
if ((++i & 1) == 0) {
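
Note on the `uint8_t b` change above: if the container's element type is a signed 8-bit type (an assumption here, since the diff does not show the type returned by `bytes()`), `for (auto b : bytes)` makes `b` signed. A byte such as `0xfe` is then sign-extended when it is OR-ed into the unsigned accumulator. A minimal sketch with the hypothetical helpers `fold_auto` and `fold_u8`:

```cpp
#include <array>
#include <cassert>
#include <climits>
#include <cstdint>

using two_bytes = std::array<signed char, 2>;

// `auto` deduces signed char: -2 is promoted to int -2, then converted
// to unsigned, which sets all the high bits of the accumulator.
unsigned fold_auto(const two_bytes& bytes) {
    unsigned acc = 0;
    for (auto b : bytes) {
        acc <<= 8;
        acc |= b;
    }
    return acc;
}

// Forcing uint8_t converts -2 to 254 first, so only the low 8 bits are OR-ed in.
unsigned fold_u8(const two_bytes& bytes) {
    unsigned acc = 0;
    for (uint8_t b : bytes) {
        acc <<= 8;
        acc |= b;
    }
    return acc;
}
```

For the bytes `{0x20, -2}` (the second being `0xfe` in two's complement), `fold_u8` yields `0x20fe` while `fold_auto` yields `UINT_MAX - 1`.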


@@ -76,6 +76,8 @@ fedora_packages=(
python3-psutil
python3-cassandra-driver
python3-colorama
python3-boto3
python3-pytest
dnf-utils
pigz
net-tools

lua.cc

@@ -264,14 +264,12 @@ static auto visit_lua_raw_value(lua_State* l, int index, Func&& f) {
template <typename Func>
static auto visit_decimal(const big_decimal &v, Func&& f) {
boost::multiprecision::cpp_int ten(10);
const auto& dividend = v.unscaled_value();
auto divisor = boost::multiprecision::pow(ten, v.scale());
boost::multiprecision::cpp_rational r = v.as_rational();
const boost::multiprecision::cpp_int& dividend = numerator(r);
const boost::multiprecision::cpp_int& divisor = denominator(r);
if (dividend % divisor == 0) {
return f(utils::multiprecision_int(boost::multiprecision::cpp_int(dividend/divisor)));
return f(utils::multiprecision_int(dividend/divisor));
}
boost::multiprecision::cpp_rational r = dividend;
r /= divisor;
return f(r.convert_to<double>());
}

main.cc

@@ -546,9 +546,13 @@ int main(int ac, char** av) {
gms::feature_config fcfg = gms::feature_config_from_db_config(*cfg);
feature_service.start(fcfg).get();
auto stop_feature_service = defer_verbose_shutdown("feature service", [&feature_service] {
feature_service.stop().get();
});
// FIXME storage_proxy holds a reference on it and is not yet stopped.
// also the proxy leaves range_slice_read_executor-s hanging around
// and willing to find out if the cluster_supports_digest_multipartition_reads
//
//auto stop_feature_service = defer_verbose_shutdown("feature service", [&feature_service] {
// feature_service.stop().get();
//});
schema::set_default_partitioner(cfg->partitioner(), cfg->murmur3_partitioner_ignore_msb_bits());
auto make_sched_group = [&] (sstring name, unsigned shares) {
@@ -662,9 +666,17 @@ int main(int ac, char** av) {
supervisor::notify("starting tokens manager");
token_metadata.start().get();
auto stop_token_metadata = defer_verbose_shutdown("token metadata", [ &token_metadata ] {
token_metadata.stop().get();
});
// storage_proxy holds a reference on it and is not yet stopped.
// what's worse is that the calltrace
// storage_proxy::do_query
// ::query_partition_key_range
// ::query_partition_key_range_concurrent
// leaves unwaited futures on the reactor and once it gets there
// the token_metadata instance is accessed and ...
//
//auto stop_token_metadata = defer_verbose_shutdown("token metadata", [ &token_metadata ] {
// token_metadata.stop().get();
//});
supervisor::notify("starting migration manager notifier");
mm_notifier.start().get();
@@ -1071,9 +1083,6 @@ int main(int ac, char** av) {
static sharded<alternator::executor> alternator_executor;
static sharded<alternator::server> alternator_server;
if (!cfg->check_experimental(db::experimental_features_t::LWT)) {
throw std::runtime_error("Alternator enabled, but needs experimental LWT feature which wasn't enabled");
}
net::inet_address addr;
try {
addr = net::dns::get_host_by_name(cfg->alternator_address(), family).get0().addr_list.front();


@@ -452,6 +452,7 @@ static constexpr unsigned do_get_rpc_client_idx(messaging_verb verb) {
case messaging_verb::PAXOS_PREPARE:
case messaging_verb::PAXOS_ACCEPT:
case messaging_verb::PAXOS_LEARN:
case messaging_verb::PAXOS_PRUNE:
return 0;
// GET_SCHEMA_VERSION is sent from read/mutate verbs so should be
// sent on a different connection to avoid potential deadlocks
@@ -1179,14 +1180,14 @@ future<> messaging_service::send_repair_put_row_diff(msg_addr id, uint32_t repai
}
// Wrapper for REPAIR_ROW_LEVEL_START
void messaging_service::register_repair_row_level_start(std::function<future<> (const rpc::client_info& cinfo, uint32_t repair_meta_id, sstring keyspace_name, sstring cf_name, dht::token_range range, row_level_diff_detect_algorithm algo, uint64_t max_row_buf_size, uint64_t seed, unsigned remote_shard, unsigned remote_shard_count, unsigned remote_ignore_msb, sstring remote_partitioner_name, table_schema_version schema_version)>&& func) {
void messaging_service::register_repair_row_level_start(std::function<future<> (const rpc::client_info& cinfo, uint32_t repair_meta_id, sstring keyspace_name, sstring cf_name, dht::token_range range, row_level_diff_detect_algorithm algo, uint64_t max_row_buf_size, uint64_t seed, unsigned remote_shard, unsigned remote_shard_count, unsigned remote_ignore_msb, sstring remote_partitioner_name, table_schema_version schema_version, rpc::optional<streaming::stream_reason> reason)>&& func) {
register_handler(this, messaging_verb::REPAIR_ROW_LEVEL_START, std::move(func));
}
future<> messaging_service::unregister_repair_row_level_start() {
return unregister_handler(messaging_verb::REPAIR_ROW_LEVEL_START);
}
future<> messaging_service::send_repair_row_level_start(msg_addr id, uint32_t repair_meta_id, sstring keyspace_name, sstring cf_name, dht::token_range range, row_level_diff_detect_algorithm algo, uint64_t max_row_buf_size, uint64_t seed, unsigned remote_shard, unsigned remote_shard_count, unsigned remote_ignore_msb, sstring remote_partitioner_name, table_schema_version schema_version) {
return send_message<void>(this, messaging_verb::REPAIR_ROW_LEVEL_START, std::move(id), repair_meta_id, std::move(keyspace_name), std::move(cf_name), std::move(range), algo, max_row_buf_size, seed, remote_shard, remote_shard_count, remote_ignore_msb, std::move(remote_partitioner_name), std::move(schema_version));
future<> messaging_service::send_repair_row_level_start(msg_addr id, uint32_t repair_meta_id, sstring keyspace_name, sstring cf_name, dht::token_range range, row_level_diff_detect_algorithm algo, uint64_t max_row_buf_size, uint64_t seed, unsigned remote_shard, unsigned remote_shard_count, unsigned remote_ignore_msb, sstring remote_partitioner_name, table_schema_version schema_version, streaming::stream_reason reason) {
return send_message<void>(this, messaging_verb::REPAIR_ROW_LEVEL_START, std::move(id), repair_meta_id, std::move(keyspace_name), std::move(cf_name), std::move(range), algo, max_row_buf_size, seed, remote_shard, remote_shard_count, remote_ignore_msb, std::move(remote_partitioner_name), std::move(schema_version), reason);
}
// Wrapper for REPAIR_ROW_LEVEL_STOP
@@ -1281,6 +1282,19 @@ future<> messaging_service::send_paxos_learn(msg_addr id, clock_type::time_point
std::move(reply_to), shard, std::move(response_id), std::move(trace_info));
}
void messaging_service::register_paxos_prune(std::function<future<rpc::no_wait_type>(
const rpc::client_info&, rpc::opt_time_point, UUID schema_id, partition_key key, utils::UUID ballot, std::optional<tracing::trace_info>)>&& func) {
register_handler(this, messaging_verb::PAXOS_PRUNE, std::move(func));
}
future<> messaging_service::unregister_paxos_prune() {
return unregister_handler(netw::messaging_verb::PAXOS_PRUNE);
}
future<>
messaging_service::send_paxos_prune(gms::inet_address peer, clock_type::time_point timeout, UUID schema_id,
const partition_key& key, utils::UUID ballot, std::optional<tracing::trace_info> trace_info) {
return send_message_oneway_timeout(this, timeout, messaging_verb::PAXOS_PRUNE, netw::msg_addr(peer), schema_id, key, ballot, std::move(trace_info));
}
void messaging_service::register_hint_mutation(std::function<future<rpc::no_wait_type> (const rpc::client_info&, rpc::opt_time_point, frozen_mutation fm, std::vector<inet_address> forward,
inet_address reply_to, unsigned shard, response_id_type response_id, rpc::optional<std::optional<tracing::trace_info>> trace_info)>&& func) {
register_handler(this, netw::messaging_verb::HINT_MUTATION, std::move(func));


@@ -139,7 +139,8 @@ enum class messaging_verb : int32_t {
PAXOS_ACCEPT = 40,
PAXOS_LEARN = 41,
HINT_MUTATION = 42,
LAST = 43,
PAXOS_PRUNE = 43,
LAST = 44,
};
} // namespace netw
@@ -341,9 +342,9 @@ public:
future<> send_repair_put_row_diff(msg_addr id, uint32_t repair_meta_id, repair_rows_on_wire row_diff);
// Wrapper for REPAIR_ROW_LEVEL_START
void register_repair_row_level_start(std::function<future<> (const rpc::client_info& cinfo, uint32_t repair_meta_id, sstring keyspace_name, sstring cf_name, dht::token_range range, row_level_diff_detect_algorithm algo, uint64_t max_row_buf_size, uint64_t seed, unsigned remote_shard, unsigned remote_shard_count, unsigned remote_ignore_msb, sstring remote_partitioner_name, table_schema_version schema_version)>&& func);
void register_repair_row_level_start(std::function<future<> (const rpc::client_info& cinfo, uint32_t repair_meta_id, sstring keyspace_name, sstring cf_name, dht::token_range range, row_level_diff_detect_algorithm algo, uint64_t max_row_buf_size, uint64_t seed, unsigned remote_shard, unsigned remote_shard_count, unsigned remote_ignore_msb, sstring remote_partitioner_name, table_schema_version schema_version, rpc::optional<streaming::stream_reason> reason)>&& func);
future<> unregister_repair_row_level_start();
future<> send_repair_row_level_start(msg_addr id, uint32_t repair_meta_id, sstring keyspace_name, sstring cf_name, dht::token_range range, row_level_diff_detect_algorithm algo, uint64_t max_row_buf_size, uint64_t seed, unsigned remote_shard, unsigned remote_shard_count, unsigned remote_ignore_msb, sstring remote_partitioner_name, table_schema_version schema_version);
future<> send_repair_row_level_start(msg_addr id, uint32_t repair_meta_id, sstring keyspace_name, sstring cf_name, dht::token_range range, row_level_diff_detect_algorithm algo, uint64_t max_row_buf_size, uint64_t seed, unsigned remote_shard, unsigned remote_shard_count, unsigned remote_ignore_msb, sstring remote_partitioner_name, table_schema_version schema_version, streaming::stream_reason reason);
// Wrapper for REPAIR_ROW_LEVEL_STOP
void register_repair_row_level_stop(std::function<future<> (const rpc::client_info& cinfo, uint32_t repair_meta_id, sstring keyspace_name, sstring cf_name, dht::token_range range)>&& func);
@@ -493,6 +494,14 @@ public:
std::vector<inet_address> forward, inet_address reply_to, unsigned shard, response_id_type response_id,
std::optional<tracing::trace_info> trace_info = std::nullopt);
void register_paxos_prune(std::function<future<rpc::no_wait_type>(const rpc::client_info&, rpc::opt_time_point, UUID schema_id, partition_key key,
utils::UUID ballot, std::optional<tracing::trace_info>)>&& func);
future<> unregister_paxos_prune();
future<> send_paxos_prune(gms::inet_address peer, clock_type::time_point timeout, UUID schema_id, const partition_key& key,
utils::UUID ballot, std::optional<tracing::trace_info> trace_info);
void register_hint_mutation(std::function<future<rpc::no_wait_type> (const rpc::client_info&, rpc::opt_time_point, frozen_mutation fm, std::vector<inet_address> forward,
inet_address reply_to, unsigned shard, response_id_type response_id, rpc::optional<std::optional<tracing::trace_info>> trace_info)>&& func);
future<> unregister_hint_mutation();


@@ -2505,7 +2505,8 @@ mutation_partition::fully_discontinuous(const schema& s, const position_range& r
future<mutation_opt> counter_write_query(schema_ptr s, const mutation_source& source,
const dht::decorated_key& dk,
const query::partition_slice& slice,
tracing::trace_state_ptr trace_ptr)
tracing::trace_state_ptr trace_ptr,
db::timeout_clock::time_point timeout)
{
struct range_and_reader {
dht::partition_range range;
@@ -2530,7 +2531,7 @@ future<mutation_opt> counter_write_query(schema_ptr s, const mutation_source& so
auto cwqrb = counter_write_query_result_builder(*s);
auto cfq = make_stable_flattened_mutations_consumer<compact_for_query<emit_only_live_rows::yes, counter_write_query_result_builder>>(
*s, gc_clock::now(), slice, query::max_rows, query::max_rows, std::move(cwqrb));
auto f = r_a_r->reader.consume(std::move(cfq), db::no_timeout);
auto f = r_a_r->reader.consume(std::move(cfq), timeout);
return f.finally([r_a_r = std::move(r_a_r)] { });
}
@@ -2605,7 +2606,7 @@ void mutation_cleaner_impl::start_worker() {
stop_iteration mutation_cleaner_impl::merge_some(partition_snapshot& snp) noexcept {
auto&& region = snp.region();
return with_allocator(region.allocator(), [&] {
return with_linearized_managed_bytes([&] {
{
// Allocating sections require the region to be reclaimable
// which means that they cannot be nested.
// It is, however, possible, that if the snapshot is taken
@@ -2617,13 +2618,15 @@ stop_iteration mutation_cleaner_impl::merge_some(partition_snapshot& snp) noexce
}
try {
return _worker_state->alloc_section(region, [&] {
return with_linearized_managed_bytes([&] {
return snp.merge_partition_versions(_app_stats);
});
});
} catch (...) {
// Merging failed, give up as there is no guarantee of forward progress.
return stop_iteration::yes;
}
});
}
});
}


@@ -206,5 +206,6 @@ public:
future<mutation_opt> counter_write_query(schema_ptr, const mutation_source&,
const dht::decorated_key& dk,
const query::partition_slice& slice,
tracing::trace_state_ptr trace_ptr);
tracing::trace_state_ptr trace_ptr,
db::timeout_clock::time_point timeout);


@@ -173,6 +173,13 @@ future<> multishard_writer::distribute_mutation_fragments() {
return handle_end_of_stream();
}
});
}).handle_exception([this] (std::exception_ptr ep) {
for (auto& q : _queue_reader_handles) {
if (q) {
q->abort(ep);
}
}
return make_exception_future<>(std::move(ep));
});
}


@@ -12,7 +12,11 @@
# At the end of the build we check that the build-id is indeed in the
# first page. At install time we check that patchelf doesn't modify
# the program headers.
# gdb has a SO_NAME_MAX_PATH_SIZE of 512, so limit the path size to
# that. The 512 includes the null at the end, hence the 511 below.
ORIGINAL_DYNAMIC_LINKER=$(gcc -### /dev/null -o t 2>&1 | perl -n -e '/-dynamic-linker ([^ ]*) / && print $1')
DYNAMIC_LINKER=$(printf "%2000s$ORIGINAL_DYNAMIC_LINKER" | sed 's| |/|g')
DYNAMIC_LINKER=$(printf "%511s$ORIGINAL_DYNAMIC_LINKER" | sed 's| |/|g')
echo $DYNAMIC_LINKER


@@ -672,7 +672,8 @@ repair_info::repair_info(seastar::sharded<database>& db_,
const std::vector<sstring>& cfs_,
int id_,
const std::vector<sstring>& data_centers_,
const std::vector<sstring>& hosts_)
const std::vector<sstring>& hosts_,
streaming::stream_reason reason_)
: db(db_)
, partitioner(get_partitioner_for_tables(db_, keyspace_, cfs_))
, keyspace(keyspace_)
@@ -682,6 +683,7 @@ repair_info::repair_info(seastar::sharded<database>& db_,
, shard(engine().cpu_id())
, data_centers(data_centers_)
, hosts(hosts_)
, reason(reason_)
, _row_level_repair(db.local().features().cluster_supports_row_level_repair()) {
}
@@ -1462,7 +1464,7 @@ static int do_repair_start(seastar::sharded<database>& db, sstring keyspace,
data_centers = options.data_centers, hosts = options.hosts] (database& localdb) mutable {
auto ri = make_lw_shared<repair_info>(db,
std::move(keyspace), std::move(ranges), std::move(cfs),
id, std::move(data_centers), std::move(hosts));
id, std::move(data_centers), std::move(hosts), streaming::stream_reason::repair);
return repair_ranges(ri);
});
repair_results.push_back(std::move(f));
@@ -1524,14 +1526,15 @@ future<> repair_abort_all(seastar::sharded<database>& db) {
future<> sync_data_using_repair(seastar::sharded<database>& db,
sstring keyspace,
dht::token_range_vector ranges,
std::unordered_map<dht::token_range, repair_neighbors> neighbors) {
std::unordered_map<dht::token_range, repair_neighbors> neighbors,
streaming::stream_reason reason) {
if (ranges.empty()) {
return make_ready_future<>();
}
return smp::submit_to(0, [&db, keyspace = std::move(keyspace), ranges = std::move(ranges), neighbors = std::move(neighbors)] () mutable {
return smp::submit_to(0, [&db, keyspace = std::move(keyspace), ranges = std::move(ranges), neighbors = std::move(neighbors), reason] () mutable {
int id = repair_tracker().next_repair_command();
rlogger.info("repair id {} to sync data for keyspace={}, status=started", id, keyspace);
return repair_tracker().run(id, [id, &db, keyspace, ranges = std::move(ranges), neighbors = std::move(neighbors)] () mutable {
return repair_tracker().run(id, [id, &db, keyspace, ranges = std::move(ranges), neighbors = std::move(neighbors), reason] () mutable {
auto cfs = list_column_families(db.local(), keyspace);
if (cfs.empty()) {
rlogger.warn("repair id {} to sync data for keyspace={}, no table in this keyspace", id, keyspace);
@@ -1540,12 +1543,12 @@ future<> sync_data_using_repair(seastar::sharded<database>& db,
std::vector<future<>> repair_results;
repair_results.reserve(smp::count);
for (auto shard : boost::irange(unsigned(0), smp::count)) {
auto f = db.invoke_on(shard, [keyspace, cfs, id, ranges, neighbors] (database& localdb) mutable {
auto f = db.invoke_on(shard, [keyspace, cfs, id, ranges, neighbors, reason] (database& localdb) mutable {
auto data_centers = std::vector<sstring>();
auto hosts = std::vector<sstring>();
auto ri = make_lw_shared<repair_info>(service::get_local_storage_service().db(),
std::move(keyspace), std::move(ranges), std::move(cfs),
id, std::move(data_centers), std::move(hosts));
id, std::move(data_centers), std::move(hosts), reason);
ri->neighbors = std::move(neighbors);
return repair_ranges(ri);
});
@@ -1584,6 +1587,7 @@ future<> bootstrap_with_repair(seastar::sharded<database>& db, locator::token_me
auto keyspaces = db.local().get_non_system_keyspaces();
rlogger.info("bootstrap_with_repair: started with keyspaces={}", keyspaces);
auto myip = utils::fb_utilities::get_broadcast_address();
auto reason = streaming::stream_reason::bootstrap;
for (auto& keyspace_name : keyspaces) {
if (!db.local().has_keyspace(keyspace_name)) {
rlogger.info("bootstrap_with_repair: keyspace={} does not exist any more, ignoring it", keyspace_name);
@@ -1716,7 +1720,7 @@ future<> bootstrap_with_repair(seastar::sharded<database>& db, locator::token_me
}
}
auto nr_ranges = desired_ranges.size();
sync_data_using_repair(db, keyspace_name, std::move(desired_ranges), std::move(range_sources)).get();
sync_data_using_repair(db, keyspace_name, std::move(desired_ranges), std::move(range_sources), reason).get();
rlogger.info("bootstrap_with_repair: finished with keyspace={}, nr_ranges={}", keyspace_name, nr_ranges);
}
rlogger.info("bootstrap_with_repair: finished with keyspaces={}", keyspaces);
@@ -1730,6 +1734,7 @@ future<> do_decommission_removenode_with_repair(seastar::sharded<database>& db,
auto keyspaces = db.local().get_non_system_keyspaces();
bool is_removenode = myip != leaving_node;
auto op = is_removenode ? "removenode_with_repair" : "decommission_with_repair";
streaming::stream_reason reason = is_removenode ? streaming::stream_reason::removenode : streaming::stream_reason::decommission;
rlogger.info("{}: started with keyspaces={}, leaving_node={}", op, keyspaces, leaving_node);
for (auto& keyspace_name : keyspaces) {
if (!db.local().has_keyspace(keyspace_name)) {
@@ -1867,7 +1872,7 @@ future<> do_decommission_removenode_with_repair(seastar::sharded<database>& db,
ranges.swap(ranges_for_removenode);
}
auto nr_ranges_synced = ranges.size();
sync_data_using_repair(db, keyspace_name, std::move(ranges), std::move(range_sources)).get();
sync_data_using_repair(db, keyspace_name, std::move(ranges), std::move(range_sources), reason).get();
rlogger.info("{}: finished with keyspace={}, leaving_node={}, nr_ranges={}, nr_ranges_synced={}, nr_ranges_skipped={}",
op, keyspace_name, leaving_node, nr_ranges_total, nr_ranges_synced, nr_ranges_skipped);
}
@@ -1883,8 +1888,8 @@ future<> removenode_with_repair(seastar::sharded<database>& db, locator::token_m
return do_decommission_removenode_with_repair(db, std::move(tm), std::move(leaving_node));
}
future<> do_rebuild_replace_with_repair(seastar::sharded<database>& db, locator::token_metadata tm, sstring op, sstring source_dc) {
return seastar::async([&db, tm = std::move(tm), source_dc = std::move(source_dc), op = std::move(op)] () mutable {
future<> do_rebuild_replace_with_repair(seastar::sharded<database>& db, locator::token_metadata tm, sstring op, sstring source_dc, streaming::stream_reason reason) {
return seastar::async([&db, tm = std::move(tm), source_dc = std::move(source_dc), op = std::move(op), reason] () mutable {
auto keyspaces = db.local().get_non_system_keyspaces();
rlogger.info("{}: started with keyspaces={}, source_dc={}", op, keyspaces, source_dc);
auto myip = utils::fb_utilities::get_broadcast_address();
@@ -1921,7 +1926,7 @@ future<> do_rebuild_replace_with_repair(seastar::sharded<database>& db, locator:
}
}
auto nr_ranges = ranges.size();
sync_data_using_repair(db, keyspace_name, std::move(ranges), std::move(range_sources)).get();
sync_data_using_repair(db, keyspace_name, std::move(ranges), std::move(range_sources), reason).get();
rlogger.info("{}: finished with keyspace={}, source_dc={}, nr_ranges={}", op, keyspace_name, source_dc, nr_ranges);
}
rlogger.info("{}: finished with keyspaces={}, source_dc={}", op, keyspaces, source_dc);
@@ -1933,11 +1938,13 @@ future<> rebuild_with_repair(seastar::sharded<database>& db, locator::token_meta
if (source_dc.empty()) {
source_dc = get_local_dc();
}
return do_rebuild_replace_with_repair(db, std::move(tm), std::move(op), std::move(source_dc));
auto reason = streaming::stream_reason::rebuild;
return do_rebuild_replace_with_repair(db, std::move(tm), std::move(op), std::move(source_dc), reason);
}
future<> replace_with_repair(seastar::sharded<database>& db, locator::token_metadata tm) {
auto op = sstring("replace_with_repair");
auto source_dc = get_local_dc();
return do_rebuild_replace_with_repair(db, std::move(tm), std::move(op), std::move(source_dc));
auto reason = streaming::stream_reason::bootstrap;
return do_rebuild_replace_with_repair(db, std::move(tm), std::move(op), std::move(source_dc), reason);
}


@@ -181,6 +181,7 @@ public:
shard_id shard;
std::vector<sstring> data_centers;
std::vector<sstring> hosts;
streaming::stream_reason reason;
std::unordered_map<dht::token_range, repair_neighbors> neighbors;
size_t nr_failed_ranges = 0;
bool aborted = false;
@@ -211,7 +212,8 @@ public:
const std::vector<sstring>& cfs_,
int id_,
const std::vector<sstring>& data_centers_,
const std::vector<sstring>& hosts_);
const std::vector<sstring>& hosts_,
streaming::stream_reason reason_);
future<> do_streaming();
void check_failed_ranges();
future<> request_transfer_ranges(const sstring& cf,


@@ -443,7 +443,7 @@ class repair_writer {
uint64_t _estimated_partitions;
size_t _nr_peer_nodes;
// Needs more than one for repair master
std::vector<std::optional<future<uint64_t>>> _writer_done;
std::vector<std::optional<future<>>> _writer_done;
std::vector<std::optional<seastar::queue<mutation_fragment_opt>>> _mq;
// Current partition written to disk
std::vector<lw_shared_ptr<const decorated_key_with_hash>> _current_dk_written_to_sstable;
@@ -451,14 +451,18 @@ class repair_writer {
// partition_start is written and is closed when a partition_end is
// written.
std::vector<bool> _partition_opened;
streaming::stream_reason _reason;
named_semaphore _sem{1, named_semaphore_exception_factory{"repair_writer"}};
public:
repair_writer(
schema_ptr schema,
uint64_t estimated_partitions,
size_t nr_peer_nodes)
size_t nr_peer_nodes,
streaming::stream_reason reason)
: _schema(std::move(schema))
, _estimated_partitions(estimated_partitions)
, _nr_peer_nodes(nr_peer_nodes) {
, _nr_peer_nodes(nr_peer_nodes)
, _reason(reason) {
init_writer();
}
@@ -495,9 +499,9 @@ public:
table& t = db.local().find_column_family(_schema->id());
_writer_done[node_idx] = mutation_writer::distribute_reader_and_consume_on_shards(_schema,
make_generating_reader(_schema, std::move(get_next_mutation_fragment)),
[&db, estimated_partitions = this->_estimated_partitions] (flat_mutation_reader reader) {
[&db, reason = this->_reason, estimated_partitions = this->_estimated_partitions] (flat_mutation_reader reader) {
auto& t = db.local().find_column_family(reader.schema());
return db::view::check_needs_view_update_path(_sys_dist_ks->local(), t, streaming::stream_reason::repair).then([t = t.shared_from_this(), estimated_partitions, reader = std::move(reader)] (bool use_view_update_path) mutable {
return db::view::check_needs_view_update_path(_sys_dist_ks->local(), t, reason).then([t = t.shared_from_this(), estimated_partitions, reader = std::move(reader)] (bool use_view_update_path) mutable {
//FIXME: for better estimations this should be transmitted from remote
auto metadata = mutation_source_metadata{};
auto& cs = t->get_compaction_strategy();
@@ -523,7 +527,15 @@ public:
return consumer(std::move(reader));
});
},
t.stream_in_progress());
t.stream_in_progress()).then([this, node_idx] (uint64_t partitions) {
rlogger.debug("repair_writer: keyspace={}, table={}, managed to write partitions={} to sstable",
_schema->ks_name(), _schema->cf_name(), partitions);
}).handle_exception([this, node_idx] (std::exception_ptr ep) {
rlogger.warn("repair_writer: keyspace={}, table={}, multishard_writer failed: {}",
_schema->ks_name(), _schema->cf_name(), ep);
_mq[node_idx]->abort(ep);
return make_exception_future<>(std::move(ep));
});
}
future<> write_partition_end(unsigned node_idx) {
@@ -550,23 +562,41 @@ public:
}
}
future<> write_end_of_stream(unsigned node_idx) {
if (_mq[node_idx]) {
return with_semaphore(_sem, 1, [this, node_idx] {
// Partition_end is never sent on wire, so we have to write one ourselves.
return write_partition_end(node_idx).then([this, node_idx] () mutable {
// Empty mutation_fragment_opt means no more data, so the writer can seal the sstables.
return _mq[node_idx]->push_eventually(mutation_fragment_opt());
});
});
} else {
return make_ready_future<>();
}
}
future<> do_wait_for_writer_done(unsigned node_idx) {
if (_writer_done[node_idx]) {
return std::move(*(_writer_done[node_idx]));
} else {
return make_ready_future<>();
}
}
future<> wait_for_writer_done() {
return parallel_for_each(boost::irange(unsigned(0), unsigned(_nr_peer_nodes)), [this] (unsigned node_idx) {
if (_writer_done[node_idx] && _mq[node_idx]) {
// Partition_end is never sent on wire, so we have to write one ourselves.
return write_partition_end(node_idx).then([this, node_idx] () mutable {
// Empty mutation_fragment_opt means no more data, so the writer can seal the sstables.
return _mq[node_idx]->push_eventually(mutation_fragment_opt()).then([this, node_idx] () mutable {
return (*_writer_done[node_idx]).then([] (uint64_t partitions) {
rlogger.debug("Managed to write partitions={} to sstable", partitions);
return make_ready_future<>();
});
});
});
}
return make_ready_future<>();
return when_all_succeed(write_end_of_stream(node_idx), do_wait_for_writer_done(node_idx));
}).handle_exception([this] (std::exception_ptr ep) {
rlogger.warn("repair_writer: keyspace={}, table={}, wait_for_writer_done failed: {}",
_schema->ks_name(), _schema->cf_name(), ep);
return make_exception_future<>(std::move(ep));
});
}
named_semaphore& sem() {
return _sem;
}
};
class repair_meta {
@@ -590,6 +620,7 @@ private:
repair_master _repair_master;
gms::inet_address _myip;
uint32_t _repair_meta_id;
streaming::stream_reason _reason;
// Repair master's sharding configuration
shard_config _master_node_shard_config;
// Partitioner of repair master
@@ -653,6 +684,7 @@ public:
uint64_t seed,
repair_master master,
uint32_t repair_meta_id,
streaming::stream_reason reason,
shard_config master_node_shard_config,
size_t nr_peer_nodes = 1)
: _db(db)
@@ -666,6 +698,7 @@ public:
, _repair_master(master)
, _myip(utils::fb_utilities::get_broadcast_address())
, _repair_meta_id(repair_meta_id)
, _reason(reason)
, _master_node_shard_config(std::move(master_node_shard_config))
, _remote_partitioner(make_remote_partitioner())
, _same_sharding_config(is_same_sharding_config())
@@ -681,7 +714,7 @@ public:
_seed,
repair_reader::is_local_reader(_repair_master || _same_sharding_config)
)
, _repair_writer(_schema, _estimated_partitions, _nr_peer_nodes)
, _repair_writer(_schema, _estimated_partitions, _nr_peer_nodes, _reason)
, _sink_source_for_get_full_row_hashes(_repair_meta_id, _nr_peer_nodes,
[] (uint32_t repair_meta_id, netw::messaging_service::msg_addr addr) {
return netw::get_local_messaging_service().make_sink_and_source_for_repair_get_full_row_hashes_with_rpc_stream(repair_meta_id, addr);
@@ -731,7 +764,8 @@ public:
uint64_t max_row_buf_size,
uint64_t seed,
shard_config master_node_shard_config,
table_schema_version schema_version) {
table_schema_version schema_version,
streaming::stream_reason reason) {
return service::get_schema_for_write(schema_version, {from, src_cpu_id}).then([from,
repair_meta_id,
range,
@@ -739,7 +773,8 @@ public:
max_row_buf_size,
seed,
master_node_shard_config,
schema_version] (schema_ptr s) {
schema_version,
reason] (schema_ptr s) {
auto& db = service::get_local_storage_proxy().get_db();
auto& cf = db.local().find_column_family(s->id());
node_repair_meta_id id{from, repair_meta_id};
@@ -752,6 +787,7 @@ public:
seed,
repair_meta::repair_master::no,
repair_meta_id,
reason,
std::move(master_node_shard_config));
bool insertion = repair_meta_map().emplace(id, rm).second;
if (!insertion) {
@@ -1166,6 +1202,23 @@ private:
}
}
future<> do_apply_rows(std::list<repair_row>& row_diff, unsigned node_idx, update_working_row_buf update_buf) {
return with_semaphore(_repair_writer.sem(), 1, [this, node_idx, update_buf, &row_diff] {
_repair_writer.create_writer(_db, node_idx);
return do_for_each(row_diff, [this, node_idx, update_buf] (repair_row& r) {
if (update_buf) {
_working_row_buf_combined_hash.add(r.hash());
}
// The repair_row here is supposed to have
// mutation_fragment attached because we have stored it in
// to_repair_rows_list above where the repair_row is created.
mutation_fragment mf = std::move(r.get_mutation_fragment());
auto dk_with_hash = r.get_dk_with_hash();
return _repair_writer.do_write(node_idx, std::move(dk_with_hash), std::move(mf));
});
});
}
// Give a list of rows, apply the rows to disk and update the _working_row_buf and _peer_row_hash_sets if requested
// Must run inside a seastar thread
void apply_rows_on_master_in_thread(repair_rows_on_wire rows, gms::inet_address from, update_working_row_buf update_buf,
@@ -1191,18 +1244,7 @@ private:
_peer_row_hash_sets[node_idx] = boost::copy_range<std::unordered_set<repair_hash>>(row_diff |
boost::adaptors::transformed([] (repair_row& r) { thread::maybe_yield(); return r.hash(); }));
}
_repair_writer.create_writer(_db, node_idx);
for (auto& r : row_diff) {
if (update_buf) {
_working_row_buf_combined_hash.add(r.hash());
}
// The repair_row here is supposed to have
// mutation_fragment attached because we have stored it in
// to_repair_rows_list above where the repair_row is created.
mutation_fragment mf = std::move(r.get_mutation_fragment());
auto dk_with_hash = r.get_dk_with_hash();
_repair_writer.do_write(node_idx, std::move(dk_with_hash), std::move(mf)).get();
}
do_apply_rows(row_diff, node_idx, update_buf).get();
}
future<>
@@ -1213,15 +1255,7 @@ private:
return to_repair_rows_list(rows).then([this] (std::list<repair_row> row_diff) {
return do_with(std::move(row_diff), [this] (std::list<repair_row>& row_diff) {
unsigned node_idx = 0;
_repair_writer.create_writer(_db, node_idx);
return do_for_each(row_diff, [this, node_idx] (repair_row& r) {
// The repair_row here is supposed to have
// mutation_fragment attached because we have stored it in
// to_repair_rows_list above where the repair_row is created.
mutation_fragment mf = std::move(r.get_mutation_fragment());
auto dk_with_hash = r.get_dk_with_hash();
return _repair_writer.do_write(node_idx, std::move(dk_with_hash), std::move(mf));
});
return do_apply_rows(row_diff, node_idx, update_working_row_buf::no);
});
});
}
@@ -1412,28 +1446,28 @@ public:
// RPC API
future<>
repair_row_level_start(gms::inet_address remote_node, sstring ks_name, sstring cf_name, dht::token_range range, table_schema_version schema_version) {
repair_row_level_start(gms::inet_address remote_node, sstring ks_name, sstring cf_name, dht::token_range range, table_schema_version schema_version, streaming::stream_reason reason) {
if (remote_node == _myip) {
return make_ready_future<>();
}
stats().rpc_call_nr++;
return netw::get_local_messaging_service().send_repair_row_level_start(msg_addr(remote_node),
_repair_meta_id, std::move(ks_name), std::move(cf_name), std::move(range), _algo, _max_row_buf_size, _seed,
_master_node_shard_config.shard, _master_node_shard_config.shard_count, _master_node_shard_config.ignore_msb, _master_node_shard_config.partitioner_name, std::move(schema_version));
_master_node_shard_config.shard, _master_node_shard_config.shard_count, _master_node_shard_config.ignore_msb, _master_node_shard_config.partitioner_name, std::move(schema_version), reason);
}
// RPC handler
static future<>
repair_row_level_start_handler(gms::inet_address from, uint32_t src_cpu_id, uint32_t repair_meta_id, sstring ks_name, sstring cf_name,
dht::token_range range, row_level_diff_detect_algorithm algo, uint64_t max_row_buf_size,
uint64_t seed, shard_config master_node_shard_config, table_schema_version schema_version) {
uint64_t seed, shard_config master_node_shard_config, table_schema_version schema_version, streaming::stream_reason reason) {
if (!_sys_dist_ks->local_is_initialized() || !_view_update_generator->local_is_initialized()) {
return make_exception_future<>(std::runtime_error(format("Node {} is not fully initialized for repair, try again later",
utils::fb_utilities::get_broadcast_address())));
}
rlogger.debug(">>> Started Row Level Repair (Follower): local={}, peers={}, repair_meta_id={}, keyspace={}, cf={}, schema_version={}, range={}, seed={}, max_row_buf_size={}",
utils::fb_utilities::get_broadcast_address(), from, repair_meta_id, ks_name, cf_name, schema_version, range, seed, max_row_buf_size);
return insert_repair_meta(from, src_cpu_id, repair_meta_id, std::move(range), algo, max_row_buf_size, seed, std::move(master_node_shard_config), std::move(schema_version));
return insert_repair_meta(from, src_cpu_id, repair_meta_id, std::move(range), algo, max_row_buf_size, seed, std::move(master_node_shard_config), std::move(schema_version), reason);
}
// RPC API
@@ -1904,22 +1938,17 @@ static future<> repair_get_row_diff_with_rpc_stream_handler(
current_set_diff,
std::move(hash_cmd_opt)).handle_exception([sink, &error] (std::exception_ptr ep) mutable {
error = true;
return sink(repair_row_on_wire_with_cmd{repair_stream_cmd::error, repair_row_on_wire()}).then([sink] () mutable {
return sink.close();
}).then([sink] {
return sink(repair_row_on_wire_with_cmd{repair_stream_cmd::error, repair_row_on_wire()}).then([] {
return make_ready_future<stop_iteration>(stop_iteration::no);
});
});
} else {
if (error) {
return make_ready_future<stop_iteration>(stop_iteration::yes);
}
return sink.close().then([sink] {
return make_ready_future<stop_iteration>(stop_iteration::yes);
});
return make_ready_future<stop_iteration>(stop_iteration::yes);
}
});
});
}).finally([sink] () mutable {
return sink.close().finally([sink] { });
});
}
@@ -1945,22 +1974,17 @@ static future<> repair_put_row_diff_with_rpc_stream_handler(
current_rows,
std::move(row_opt)).handle_exception([sink, &error] (std::exception_ptr ep) mutable {
error = true;
return sink(repair_stream_cmd::error).then([sink] () mutable {
return sink.close();
}).then([sink] {
return sink(repair_stream_cmd::error).then([] {
return make_ready_future<stop_iteration>(stop_iteration::no);
});
});
} else {
if (error) {
return make_ready_future<stop_iteration>(stop_iteration::yes);
}
return sink.close().then([sink] {
return make_ready_future<stop_iteration>(stop_iteration::yes);
});
return make_ready_future<stop_iteration>(stop_iteration::yes);
}
});
});
}).finally([sink] () mutable {
return sink.close().finally([sink] { });
});
}
@@ -1985,22 +2009,17 @@ static future<> repair_get_full_row_hashes_with_rpc_stream_handler(
error,
std::move(status_opt)).handle_exception([sink, &error] (std::exception_ptr ep) mutable {
error = true;
return sink(repair_hash_with_cmd{repair_stream_cmd::error, repair_hash()}).then([sink] () mutable {
return sink.close();
}).then([sink] {
return sink(repair_hash_with_cmd{repair_stream_cmd::error, repair_hash()}).then([] () {
return make_ready_future<stop_iteration>(stop_iteration::no);
});
});
} else {
if (error) {
return make_ready_future<stop_iteration>(stop_iteration::yes);
}
return sink.close().then([sink] {
return make_ready_future<stop_iteration>(stop_iteration::yes);
});
return make_ready_future<stop_iteration>(stop_iteration::yes);
}
});
});
}).finally([sink] () mutable {
return sink.close().finally([sink] { });
});
}
@@ -2104,15 +2123,16 @@ future<> repair_init_messaging_service_handler(repair_service& rs, distributed<d
});
ms.register_repair_row_level_start([] (const rpc::client_info& cinfo, uint32_t repair_meta_id, sstring ks_name,
sstring cf_name, dht::token_range range, row_level_diff_detect_algorithm algo, uint64_t max_row_buf_size, uint64_t seed,
unsigned remote_shard, unsigned remote_shard_count, unsigned remote_ignore_msb, sstring remote_partitioner_name, table_schema_version schema_version) {
unsigned remote_shard, unsigned remote_shard_count, unsigned remote_ignore_msb, sstring remote_partitioner_name, table_schema_version schema_version, rpc::optional<streaming::stream_reason> reason) {
auto src_cpu_id = cinfo.retrieve_auxiliary<uint32_t>("src_cpu_id");
auto from = cinfo.retrieve_auxiliary<gms::inet_address>("baddr");
return smp::submit_to(src_cpu_id % smp::count, [from, src_cpu_id, repair_meta_id, ks_name, cf_name,
range, algo, max_row_buf_size, seed, remote_shard, remote_shard_count, remote_ignore_msb, remote_partitioner_name, schema_version] () mutable {
range, algo, max_row_buf_size, seed, remote_shard, remote_shard_count, remote_ignore_msb, remote_partitioner_name, schema_version, reason] () mutable {
streaming::stream_reason r = reason ? *reason : streaming::stream_reason::repair;
return repair_meta::repair_row_level_start_handler(from, src_cpu_id, repair_meta_id, std::move(ks_name),
std::move(cf_name), std::move(range), algo, max_row_buf_size, seed,
shard_config{remote_shard, remote_shard_count, remote_ignore_msb, std::move(remote_partitioner_name)},
schema_version);
schema_version, r);
});
});
ms.register_repair_row_level_stop([] (const rpc::client_info& cinfo, uint32_t repair_meta_id,
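The handler above takes the new `reason` argument as `rpc::optional` and falls back to `streaming::stream_reason::repair` when it is absent, presumably so that a sender that does not pass the argument is still accepted (an assumption; the diff only shows the fallback itself). A minimal sketch of that fallback, using `std::optional` as a stand-in for `rpc::optional` and a hypothetical enum in place of the real `streaming::stream_reason`:

```cpp
#include <cassert>
#include <optional>

// Hypothetical stand-in for streaming::stream_reason; the enumerator list is
// illustrative only and is not taken from the patch.
enum class stream_reason { unspecified, bootstrap, repair };

// Mirrors `reason ? *reason : streaming::stream_reason::repair` from the handler:
// a missing value falls back to `repair`, a present value is used as-is.
inline stream_reason effective_reason(std::optional<stream_reason> reason) {
    return reason.value_or(stream_reason::repair);
}
```

With this shape, a caller that sends no reason is treated as an ordinary repair, while a caller that sends one keeps its own value.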
@@ -2442,6 +2462,7 @@ public:
_seed,
repair_meta::repair_master::yes,
repair_meta_id,
_ri.reason,
std::move(master_node_shard_config),
_all_live_peer_nodes.size());
@@ -2456,7 +2477,7 @@ public:
nodes_to_stop.reserve(_all_nodes.size());
try {
parallel_for_each(_all_nodes, [&, this] (const gms::inet_address& node) {
return master.repair_row_level_start(node, _ri.keyspace, _cf_name, _range, schema_version).then([&] () {
return master.repair_row_level_start(node, _ri.keyspace, _cf_name, _range, schema_version, _ri.reason).then([&] () {
nodes_to_stop.push_back(node);
return master.repair_get_estimated_partitions(node).then([this, node] (uint64_t partitions) {
rlogger.trace("Get repair_get_estimated_partitions for node={}, estimated_partitions={}", node, partitions);


@@ -528,8 +528,12 @@ public:
return _reader.move_to_next_partition(timeout).then([this] (auto&& mfopt) mutable {
{
if (!mfopt) {
this->handle_end_of_stream();
return make_ready_future<flat_mutation_reader_opt, mutation_fragment_opt>(std::nullopt, std::nullopt);
return _cache._read_section(_cache._tracker.region(), [&] {
return with_linearized_managed_bytes([&] {
this->handle_end_of_stream();
return make_ready_future<flat_mutation_reader_opt, mutation_fragment_opt>(std::nullopt, std::nullopt);
});
});
}
_cache.on_partition_miss();
const partition_start& ps = mfopt->as_partition_start();
@@ -952,13 +956,15 @@ future<> row_cache::do_update(external_updater eu, memtable& m, Updater updater)
// expensive and we need to amortize it somehow.
do {
STAP_PROBE(scylla, row_cache_update_partition_start);
with_linearized_managed_bytes([&] {
{
if (!update) {
_update_section(_tracker.region(), [&] {
with_linearized_managed_bytes([&] {
memtable_entry& mem_e = *m.partitions.begin();
size_entry = mem_e.size_in_allocator_without_rows(_tracker.allocator());
auto cache_i = _partitions.lower_bound(mem_e.key(), cmp);
update = updater(_update_section, cache_i, mem_e, is_present, real_dirty_acc);
});
});
}
// We use cooperative deferring instead of futures so that
@@ -970,14 +976,16 @@ future<> row_cache::do_update(external_updater eu, memtable& m, Updater updater)
update = {};
real_dirty_acc.unpin_memory(size_entry);
_update_section(_tracker.region(), [&] {
with_linearized_managed_bytes([&] {
auto i = m.partitions.begin();
memtable_entry& mem_e = *i;
m.partitions.erase(i);
mem_e.partition().evict(_tracker.memtable_cleaner());
current_allocator().destroy(&mem_e);
});
});
++partition_count;
});
}
STAP_PROBE(scylla, row_cache_update_partition_end);
} while (!m.partitions.empty() && !need_preempt());
with_allocator(standard_allocator(), [&] {
@@ -1124,8 +1132,8 @@ future<> row_cache::invalidate(external_updater eu, dht::partition_range_vector&
seastar::thread::maybe_yield();
while (true) {
auto done = with_linearized_managed_bytes([&] {
return _update_section(_tracker.region(), [&] {
auto done = _update_section(_tracker.region(), [&] {
return with_linearized_managed_bytes([&] {
auto cmp = cache_entry::compare(_schema);
auto it = _partitions.lower_bound(*_prev_snapshot_pos, cmp);
auto end = _partitions.lower_bound(dht::ring_position_view::for_range_end(range), cmp);


@@ -319,10 +319,10 @@ schema::schema(const raw_schema& raw, std::optional<raw_view_info> raw_view_info
+ column_offset(column_kind::regular_column),
_raw._columns.end(), column_definition::name_comparator(regular_column_name_type()));
std::sort(_raw._columns.begin(),
std::stable_sort(_raw._columns.begin(),
_raw._columns.begin() + column_offset(column_kind::clustering_key),
[] (auto x, auto y) { return x.id < y.id; });
std::sort(_raw._columns.begin() + column_offset(column_kind::clustering_key),
std::stable_sort(_raw._columns.begin() + column_offset(column_kind::clustering_key),
_raw._columns.begin() + column_offset(column_kind::static_column),
[] (auto x, auto y) { return x.id < y.id; });
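The hunk above replaces `std::sort` with `std::stable_sort`. The difference only matters when two elements compare equal under the comparator: `std::stable_sort` keeps them in their original relative order, while `std::sort` makes no such promise. A small sketch of that property, using a hypothetical `column` struct rather than the real `column_definition`:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified column record; only `id` takes part in the ordering.
struct column {
    int id;
    std::string name;
};

// Sorts by id with the same comparator shape as the patch and returns the names,
// so the relative order of equal-id entries can be inspected.
inline std::vector<std::string> names_sorted_by_id(std::vector<column> cols) {
    std::stable_sort(cols.begin(), cols.end(),
                     [] (const column& x, const column& y) { return x.id < y.id; });
    std::vector<std::string> names;
    for (const auto& c : cols) {
        names.push_back(c.name);
    }
    return names;
}
```

Entries that share an id come out in the order they went in, which is what makes the result deterministic across standard library implementations.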


@@ -33,9 +33,10 @@ import os
procs = os.sysconf('SC_NPROCESSORS_ONLN')
mem = os.sysconf('SC_PHYS_PAGES') * os.sysconf('SC_PAGESIZE')
mem_reserve = 1000000000
job_mem = 4000000000
jobs = min(procs, mem // job_mem)
jobs = min(procs, (mem-mem_reserve) // job_mem)
jobs = max(jobs, 1)
print(jobs)

Submodule seastar updated: 92c488706c...4ee384e15f


@@ -190,4 +190,11 @@ future<> paxos_state::learn(schema_ptr schema, proposal decision, clock_type::ti
});
}
future<> paxos_state::prune(schema_ptr schema, const partition_key& key, utils::UUID ballot, clock_type::time_point timeout,
tracing::trace_state_ptr tr_state) {
logger.debug("Delete paxos state for ballot {}", ballot);
tracing::trace(tr_state, "Delete paxos state for ballot {}", ballot);
return db::system_keyspace::delete_paxos_decision(*schema, key, ballot, timeout);
}
} // end of namespace "service::paxos"


@@ -124,6 +124,9 @@ public:
clock_type::time_point timeout);
// Replica RPC endpoint for Paxos "learn".
static future<> learn(schema_ptr schema, proposal decision, clock_type::time_point timeout, tracing::trace_state_ptr tr_state);
// Replica RPC endpoint for pruning Paxos table
static future<> prune(schema_ptr schema, const partition_key& key, utils::UUID ballot, clock_type::time_point timeout,
tracing::trace_state_ptr tr_state);
};
} // end of namespace "service::paxos"


@@ -171,6 +171,7 @@ public:
const schema_ptr& schema() {
return _schema;
}
// called only when all replicas replied
virtual void release_mutation() = 0;
};
@@ -300,9 +301,10 @@ public:
class cas_mutation : public mutation_holder {
lw_shared_ptr<paxos::proposal> _proposal;
shared_ptr<paxos_response_handler> _handler;
public:
explicit cas_mutation(paxos::proposal proposal , schema_ptr s)
: _proposal(make_lw_shared<paxos::proposal>(std::move(proposal))) {
explicit cas_mutation(paxos::proposal proposal, schema_ptr s, shared_ptr<paxos_response_handler> handler)
: _proposal(make_lw_shared<paxos::proposal>(std::move(proposal))), _handler(std::move(handler)) {
_size = _proposal->update.representation().size();
_schema = std::move(s);
}
@@ -327,7 +329,11 @@ public:
return true;
}
virtual void release_mutation() override {
_proposal.release();
// The handler will be set for "learn", but not for PAXOS repair
// since repair may not include all replicas
if (_handler) {
_handler->prune(_proposal->ballot);
}
}
};
@@ -1184,6 +1190,12 @@ future<bool> paxos_response_handler::accept_proposal(const paxos::proposal& prop
return f;
}
// debug output in mutate_internal needs this
std::ostream& operator<<(std::ostream& os, const paxos_response_handler& h) {
os << "paxos_response_handler{" << h.id() << "}";
return os;
}
// This function implements learning stage of Paxos protocol
future<> paxos_response_handler::learn_decision(paxos::proposal decision, bool allow_hints) {
tracing::trace(tr_state, "learn_decision: committing {} with cl={}", decision, _cl_for_learn);
@@ -1219,12 +1231,41 @@ future<> paxos_response_handler::learn_decision(paxos::proposal decision, bool a
}
// Path for the "base" mutations
std::array<std::tuple<paxos::proposal, schema_ptr, dht::token>, 1> m{std::make_tuple(std::move(decision), _schema, _key.token())};
std::array<std::tuple<paxos::proposal, schema_ptr, shared_ptr<paxos_response_handler>, dht::token>, 1> m{std::make_tuple(std::move(decision), _schema, shared_from_this(), _key.token())};
future<> f_lwt = _proxy->mutate_internal(std::move(m), _cl_for_learn, false, tr_state, _permit, _timeout);
return when_all_succeed(std::move(f_cdc), std::move(f_lwt));
}
void paxos_response_handler::prune(utils::UUID ballot) {
if (_has_dead_endpoints) {
return;
}
if ( _proxy->get_stats().cas_now_pruning >= pruning_limit) {
_proxy->get_stats().cas_coordinator_dropped_prune++;
return;
}
_proxy->get_stats().cas_now_pruning++;
_proxy->get_stats().cas_prune++;
// Runs in the background, but the number of background jobs is limited by pruning_limit.
// Completion is awaited by holding a shared pointer to storage_proxy, which guarantees
// that storage_proxy::stop() will wait for this to complete.
(void)parallel_for_each(_live_endpoints, [this, ballot] (gms::inet_address peer) mutable {
return futurize_apply([&] {
if (fbu::is_me(peer)) {
tracing::trace(tr_state, "prune: prune {} locally", ballot);
return paxos::paxos_state::prune(_schema, _key.key(), ballot, _timeout, tr_state);
} else {
tracing::trace(tr_state, "prune: send prune of {} to {}", ballot, peer);
netw::messaging_service& ms = netw::get_local_messaging_service();
return ms.send_paxos_prune(peer, _timeout, _schema->version(), _key.key(), ballot, tracing::make_trace_info(tr_state));
}
});
}).finally([h = shared_from_this()] {
h->_proxy->get_stats().cas_now_pruning--;
});
}
static std::vector<gms::inet_address>
replica_ids_to_endpoints(locator::token_metadata& tm, const std::vector<utils::UUID>& replica_ids) {
std::vector<gms::inet_address> endpoints;
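The `prune()` function above bounds the amount of background work with a plain counter: if `cas_now_pruning` has reached `pruning_limit`, the prune is dropped and counted; otherwise the counter is incremented and later decremented in `.finally()`. The bookkeeping can be sketched in isolation as follows. `prune_limiter` and its member names are hypothetical and exist only for illustration; the patch keeps these counters in the storage proxy stats.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of the patch. It models cas_now_pruning,
// cas_prune and cas_coordinator_dropped_prune together with pruning_limit.
class prune_limiter {
    uint16_t _limit;
    uint16_t _in_flight = 0;  // corresponds to cas_now_pruning
    uint64_t _started = 0;    // corresponds to cas_prune
    uint64_t _dropped = 0;    // corresponds to cas_coordinator_dropped_prune
public:
    explicit prune_limiter(uint16_t limit) : _limit(limit) {}

    // Returns true if a background prune may start; otherwise records a drop.
    bool try_start() {
        if (_in_flight >= _limit) {
            ++_dropped;
            return false;
        }
        ++_in_flight;
        ++_started;
        return true;
    }

    // Called once a started prune completes (the .finally() step in the patch).
    void finish() { --_in_flight; }

    uint16_t in_flight() const { return _in_flight; }
    uint64_t started() const { return _started; }
    uint64_t dropped() const { return _dropped; }
};
```

Dropping instead of queueing keeps the background work bounded. The sketch does not model what a dropped prune means for the Paxos state that would otherwise have been deleted.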
@@ -1571,6 +1612,14 @@ void storage_proxy_stats::stats::register_stats() {
sm::make_histogram("cas_write_contention", sm::description("how many contended writes were encountered"),
{storage_proxy_stats::current_scheduling_group_label()},
[this]{ return cas_write_contention.get_histogram(1, 8);}),
sm::make_total_operations("cas_prune", cas_prune,
sm::description("how many times paxos prune was done after successful cas operation"),
{storage_proxy_stats::current_scheduling_group_label()}),
sm::make_total_operations("cas_dropped_prune", cas_coordinator_dropped_prune,
sm::description("how many times a coordinator did not perform prune after cas"),
{storage_proxy_stats::current_scheduling_group_label()}),
});
_metrics.add_group(REPLICA_STATS_CATEGORY, {
@@ -1606,6 +1655,9 @@ void storage_proxy_stats::stats::register_stats() {
sm::description("number of operations that crossed a shard boundary"),
{storage_proxy_stats::current_scheduling_group_label()}),
sm::make_total_operations("cas_dropped_prune", cas_replica_dropped_prune,
sm::description("how many times a replica did not perform prune after cas"),
{storage_proxy_stats::current_scheduling_group_label()}),
});
}
@@ -1879,11 +1931,11 @@ storage_proxy::create_write_response_handler(const std::unordered_map<gms::inet_
}
storage_proxy::response_id_type
storage_proxy::create_write_response_handler(const std::tuple<paxos::proposal, schema_ptr, dht::token>& meta,
storage_proxy::create_write_response_handler(const std::tuple<paxos::proposal, schema_ptr, shared_ptr<paxos_response_handler>, dht::token>& meta,
db::consistency_level cl, db::write_type type, tracing::trace_state_ptr tr_state, service_permit permit) {
auto& [commit, s, t] = meta;
auto& [commit, s, h, t] = meta;
return create_write_response_handler_helper(s, t, std::make_unique<cas_mutation>(std::move(commit), s), cl,
return create_write_response_handler_helper(s, t, std::make_unique<cas_mutation>(std::move(commit), s, std::move(h)), cl,
db::write_type::CAS, tr_state, std::move(permit));
}
@@ -1898,7 +1950,7 @@ storage_proxy::create_write_response_handler(const std::tuple<paxos::proposal, s
auto keyspace_name = s->ks_name();
keyspace& ks = _db.local().find_keyspace(keyspace_name);
return create_write_response_handler(ks, cl, db::write_type::CAS, std::make_unique<cas_mutation>(std::move(commit), s), std::move(endpoints),
return create_write_response_handler(ks, cl, db::write_type::CAS, std::make_unique<cas_mutation>(std::move(commit), s, nullptr), std::move(endpoints),
std::vector<gms::inet_address>(), std::vector<gms::inet_address>(), std::move(tr_state), get_stats(), std::move(permit));
}
@@ -2146,6 +2198,8 @@ storage_proxy::get_paxos_participants(const sstring& ks_name, const dht::token &
cl_for_paxos, participants + 1, live_endpoints.size());
}
bool dead = participants != live_endpoints.size();
// Apart from the ballot, paxos_state::prepare() also sends the current value of the requested key.
// If the values received from different replicas match, we skip a separate query stage thus saving
// one network round trip. To generate less traffic, only closest replicas send data, others send
@@ -2153,7 +2207,7 @@ storage_proxy::get_paxos_participants(const sstring& ks_name, const dht::token &
// list of participants by proximity to this instance.
sort_endpoints_by_proximity(live_endpoints);
return paxos_participants{std::move(live_endpoints), required_participants};
return paxos_participants{std::move(live_endpoints), required_participants, dead};
}
@@ -3412,7 +3466,9 @@ protected:
uint32_t original_partition_limit() const {
return _cmd->partition_limit;
}
virtual void adjust_targets_for_reconciliation() {}
void reconcile(db::consistency_level cl, storage_proxy::clock_type::time_point timeout, lw_shared_ptr<query::read_command> cmd) {
adjust_targets_for_reconciliation();
data_resolver_ptr data_resolver = ::make_shared<data_read_resolver>(_schema, cl, _targets.size(), timeout);
auto exec = shared_from_this();
@@ -3639,6 +3695,9 @@ public:
virtual void got_cl() override {
_speculate_timer.cancel();
}
virtual void adjust_targets_for_reconciliation() override {
_targets = used_targets();
}
};
class range_slice_read_executor : public never_speculating_read_executor {
@@ -4942,6 +5001,42 @@ void storage_proxy::init_messaging_service() {
return f;
});
ms.register_paxos_prune([this] (const rpc::client_info& cinfo, rpc::opt_time_point timeout,
utils::UUID schema_id, partition_key key, utils::UUID ballot, std::optional<tracing::trace_info> trace_info) {
static thread_local uint16_t pruning = 0;
static constexpr uint16_t pruning_limit = 1000; // since the PRUNE verb is one-way, the replica side has its own queue limit
auto src_addr = netw::messaging_service::get_source(cinfo);
auto src_ip = src_addr.addr;
tracing::trace_state_ptr tr_state;
if (trace_info) {
tr_state = tracing::tracing::get_local_tracing_instance().create_session(*trace_info);
tracing::begin(tr_state);
tracing::trace(tr_state, "paxos_prune: message received from /{} ballot {}", src_ip, ballot);
}
if (pruning >= pruning_limit) {
get_stats().cas_replica_dropped_prune++;
tracing::trace(tr_state, "paxos_prune: do not prune due to overload", src_ip);
return make_ready_future<seastar::rpc::no_wait_type>(netw::messaging_service::no_wait());
}
pruning++;
return get_schema_for_read(schema_id, src_addr).then([this, key = std::move(key), ballot,
timeout, tr_state = std::move(tr_state), src_ip] (schema_ptr schema) mutable {
dht::token token = dht::get_token(*schema, key);
unsigned shard = dht::shard_of(*schema, token);
bool local = shard == engine().cpu_id();
get_stats().replica_cross_shard_ops += !local;
return smp::submit_to(shard, _write_smp_service_group, [gs = global_schema_ptr(schema), gt = tracing::global_trace_state_ptr(std::move(tr_state)),
local, key = std::move(key), ballot, timeout, src_ip, d = defer([] { pruning--; })] () {
tracing::trace_state_ptr tr_state = gt;
return paxos::paxos_state::prune(gs, key, ballot, *timeout, tr_state).then([src_ip, tr_state] () {
tracing::trace(tr_state, "paxos_prune: handling is done, sending a response to /{}", src_ip);
return netw::messaging_service::no_wait();
});
});
});
});
}
future<> storage_proxy::uninit_messaging_service() {
@@ -4956,7 +5051,8 @@ future<> storage_proxy::uninit_messaging_service() {
ms.unregister_truncate(),
ms.unregister_paxos_prepare(),
ms.unregister_paxos_accept(),
ms.unregister_paxos_learn()
ms.unregister_paxos_learn(),
ms.unregister_paxos_prune()
);
}


@@ -242,6 +242,7 @@ public:
std::vector<gms::inet_address> endpoints;
// How many participants are required for a quorum (i.e. is it SERIAL or LOCAL_SERIAL).
size_t required_participants;
bool has_dead_endpoints;
};
const gms::feature_service& features() const { return _features; }
@@ -317,7 +318,7 @@ private:
response_id_type create_write_response_handler(const mutation&, db::consistency_level cl, db::write_type type, tracing::trace_state_ptr tr_state, service_permit permit);
response_id_type create_write_response_handler(const hint_wrapper&, db::consistency_level cl, db::write_type type, tracing::trace_state_ptr tr_state, service_permit permit);
response_id_type create_write_response_handler(const std::unordered_map<gms::inet_address, std::optional<mutation>>&, db::consistency_level cl, db::write_type type, tracing::trace_state_ptr tr_state, service_permit permit);
response_id_type create_write_response_handler(const std::tuple<paxos::proposal, schema_ptr, dht::token>& proposal,
response_id_type create_write_response_handler(const std::tuple<paxos::proposal, schema_ptr, shared_ptr<paxos_response_handler>, dht::token>& proposal,
db::consistency_level cl, db::write_type type, tracing::trace_state_ptr tr_state, service_permit permit);
response_id_type create_write_response_handler(const std::tuple<paxos::proposal, schema_ptr, dht::token, std::unordered_set<gms::inet_address>>& meta,
db::consistency_level cl, db::write_type type, tracing::trace_state_ptr tr_state, service_permit permit);
@@ -634,6 +635,11 @@ private:
db::consistency_level _cl_for_learn;
// Live endpoints, as per get_paxos_participants()
std::vector<gms::inet_address> _live_endpoints;
// True if there are dead endpoints
// We don't include endpoints known to be unavailable in pending
// endpoints list, but need to be aware of them to avoid pruning
// system.paxos data if some endpoint is missing a Paxos write.
bool _has_dead_endpoints;
// How many endpoints need to respond favourably for the protocol to progress to the next step.
size_t _required_participants;
// A deadline when the entire CAS operation timeout expires, derived from write_request_timeout_in_ms
@@ -651,6 +657,9 @@ private:
// Unique request id for logging purposes.
const uint64_t _id = next_id++;
// max pruning operations to run in parallel
static constexpr uint16_t pruning_limit = 1000;
public:
tracing::trace_state_ptr tr_state;
@@ -674,6 +683,7 @@ public:
storage_proxy::paxos_participants pp = _proxy->get_paxos_participants(_schema->ks_name(), _key.token(), _cl_for_paxos);
_live_endpoints = std::move(pp.endpoints);
_required_participants = pp.required_participants;
_has_dead_endpoints = pp.has_dead_endpoints;
tracing::trace(tr_state, "Create paxos_response_handler for token {} with live: {} and required participants: {}",
_key.token(), _live_endpoints, _required_participants);
}
@@ -691,6 +701,7 @@ public:
future<paxos::prepare_summary> prepare_ballot(utils::UUID ballot);
future<bool> accept_proposal(const paxos::proposal& proposal, bool timeout_if_partially_accepted = true);
future<> learn_decision(paxos::proposal decision, bool allow_hints = false);
void prune(utils::UUID ballot);
uint64_t id() const {
return _id;
}


@@ -116,6 +116,11 @@ struct write_stats {
uint64_t cas_write_condition_not_met = 0;
uint64_t cas_write_timeout_due_to_uncertainty = 0;
uint64_t cas_failed_read_round_optimization = 0;
uint16_t cas_now_pruning = 0;
uint64_t cas_prune = 0;
uint64_t cas_coordinator_dropped_prune = 0;
uint64_t cas_replica_dropped_prune = 0;
std::chrono::microseconds last_mv_flow_control_delay; // delay added for MV flow control in the last request
public:


@@ -1007,12 +1007,16 @@ storage_service::is_local_dc(const inet_address& targetHost) const {
std::unordered_map<dht::token_range, std::vector<inet_address>>
storage_service::get_range_to_address_map(const sstring& keyspace,
const std::vector<token>& sorted_tokens) const {
sstring ks = keyspace;
// some people just want to get a visual representation of things. Allow null and set it to the first
// non-system keyspace.
if (keyspace == "" && _db.local().get_non_system_keyspaces().empty()) {
throw std::runtime_error("No keyspace provided and no non-system keyspace exists");
if (keyspace == "") {
auto keyspaces = _db.local().get_non_system_keyspaces();
if (keyspaces.empty()) {
throw std::runtime_error("No keyspace provided and no non-system keyspace exists");
}
ks = keyspaces[0];
}
const sstring& ks = (keyspace == "") ? _db.local().get_non_system_keyspaces()[0] : keyspace;
return construct_range_to_endpoint_map(ks, get_all_ranges(sorted_tokens));
}
@@ -2171,7 +2175,8 @@ storage_service::get_snapshot_details() {
}
future<int64_t> storage_service::true_snapshots_size() {
return _db.map_reduce(adder<int64_t>(), [] (database& db) {
return run_snapshot_list_operation([] {
return get_local_storage_service()._db.map_reduce(adder<int64_t>(), [] (database& db) {
return do_with(int64_t(0), [&db] (auto& local_total) {
return parallel_for_each(db.get_column_families(), [&local_total] (auto& cf_pair) {
return cf_pair.second->get_snapshot_details().then([&local_total] (auto map) {
@@ -2185,6 +2190,7 @@ future<int64_t> storage_service::true_snapshots_size() {
});
});
});
});
}
static std::atomic<bool> isolated = { false };
@@ -3409,10 +3415,13 @@ void feature_enabled_listener::on_enabled() {
future<> read_sstables_format(distributed<storage_service>& ss) {
return db::system_keyspace::get_scylla_local_param(SSTABLE_FORMAT_PARAM_NAME).then([&ss] (std::optional<sstring> format_opt) {
sstables::sstable_version_types format = sstables::from_string(format_opt.value_or("ka"));
return ss.invoke_on_all([format] (storage_service& s) {
s._sstables_format = format;
});
if (format_opt) {
sstables::sstable_version_types format = sstables::from_string(*format_opt);
return ss.invoke_on_all([format] (storage_service& s) {
s._sstables_format = format;
});
}
return make_ready_future<>();
});
}


@@ -312,7 +312,13 @@ private:
*/
std::optional<db_clock::time_point> _cdc_streams_ts;
sstables::sstable_version_types _sstables_format = sstables::sstable_version_types::ka;
// _sstables_format is the format used for writing new sstables.
// Here we set its default value, but if we discover that all the nodes
// in the cluster support a newer format, _sstables_format will be set to
// that format. read_sstables_format() also overwrites _sstables_format
// if an sstable format was chosen earlier (and this choice was persisted
// in the system table).
sstables::sstable_version_types _sstables_format = sstables::sstable_version_types::la;
seastar::named_semaphore _feature_listeners_sem = {1, named_semaphore_exception_factory{"feature listeners"}};
feature_enabled_listener _la_feature_listener;
feature_enabled_listener _mc_feature_listener;


@@ -72,47 +72,8 @@ private:
static std::vector<column_info> build(
const schema& s,
const utils::chunked_vector<serialization_header::column_desc>& src,
bool is_static) {
std::vector<column_info> cols;
if (s.is_dense()) {
const column_definition& col = is_static ? *s.static_begin() : *s.regular_begin();
cols.push_back(column_info{
&col.name(),
col.type,
col.id,
col.type->value_length_if_fixed(),
col.is_multi_cell(),
col.is_counter(),
false
});
} else {
cols.reserve(src.size());
for (auto&& desc : src) {
const bytes& type_name = desc.type_name.value;
data_type type = db::marshal::type_parser::parse(to_sstring_view(type_name));
const column_definition* def = s.get_column_definition(desc.name.value);
std::optional<column_id> id;
bool schema_mismatch = false;
if (def) {
id = def->id;
schema_mismatch = def->is_multi_cell() != type->is_multi_cell() ||
def->is_counter() != type->is_counter() ||
!def->type->is_value_compatible_with(*type);
}
cols.push_back(column_info{
&desc.name.value,
type,
id,
type->value_length_if_fixed(),
type->is_multi_cell(),
type->is_counter(),
schema_mismatch
});
}
boost::range::stable_partition(cols, [](const column_info& column) { return !column.is_collection; });
}
return cols;
}
const sstable_enabled_features& features,
bool is_static);
utils::UUID schema_uuid;
std::vector<column_info> regular_schema_columns_from_sstable;
@@ -125,10 +86,10 @@ private:
state(state&&) = default;
state& operator=(state&&) = default;
state(const schema& s, const serialization_header& header)
state(const schema& s, const serialization_header& header, const sstable_enabled_features& features)
: schema_uuid(s.version())
, regular_schema_columns_from_sstable(build(s, header.regular_columns.elements, false))
, static_schema_columns_from_sstable(build(s, header.static_columns.elements, true))
, regular_schema_columns_from_sstable(build(s, header.regular_columns.elements, features, false))
, static_schema_columns_from_sstable(build(s, header.static_columns.elements, features, true))
, clustering_column_value_fix_lengths (get_clustering_values_fixed_lengths(header))
{}
};
@@ -136,9 +97,10 @@ private:
lw_shared_ptr<const state> _state = make_lw_shared<const state>();
public:
column_translation get_for_schema(const schema& s, const serialization_header& header) {
column_translation get_for_schema(
const schema& s, const serialization_header& header, const sstable_enabled_features& features) {
if (s.version() != _state->schema_uuid) {
_state = make_lw_shared(state(s, header));
_state = make_lw_shared(state(s, header, features));
}
return *this;
}


@@ -708,8 +708,8 @@ future<> compaction_manager::perform_sstable_upgrade(column_family* cf, bool exc
// Note that we potentially could be doing multiple
// upgrades here in parallel, but that is really the user's
// problem.
return rewrite_sstables(cf, sstables::compaction_options::make_upgrade(), [&](auto&) {
return tables;
return rewrite_sstables(cf, sstables::compaction_options::make_upgrade(), [&](auto&) mutable {
return std::exchange(tables, {});
});
});
});


@@ -792,7 +792,11 @@ uint64_t time_window_compaction_strategy::adjust_partition_estimate(const mutati
}
const auto min_window = get_window_for(_options, *ms_meta.min_timestamp);
const auto max_window = get_window_for(_options, *ms_meta.max_timestamp);
return partition_estimate / (max_window - min_window + 1);
const auto window_size = get_window_size(_options);
auto estimated_window_count = (max_window + (window_size - 1) - min_window) / window_size;
return partition_estimate / std::max(1UL, uint64_t(estimated_window_count));
}
namespace {
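The arithmetic in the hunk above can be sketched in Python. This is a minimal illustration, assuming the window values are window lower bounds expressed in the same unit as the window size (microseconds here) and that `max_window >= min_window`; the function names are hypothetical and not part of the patch.

```python
HOUR_US = 3_600_000_000  # one hour in microseconds (illustrative window size)

def old_adjust(partition_estimate, min_window, max_window):
    # Pre-patch expression: divides by a raw timestamp difference.
    return partition_estimate // (max_window - min_window + 1)

def new_adjust(partition_estimate, min_window, max_window, window_size):
    # Patched expression: convert the timestamp span into a window count,
    # then clamp the divisor to at least 1.
    estimated_window_count = (max_window + (window_size - 1) - min_window) // window_size
    return partition_estimate // max(1, estimated_window_count)

# A span of three one-hour windows: the old form collapses the estimate to 0.
print(old_adjust(1000, 0, 3 * HOUR_US))           # 0
print(new_adjust(1000, 0, 3 * HOUR_US, HOUR_US))  # 333
```

With both timestamps in the same window the divisor is clamped to 1, so the estimate is returned unchanged.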


@@ -85,7 +85,7 @@ private:
} _state = state::START;
temporary_buffer<char> _key;
uint32_t _promoted_index_end;
uint64_t _promoted_index_end;
uint64_t _position;
uint64_t _partition_header_length = 0;
std::optional<deletion_time> _deletion_time;
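The hunk above only shows the field widening from `uint32_t` to `uint64_t`. As a hedged illustration of why that matters, assuming the field holds a byte position that can exceed 4 GiB (the diff itself does not say so), this sketch shows what a 32-bit field would retain; `as_uint32` is a hypothetical helper.

```python
def as_uint32(value):
    # Keep only the low 32 bits, as storing into a uint32_t would.
    return value & 0xFFFFFFFF

GIB = 1024 ** 3
position = 5 * GIB  # a position beyond the 4 GiB limit of 32 bits

print(as_uint32(position) == position)  # False: the value was truncated
print(as_uint32(position) // GIB)       # 1: only 1 GiB of the 5 GiB survives
```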


@@ -38,6 +38,8 @@
*/
#include "mp_row_consumer.hh"
#include "column_translation.hh"
#include "concrete_types.hh"
namespace sstables {
@@ -79,4 +81,86 @@ atomic_cell make_counter_cell(api::timestamp_type timestamp, bytes_view value) {
return ccb.build(timestamp);
}
// See #6130.
static data_type freeze_types_in_collections(data_type t) {
return ::visit(*t, make_visitor(
[] (const map_type_impl& typ) -> data_type {
return map_type_impl::get_instance(
freeze_types_in_collections(typ.get_keys_type()->freeze()),
freeze_types_in_collections(typ.get_values_type()->freeze()),
typ.is_multi_cell());
},
[] (const set_type_impl& typ) -> data_type {
return set_type_impl::get_instance(
freeze_types_in_collections(typ.get_elements_type()->freeze()),
typ.is_multi_cell());
},
[] (const list_type_impl& typ) -> data_type {
return list_type_impl::get_instance(
freeze_types_in_collections(typ.get_elements_type()->freeze()),
typ.is_multi_cell());
},
[&] (const abstract_type& typ) -> data_type {
return std::move(t);
}
));
}
/* If this function returns false, the caller cannot assume that the SSTable comes from Scylla.
* It might, if for some reason a table was created using Scylla that didn't contain any feature bit,
* but that should never happen. */
static bool is_certainly_scylla_sstable(const sstable_enabled_features& features) {
return features.enabled_features;
}
std::vector<column_translation::column_info> column_translation::state::build(
const schema& s,
const utils::chunked_vector<serialization_header::column_desc>& src,
const sstable_enabled_features& features,
bool is_static) {
std::vector<column_info> cols;
if (s.is_dense()) {
const column_definition& col = is_static ? *s.static_begin() : *s.regular_begin();
cols.push_back(column_info{
&col.name(),
col.type,
col.id,
col.type->value_length_if_fixed(),
col.is_multi_cell(),
col.is_counter(),
false
});
} else {
cols.reserve(src.size());
for (auto&& desc : src) {
const bytes& type_name = desc.type_name.value;
data_type type = db::marshal::type_parser::parse(to_sstring_view(type_name));
if (!features.is_enabled(CorrectUDTsInCollections) && is_certainly_scylla_sstable(features)) {
// See #6130.
type = freeze_types_in_collections(std::move(type));
}
const column_definition* def = s.get_column_definition(desc.name.value);
std::optional<column_id> id;
bool schema_mismatch = false;
if (def) {
id = def->id;
schema_mismatch = def->is_multi_cell() != type->is_multi_cell() ||
def->is_counter() != type->is_counter() ||
!def->type->is_value_compatible_with(*type);
}
cols.push_back(column_info{
&desc.name.value,
type,
id,
type->value_length_if_fixed(),
type->is_multi_cell(),
type->is_counter(),
schema_mismatch
});
}
boost::range::stable_partition(cols, [](const column_info& column) { return !column.is_collection; });
}
return cols;
}
}


@@ -67,9 +67,13 @@ data_consume_rows<data_consume_rows_context_m>(const schema& s, shared_sstable,
static
position_in_partition_view get_slice_upper_bound(const schema& s, const query::partition_slice& slice, dht::ring_position_view key) {
const auto& ranges = slice.row_ranges(s, *key.key());
return ranges.empty()
? position_in_partition_view::for_static_row()
: position_in_partition_view::for_range_end(ranges.back());
if (ranges.empty()) {
return position_in_partition_view::for_static_row();
}
if (slice.options.contains(query::partition_slice::option::reversed)) {
return position_in_partition_view::for_range_end(ranges.front());
}
return position_in_partition_view::for_range_end(ranges.back());
}
GCC6_CONCEPT(
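The selection logic of the patched `get_slice_upper_bound` can be sketched as follows. This is a simplified model, assuming that for a reversed slice the row ranges are stored in reverse order, so the first range is the one that ends last; the tuple representation and the function name are hypothetical.

```python
def slice_upper_bound(ranges, reversed_slice):
    # No clustering ranges: the upper bound is the static row position.
    if not ranges:
        return ("static_row", None)
    # Reversed slices take the end of the first range, others the last one.
    chosen = ranges[0] if reversed_slice else ranges[-1]
    return ("range_end", chosen)

forward = [(0, 5), (10, 20)]
backward = [(10, 20), (0, 5)]  # same ranges, stored in reverse order

print(slice_upper_bound(forward, reversed_slice=False))  # ('range_end', (10, 20))
print(slice_upper_bound(backward, reversed_slice=True))  # ('range_end', (10, 20))
```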


@@ -1348,7 +1348,7 @@ public:
, _consumer(consumer)
, _sst(sst)
, _header(sst->get_serialization_header())
, _column_translation(sst->get_column_translation(s, _header))
, _column_translation(sst->get_column_translation(s, _header, sst->features()))
, _has_shadowable_tombstones(sst->has_shadowable_tombstones())
{
setup_columns(_regular_row, _column_translation.regular_columns());


@@ -792,8 +792,9 @@ public:
const serialization_header& get_serialization_header() const {
return get_mutable_serialization_header(*_components);
}
column_translation get_column_translation(const schema& s, const serialization_header& h) {
return _column_translation.get_for_schema(s, h);
column_translation get_column_translation(
const schema& s, const serialization_header& h, const sstable_enabled_features& f) {
return _column_translation.get_for_schema(s, h, f);
}
const std::vector<unsigned>& get_shards_for_this_sstable() const {
return _shards;


@@ -305,6 +305,11 @@ public:
get_window_for(const time_window_compaction_strategy_options& options, api::timestamp_type ts) {
return get_window_lower_bound(options.sstable_window_size, to_timestamp_type(options.timestamp_resolution, ts));
}
static api::timestamp_type
get_window_size(const time_window_compaction_strategy_options& options) {
return timestamp_type(std::chrono::duration_cast<std::chrono::microseconds>(options.get_sstable_window_size()).count());
}
private:
void update_estimated_compaction_by_tasks(std::map<timestamp_type, std::vector<shared_sstable>>& tasks, int min_threshold) {
int64_t n = 0;


@@ -459,7 +459,8 @@ enum sstable_feature : uint8_t {
ShadowableTombstones = 2, // See #3885
CorrectStaticCompact = 3, // See #4139
CorrectEmptyCounters = 4, // See #4363
End = 5,
CorrectUDTsInCollections = 5, // See #6130
End = 6,
};
// Scylla-specific features enabled for a particular sstable.


@@ -44,6 +44,7 @@
#include "streaming/stream_reason.hh"
#include "streaming/stream_mutation_fragments_cmd.hh"
#include "mutation_reader.hh"
#include "flat_mutation_reader.hh"
#include "frozen_mutation.hh"
#include "mutation.hh"
#include "message/messaging_service.hh"
@@ -203,15 +204,27 @@ future<> send_mutation_fragments(lw_shared_ptr<send_info> si) {
}();
auto sink_op = [sink, si, got_error_from_peer] () mutable -> future<> {
return do_with(std::move(sink), [si, got_error_from_peer] (rpc::sink<frozen_mutation_fragment, stream_mutation_fragments_cmd>& sink) {
return repeat([&sink, si, got_error_from_peer] () mutable {
return si->reader(db::no_timeout).then([&sink, si, s = si->reader.schema(), got_error_from_peer] (mutation_fragment_opt mf) mutable {
if (mf && !(*got_error_from_peer)) {
mutation_fragment_stream_validator validator(*(si->reader.schema()));
return do_with(std::move(sink), std::move(validator), [si, got_error_from_peer] (rpc::sink<frozen_mutation_fragment, stream_mutation_fragments_cmd>& sink, mutation_fragment_stream_validator& validator) {
return repeat([&sink, &validator, si, got_error_from_peer] () mutable {
return si->reader(db::no_timeout).then([&sink, &validator, si, s = si->reader.schema(), got_error_from_peer] (mutation_fragment_opt mf) mutable {
if (*got_error_from_peer) {
return make_exception_future<stop_iteration>(std::runtime_error("Got status error code from peer"));
}
if (mf) {
if (!validator(mf->mutation_fragment_kind())) {
return make_exception_future<stop_iteration>(std::runtime_error(format("Stream reader mutation_fragment validator failed, previous={}, current={}",
validator.previous_mutation_fragment_kind(), mf->mutation_fragment_kind())));
}
frozen_mutation_fragment fmf = freeze(*s, *mf);
auto size = fmf.representation().size();
streaming::get_local_stream_manager().update_progress(si->plan_id, si->id.addr, streaming::progress_info::direction::OUT, size);
return sink(fmf, stream_mutation_fragments_cmd::mutation_fragment_data).then([] { return stop_iteration::no; });
} else {
if (!validator.on_end_of_stream()) {
return make_exception_future<stop_iteration>(std::runtime_error(format("Stream reader mutation_fragment validator failed on end_of_stream, previous={}, current=end_of_stream",
validator.previous_mutation_fragment_kind())));
}
return make_ready_future<stop_iteration>(stop_iteration::yes);
}
});
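The hunk above calls a validator on each fragment kind and again at end of stream, but the validator's rules are not part of this diff. The sketch below is one plausible rule set, stated as an assumption: a partition start may only follow a partition end, a static row may only follow a partition start, rows and range tombstones need an open partition, and the stream may only end after a partition end. All names are hypothetical.

```python
from enum import Enum, auto

class Kind(Enum):
    PARTITION_START = auto()
    STATIC_ROW = auto()
    CLUSTERING_ROW = auto()
    RANGE_TOMBSTONE = auto()
    PARTITION_END = auto()

class FragmentKindValidator:
    def __init__(self):
        # Starting from PARTITION_END makes an empty stream valid.
        self.previous = Kind.PARTITION_END

    def __call__(self, kind):
        prev = self.previous
        if kind is Kind.PARTITION_START:
            ok = prev is Kind.PARTITION_END
        elif kind is Kind.STATIC_ROW:
            ok = prev is Kind.PARTITION_START
        else:
            # Rows, range tombstones and PARTITION_END need an open partition.
            ok = prev is not Kind.PARTITION_END
        if ok:
            self.previous = kind
        return ok

    def on_end_of_stream(self):
        return self.previous is Kind.PARTITION_END

v = FragmentKindValidator()
print(v(Kind.PARTITION_START), v(Kind.CLUSTERING_ROW))  # True True
print(v.on_end_of_stream())                             # False: partition still open
```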

test.py

@@ -203,6 +203,17 @@ class CqlTestSuite(TestSuite):
def pattern(self):
return "*_test.cql"
class RunTestSuite(TestSuite):
"""TestSuite for test directory with a 'run' script """
def add_test(self, shortname, mode, options):
test = RunTest(self.next_id, shortname, self, mode, options)
self.tests.append(test)
@property
def pattern(self):
return "run"
class Test:
"""Base class for CQL, Unit and Boost tests"""
@@ -332,6 +343,25 @@ class CqlTest(Test):
if self.is_equal_result is False:
print_unidiff(self.result, self.reject)
class RunTest(Test):
"""Run tests in a directory started by a run script"""
def __init__(self, test_no, shortname, suite, mode, options):
super().__init__(test_no, shortname, suite, mode, options)
self.path = os.path.join(suite.path, shortname)
self.xmlout = os.path.join(options.tmpdir, self.mode, "xml", self.uname + ".xunit.xml")
self.args = ["--junit-xml={}".format(self.xmlout)]
self.env = { 'SCYLLA': os.path.join("build", self.mode, "scylla") }
def print_summary(self):
print("Output of {} {}:".format(self.path, " ".join(self.args)))
print(read_log(self.log_filename))
async def run(self, options):
# This test can and should be killed gently, with SIGTERM, not with SIGKILL
self.success = await run_test(self, options, gentle_kill=True, env=self.env)
logging.info("Test #%d %s", self.id, "succeeded" if self.success else "failed ")
return self
class TabularConsoleOutput:
"""Print test progress to the console"""
@@ -375,7 +405,7 @@ class TabularConsoleOutput:
print(msg)
async def run_test(test, options):
async def run_test(test, options, gentle_kill=False, env=dict()):
"""Run test program, return True if success else False"""
with open(test.log_filename, "wb") as log:
@@ -407,6 +437,7 @@ async def run_test(test, options):
env=dict(os.environ,
UBSAN_OPTIONS=":".join(filter(None, UBSAN_OPTIONS)),
ASAN_OPTIONS=":".join(filter(None, ASAN_OPTIONS)),
**env,
),
preexec_fn=os.setsid,
)
@@ -423,7 +454,10 @@ async def run_test(test, options):
return True
except (asyncio.TimeoutError, asyncio.CancelledError) as e:
if process is not None:
process.kill()
if gentle_kill:
process.terminate()
else:
process.kill()
stdout, _ = await process.communicate()
if isinstance(e, asyncio.TimeoutError):
report_error("Test timed out")
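The difference between the two kill modes in the hunk above can be sketched with a standalone asyncio helper. This is a minimal illustration on POSIX, not the `run_test` function itself; `run_with_timeout` and `SLEEPER` are hypothetical names.

```python
import asyncio
import signal
import sys

# A child process that would outlive any short timeout.
SLEEPER = [sys.executable, "-c", "import time; time.sleep(60)"]

async def run_with_timeout(argv, timeout, gentle_kill=False):
    process = await asyncio.create_subprocess_exec(*argv)
    try:
        await asyncio.wait_for(process.wait(), timeout)
    except asyncio.TimeoutError:
        if gentle_kill:
            process.terminate()  # SIGTERM: the child may clean up
        else:
            process.kill()       # SIGKILL: the child cannot react
        await process.wait()
    return process.returncode

if __name__ == "__main__":
    code = asyncio.run(run_with_timeout(SLEEPER, 0.5, gentle_kill=True))
    print(code == -signal.SIGTERM)  # on POSIX, a negative code names the signal
```

A run script that starts a server and registers a cleanup trap only gets to run that trap in the SIGTERM case.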


@@ -54,6 +54,8 @@ def pytest_addoption(parser):
parser.addoption("--https", action="store_true",
help="communicate via HTTPS protocol on port 8043 instead of HTTP when"
" running against a local Scylla installation")
parser.addoption("--url", action="store",
help="communicate with given URL instead of defaults")
# "dynamodb" fixture: set up client object for communicating with the DynamoDB
# API. Currently this chooses either Amazon's DynamoDB in the default region
@@ -70,7 +72,10 @@ def dynamodb(request):
# requires us to specify dummy region and credential parameters,
# otherwise the user is forced to properly configure ~/.aws even
# for local runs.
local_url = 'https://localhost:8043' if request.config.getoption('https') else 'http://localhost:8000'
if request.config.getoption('url') != None:
local_url = request.config.getoption('url')
else:
local_url = 'https://localhost:8043' if request.config.getoption('https') else 'http://localhost:8000'
# Disable verifying in order to be able to use self-signed TLS certificates
verify = not request.config.getoption('https')
return boto3.resource('dynamodb', endpoint_url=local_url, verify=verify,
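The endpoint choice in the hunk above reduces to a small precedence rule: an explicit `--url` wins, otherwise `--https` selects between the two local defaults. A minimal sketch as a pure function (`pick_endpoint` is a hypothetical name; the fixture itself reads the options from `request.config`):

```python
def pick_endpoint(url=None, https=False):
    # An explicitly given URL overrides the local defaults.
    if url is not None:
        return url
    return 'https://localhost:8043' if https else 'http://localhost:8000'

print(pick_endpoint())                          # http://localhost:8000
print(pick_endpoint(https=True))                # https://localhost:8043
print(pick_endpoint(url='http://127.1.2.3:8000', https=True))  # http://127.1.2.3:8000
```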


@@ -4,24 +4,30 @@
set -e
script_path=$(dirname $(readlink -e $0))
source_path=$script_path/../..
# By default, we take the latest build/*/scylla as the executable:
SCYLLA=${SCYLLA-$(ls -t "$script_path/../build/"*"/scylla" | head -1)}
SCYLLA=${SCYLLA-$(ls -t "$source_path/build/"*"/scylla" | head -1)}
SCYLLA=$(readlink -f "$SCYLLA")
SCYLLA_IP=${IP-127.0.0.1}
CPUSET=${CPUSET-0}
CQLSH=${CQLSH-cqlsh}
# We need to use cqlsh to set up the authentication credentials expected by
# some of the tests that check check authentication. If cqlsh is not installed
# there isn't much point of even starting Scylla
if ! type "$CQLSH" >/dev/null 2>&1
# Below, we need to use python3 and the Cassandra driver to set up the
# authentication credentials expected by some of the tests that check
# authentication. If they are not installed, there isn't much point in
# even starting Scylla.
if ! python3 -c 'from cassandra.cluster import Cluster' >/dev/null 2>&1
then
echo "Error: cannot find '$CQLSH', needed for configuring Alternator authentication." >&2
echo "Please install $CQLSH in your path, or set CQLSH to its location." >&2
echo "Error: python3 and python3-cassandra-driver must be installed to configure Alternator authentication." >&2
exit 1
fi
# Pick a loopback IP address for Scylla to run on, in an attempt not to collide
# with other concurrent runs of Scylla. CCM uses 127.0.0.<nodenum>, so we use
# 127.1.*.*, which cannot collide with it. Moreover, we'll take the last two
# bytes of the address from the current process ID, so as to allow multiple
# concurrent runs of this code to use different addresses.
SCYLLA_IP=127.1.$(($$ >> 8 & 255)).$(($$ & 255))
echo "Running Scylla on $SCYLLA_IP"
tmp_dir=/tmp/alternator-test-$$
mkdir $tmp_dir
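The shell arithmetic that derives `SCYLLA_IP` from the process ID can be mirrored in Python, where `>>` also binds tighter than `&`. A minimal sketch; `loopback_ip_for_pid` is a hypothetical name.

```python
def loopback_ip_for_pid(pid):
    # Third octet: bits 8..15 of the PID; fourth octet: bits 0..7.
    return "127.1.{}.{}".format(pid >> 8 & 255, pid & 255)

print(loopback_ip_for_pid(0x1234))  # 127.1.18.52
print(loopback_ip_for_pid(70000))   # 127.1.17.112 (bits above 15 are dropped)
```

Only the low 16 bits of the PID are used, so two processes whose PIDs differ by a multiple of 65536 would still pick the same address.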
@@ -52,6 +58,7 @@ trap 'cleanup' EXIT
# to work. We only need to do this if the "--https" option was explicitly
# passed - otherwise the test would not use HTTPS anyway.
alternator_port_option="--alternator-port=8000"
alternator_url="http://$SCYLLA_IP:8000"
for i
do
if [ "$i" = --https ]
@@ -59,53 +66,61 @@ do
openssl genrsa 2048 > "$tmp_dir/scylla.key"
openssl req -new -x509 -nodes -sha256 -days 365 -subj "/C=IL/ST=None/L=None/O=None/OU=None/CN=example.com" -key "$tmp_dir/scylla.key" -out "$tmp_dir/scylla.crt"
alternator_port_option="--alternator-https-port=8043"
alternator_url="https://$SCYLLA_IP:8043"
fi
done
"$SCYLLA" --options-file "$script_path/../conf/scylla.yaml" \
--alternator-address $SCYLLA_IP \
"$SCYLLA" --options-file "$source_path/conf/scylla.yaml" \
--alternator-address $SCYLLA_IP \
$alternator_port_option \
--alternator-enforce-authorization=1 \
--experimental=on --developer-mode=1 \
--developer-mode=1 \
--ring-delay-ms 0 --collectd 0 \
--cpuset "$CPUSET" -m 1G \
--api-address $SCYLLA_IP --rpc-address $SCYLLA_IP \
--smp 2 -m 1G \
--overprovisioned --unsafe-bypass-fsync 1 \
--api-address $SCYLLA_IP \
--rpc-address $SCYLLA_IP \
--listen-address $SCYLLA_IP \
--prometheus-address $SCYLLA_IP \
--seed-provider-parameters seeds=$SCYLLA_IP \
--workdir "$tmp_dir" \
--server-encryption-options keyfile="$tmp_dir/scylla.key" \
--server-encryption-options certificate="$tmp_dir/scylla.crt" \
--auto-snapshot 0 \
--skip-wait-for-gossip-to-settle 0 \
>"$tmp_dir/log" 2>&1 &
SCYLLA_PROCESS=$!
# Set up the proper authentication credentials needed by the Alternator
# test. This requires connecting to Scylla with cqlsh - we'll wait up to
# test. This requires connecting to Scylla with CQL - we'll wait up to
# one minute for this to work:
setup_authentication() {
python3 -c 'from cassandra.cluster import Cluster; Cluster(["'$SCYLLA_IP'"]).connect().execute("INSERT INTO system_auth.roles (role, salted_hash) VALUES ('\''alternator'\'', '\''secret_pass'\'')")'
}
echo "Scylla is: $SCYLLA."
echo -n "Booting Scylla..."
ok=
SECONDS=0
while ((SECONDS < 100))
while ((SECONDS < 200))
do
sleep 2
sleep 1
echo -n .
if ! kill -0 $SCYLLA_PROCESS 2>/dev/null
then
summary="Error: Scylla failed to boot after $SECONDS seconds."
break
fi
err=`"$CQLSH" -e "INSERT INTO system_auth.roles (role, salted_hash) VALUES ('alternator', 'secret_pass')" 2>&1` && ok=yes && break
err=`setup_authentication 2>&1` && ok=yes && break
case "$err" in
"Connection error:"*)
*NoHostAvailable:*)
# This is what we expect while Scylla is still booting.
;;
*"command not found")
summary="Error: need 'cqlsh' in your path, to configure Alternator authentication."
*ImportError:*|*"command not found"*)
summary="Error: need python3 and python3-cassandra-driver to configure Alternator authentication."
echo
echo $summary
break;;
*)
summary="Unknown cqlsh error, can't set authentication credentials: '$err'"
summary="Unknown error trying to set authentication credentials: '$err'"
echo
echo $summary
break;;
@@ -125,7 +140,8 @@ else
fi
cd "$script_path"
pytest "$@"
set +e
pytest --url $alternator_url "$@"
code=$?
case $code in
0) summary="Alternator tests pass";;


@@ -0,0 +1 @@
type: Run


@@ -305,3 +305,16 @@ def test_batch_get_item_projection_expression(test_table):
got_items = reply['Responses'][test_table.name]
expected_items = [{k: item[k] for k in wanted if k in item} for item in items]
assert multiset(got_items) == multiset(expected_items)
# Test that we return the required UnprocessedKeys/UnprocessedItems parameters
def test_batch_unprocessed(test_table_s):
p = random_string()
write_reply = test_table_s.meta.client.batch_write_item(RequestItems = {
test_table_s.name: [{'PutRequest': {'Item': {'p': p, 'a': 'hi'}}}],
})
assert 'UnprocessedItems' in write_reply and write_reply['UnprocessedItems'] == dict()
read_reply = test_table_s.meta.client.batch_get_item(RequestItems = {
test_table_s.name: {'Keys': [{'p': p}], 'ProjectionExpression': 'p, a', 'ConsistentRead': True}
})
assert 'UnprocessedKeys' in read_reply and read_reply['UnprocessedKeys'] == dict()


@@ -20,6 +20,7 @@
import pytest
import requests
import json
from botocore.exceptions import BotoCoreError, ClientError
def gen_json(n):
@@ -112,3 +113,12 @@ def test_incorrect_json(dynamodb, test_table):
req = get_signed_request(dynamodb, 'PutItem', incorrect_req)
response = requests.post(req.url, headers=req.headers, data=req.body, verify=False)
assert validate_resp(response.text)
# Test that the value returned by PutItem is always a JSON object, not an empty string (see #6568)
def test_put_item_return_type(dynamodb, test_table):
payload = '{"TableName": "' + test_table.name + '", "Item": {"p": {"S": "x"}, "c": {"S": "x"}}}'
req = get_signed_request(dynamodb, 'PutItem', payload)
response = requests.post(req.url, headers=req.headers, data=req.body, verify=False)
assert response.text
# json.loads raises an exception on invalid input
json.loads(response.text)


@@ -100,6 +100,14 @@ def test_query_basic_restrictions(dynamodb, filled_test_table):
print(got_items)
assert multiset([item for item in items if item['p'] == 'long' and item['c'].startswith('11')]) == multiset(got_items)
def test_query_nonexistent_table(dynamodb):
client = dynamodb.meta.client
with pytest.raises(ClientError, match="ResourceNotFoundException"):
client.query(TableName="i_do_not_exist", KeyConditions={
'p' : {'AttributeValueList': ['long'], 'ComparisonOperator': 'EQ'},
'c' : {'AttributeValueList': ['11'], 'ComparisonOperator': 'BEGINS_WITH'}
})
def test_begins_with(dynamodb, test_table):
paginator = dynamodb.meta.client.get_paginator('query')
items = [{'p': 'unorthodox_chars', 'c': sort_key, 'str': 'a'} for sort_key in [u'ÿÿÿ', u'cÿbÿ', u'cÿbÿÿabg'] ]
@@ -451,7 +459,6 @@ def test_query_limit_paging(test_table_sn):
# return items sorted in reverse order. Combining this with Limit can
# be used to return the last items instead of the first items of the
# partition.
@pytest.mark.xfail(reason="ScanIndexForward not supported yet")
def test_query_reverse(test_table_sn):
numbers = [Decimal(i) for i in range(20)]
# Insert these numbers, in random order, into one partition:
@@ -486,7 +493,6 @@ def test_query_reverse(test_table_sn):
# Test that paging also works properly with reverse order
# (ScanIndexForward=false), i.e., reverse-order queries can be resumed
@pytest.mark.xfail(reason="ScanIndexForward not supported yet")
def test_query_reverse_paging(test_table_sn):
numbers = [Decimal(i) for i in range(20)]
# Insert these numbers, in random order, into one partition:


@@ -42,6 +42,11 @@ def test_scan_basic(filled_test_table):
assert len(items) == len(got_items)
assert multiset(items) == multiset(got_items)
def test_scan_nonexistent_table(dynamodb):
client = dynamodb.meta.client
with pytest.raises(ClientError, match="ResourceNotFoundException"):
client.scan(TableName="i_do_not_exist")
def test_scan_with_paginator(dynamodb, filled_test_table):
test_table, items = filled_test_table
paginator = dynamodb.meta.client.get_paginator('scan')
@@ -239,7 +244,6 @@ def test_scan_select(filled_test_table):
# a scan into multiple parts, and that these parts are in fact disjoint,
# and their union is the entire contents of the table. We do not actually
# try to run these queries in *parallel* in this test.
@pytest.mark.xfail(reason="parallel scan not supported yet")
def test_scan_parallel(filled_test_table):
test_table, items = filled_test_table
for nsegments in [1, 2, 17]:
@@ -250,3 +254,14 @@ def test_scan_parallel(filled_test_table):
# The following comparison verifies that each of the expected items
# in items was returned in one - and just one - of the segments.
assert multiset(items) == multiset(got_items)
# Test correct handling of incorrect parallel scan parameters.
# Most of the corner cases (like TotalSegments=0) are validated
# by boto3 itself, but some checks can still be performed.
def test_scan_parallel_incorrect(filled_test_table):
test_table, items = filled_test_table
with pytest.raises(ClientError, match='ValidationException.*Segment'):
full_scan(test_table, TotalSegments=1000001, Segment=0)
for segment in [7, 9]:
with pytest.raises(ClientError, match='ValidationException.*Segment'):
full_scan(test_table, TotalSegments=5, Segment=segment)
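The parameter checks exercised by `test_scan_parallel_incorrect` can be sketched as a standalone predicate. It assumes the DynamoDB limits of 1 to 1000000 for `TotalSegments` and 0 to `TotalSegments - 1` for `Segment`; the function name is hypothetical and this is not Alternator's actual validation code.

```python
MAX_TOTAL_SEGMENTS = 1000000

def is_valid_parallel_scan(segment, total_segments):
    # TotalSegments must be in [1, 1000000].
    if not 1 <= total_segments <= MAX_TOTAL_SEGMENTS:
        return False
    # Segment is a zero-based index below TotalSegments.
    return 0 <= segment < total_segments

print(is_valid_parallel_scan(0, 1000001))  # False: too many segments
print(is_valid_parallel_scan(7, 5))        # False: segment out of range
print(is_valid_parallel_scan(4, 5))        # True
```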


@@ -244,11 +244,12 @@ def test_table_streams_off(dynamodb):
table.delete();
# DynamoDB doesn't allow StreamSpecification to be empty map - if it
# exists, it must have a StreamEnabled
with pytest.raises(ClientError, match='ValidationException'):
table = create_test_table(dynamodb, StreamSpecification={},
KeySchema=[{ 'AttributeName': 'p', 'KeyType': 'HASH' }],
AttributeDefinitions=[{ 'AttributeName': 'p', 'AttributeType': 'S' }]);
table.delete();
# Unfortunately, new versions of boto3 don't let us pass this...
#with pytest.raises(ClientError, match='ValidationException'):
# table = create_test_table(dynamodb, StreamSpecification={},
# KeySchema=[{ 'AttributeName': 'p', 'KeyType': 'HASH' }],
# AttributeDefinitions=[{ 'AttributeName': 'p', 'AttributeType': 'S' }]);
# table.delete();
# Unfortunately, boto3 doesn't allow us to pass StreamSpecification=None.
# This is what we had in issue #5796.


@@ -132,6 +132,13 @@ BOOST_AUTO_TEST_CASE(test_big_decimal_div) {
test_div("-0.25", 10, "-0.02");
test_div("-0.26", 10, "-0.03");
test_div("-10E10", 3, "-3E10");
// Document a small oddity, 1e1 has -1 decimal places, so dividing
// it by 2 produces 0. This is not the behavior in cassandra, but
// scylla doesn't expose arithmetic operations, so this doesn't
// seem to be visible from CQL.
test_div("10", 2, "5");
test_div("1e1", 2, "0e1");
}
BOOST_AUTO_TEST_CASE(test_big_decimal_assignadd) {
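The oddity documented in the test above can be reproduced with a small model of a decimal as an (unscaled value, scale) pair, where the value is `unscaled * 10**(-scale)`. The sketch assumes division acts on the unscaled integer and keeps the scale, with round-half-to-even; that rounding mode is an assumption chosen because it agrees with the `test_div` cases shown in this hunk, not something the diff states.

```python
from fractions import Fraction

def div(unscaled, scale, divisor):
    # Divide only the unscaled integer; the scale is left untouched.
    # round() on a Fraction rounds halves to the nearest even integer.
    return round(Fraction(unscaled, divisor)), scale

print(div(10, 0, 2))   # (5, 0)   "10"  / 2 -> "5"
print(div(1, -1, 2))   # (0, -1)  "1e1" / 2 -> "0e1"
print(div(-25, 2, 10)) # (-2, 2)  "-0.25" / 10 -> "-0.02"
```

Because `1e1` is stored as unscaled value 1 with scale -1, the quotient 0.5 rounds to 0 before the scale is applied, which gives `0e1` rather than 5.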


@@ -142,6 +142,19 @@ SEASTAR_TEST_CASE(test_decimal_to_bigint) {
});
}
SEASTAR_TEST_CASE(test_decimal_to_float) {
return do_with_cql_env_thread([&](auto& e) {
e.execute_cql("CREATE TABLE test (key text primary key, value decimal)").get();
e.execute_cql("INSERT INTO test (key, value) VALUES ('k1', 10)").get();
e.execute_cql("INSERT INTO test (key, value) VALUES ('k2', 1e1)").get();
auto v = e.execute_cql("SELECT key, CAST(value as float) from test").get0();
assert_that(v).is_rows().with_rows_ignore_order({
{{serialized("k1")}, {serialized(float(10))}},
{{serialized("k2")}, {serialized(float(10))}},
});
});
}
SEASTAR_TEST_CASE(test_varint_to_bigint) {
return do_with_cql_env_thread([&](auto& e) {
e.execute_cql("CREATE TABLE test (key text primary key, value varint)").get();

Some files were not shown because too many files have changed in this diff