Compare commits

38 Commits

Author SHA1 Message Date
Yaron Kaikov
4ae9a56466 release: prepare for 4.0.11 2020-10-26 18:12:47 +02:00
Avi Kivity
0374c1d040 Update seastar submodule
* seastar 065a40b34a...748428930a (1):
  > append_challenged_posix_file_impl: allow destructing file with no queued work

Fixes #7285.
2020-10-19 15:06:24 +03:00
Botond Dénes
9cb0fe3b33 reader_permit: reader_resources: make true RAII class
Currently in all cases we first deduct the to-be-consumed resources,
then construct the `reader_resources` class to protect it (release it on
destruction). This is error prone as it relies on no exception being
thrown while constructing the `reader_resources`. Albeit the
`reader_resources` constructor is `noexcept` right now this might change
in the future and as the call sites relying on this are disconnected
from the declaration, the one modifying them might not notice.
To make this safe going forward, make the `reader_resources` a true RAII
class, consuming the units in its constructor and releasing them in its
destructor.
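
A minimal sketch of the RAII pattern described above, assuming a hypothetical
`resource_pool` in place of the real semaphore and a hypothetical
`resource_units` guard (neither is Scylla's actual class). Because the units
are consumed inside the constructor, a failure to consume means no object
exists and nothing can be released by mistake:

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <utility>

// Hypothetical stand-in for the semaphore that tracks available resources.
class resource_pool {
    std::ptrdiff_t _available;
public:
    explicit resource_pool(std::ptrdiff_t total) : _available(total) {}
    void consume(std::ptrdiff_t n) {
        if (n > _available) {
            throw std::runtime_error("not enough resources");
        }
        _available -= n;
    }
    void signal(std::ptrdiff_t n) noexcept { _available += n; }
    std::ptrdiff_t available() const noexcept { return _available; }
};

// Hypothetical RAII guard: consumes in the constructor, releases in the
// destructor, so there is no window between "deduct" and "protect".
class resource_units {
    resource_pool* _pool;
    std::ptrdiff_t _n;
public:
    resource_units(resource_pool& pool, std::ptrdiff_t n) : _pool(&pool), _n(n) {
        _pool->consume(n); // if this throws, the destructor never runs
    }
    resource_units(resource_units&& o) noexcept
        : _pool(o._pool), _n(std::exchange(o._n, 0)) {}
    resource_units(const resource_units&) = delete;
    resource_units& operator=(const resource_units&) = delete;
    resource_units& operator=(resource_units&&) = delete;
    ~resource_units() {
        if (_n) {
            _pool->signal(_n); // a moved-from guard holds 0 units
        }
    }
};
```

Call sites then only construct the guard; they never deduct resources by hand.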

Refs: #7256

Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200922150625.1253798-1-bdenes@scylladb.com>
(cherry picked from commit a0107ba1c6)
Message-Id: <20200924081408.236353-1-bdenes@scylladb.com>
2020-10-19 15:05:13 +03:00
Takuya ASADA
a813ff4da2 install.sh: set LC_ALL=en_US.UTF-8 on python3 thunk
scylla-python3 segfaults when a non-default locale is specified.
As a workaround, we need to set LC_ALL=en_US.UTF-8 on the python3 thunk.

Fixes #7408

Closes #7414

(cherry picked from commit ff129ee030)
2020-10-18 15:03:04 +03:00
Avi Kivity
d5936147f4 Merge "materialized views: Fix undefined behavior on base table schema changes" from Tomasz
"
The view_info object, which is attached to the schema object of the
view, contains a data structure called
"base_non_pk_columns_in_view_pk". This data structure contains column
ids of the base table so is valid only for a particular version of the
base table schema. This data structure is used by materialized view
code to interpret mutations of the base table, those coming from base
table writes, or reads of the base table done as part of view updates
or view building.

The base table schema version of that data structure must match the
schema version of the mutation fragments, otherwise we hit undefined
behavior. This may include aborts, exceptions, segfaults, or data
corruption (e.g. writes landing in the wrong column in the view).

Before this patch, we could get schema version mismatch here after the
base table was altered. That's because the view schema did not change
when the base table was altered.

Another problem was that view building was using the current table's schema
to interpret the fragments and invoke view building. That's incorrect for two
reasons. First, fragments generated by a reader must be accessed only using
the reader's schema. Second, base_non_pk_columns_in_view_pk of the recorded
view ptrs may no longer match the current base table schema, which is used
to generate the view updates.

Part of the fix is to extract base_non_pk_columns_in_view_pk into a
third entity called base_dependent_view_info, which changes both on
base table schema changes and view schema changes.

It is managed by a shared pointer so that we can take immutable
snapshots of it, just like with schema_ptr. When starting the view
update, the base table schema_ptr and the corresponding
base_dependent_view_info have to match. So we must obtain them
atomically, and base_dependent_view_info cannot change during update.

Also, whenever the base table schema changes, we must update
base_dependent_view_infos of all attached views (atomically) so that
it matches the base table schema.

Fixes #7061.

Tests:

  - unit (dev)
  - [v1] manual (reproduced using scylla binary and cqlsh)
"

* tag 'mv-schema-mismatch-fix-v2' of github.com:tgrabiec/scylla:
  db: view: Refactor view_info::initialize_base_dependent_fields()
  tests: mv: Test dropping columns from base table
  db: view: Fix incorrect schema access during view building after base table schema changes
  schema: Call on_internal_error() when out of range id is passed to column_at()
  db: views: Fix undefined behavior on base table schema changes
  db: views: Introduce has_base_non_pk_columns_in_view_pk()

(cherry picked from commit 3daa49f098)
2020-10-06 17:12:28 +03:00
Juliusz Stasiewicz
a3d3b4e185 tracing: Fix error on slow batches
`trace_keyspace_helper::make_slow_query_mutation_data` expected a
"query" key in its parameters, which does not appear in case of
e.g. batches of prepared statements. This is an example of a failing
`record.parameters`:
```
...{"query[0]" : "INSERT INTO ks.tbl (pk, i) values (?, ?);"},
{"query[1]" : "INSERT INTO ks.tbl (pk, i) values (?, ?);"}...
```

In such case Scylla recorded no trace and said:
```
ERROR 2020-09-28 10:09:36,696 [shard 3] trace_keyspace_helper - No
"query" parameter set for a session requesting a slow_query_log record
```

Fix here is to leave query empty if not found. The users can still
retrieve the query contents from existing info.
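
A minimal sketch of the "leave it empty if not found" lookup, assuming the
parameters are held in a plain string map; `query_or_empty` is a hypothetical
helper name, not the function touched by this patch:

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical helper: return the "query" parameter if present, otherwise
// an empty string instead of treating the missing key as an error.
inline std::string query_or_empty(const std::map<std::string, std::string>& params) {
    auto it = params.find("query");
    return it != params.end() ? it->second : std::string();
}
```

With batch parameters such as `query[0]` and `query[1]`, the lookup yields an
empty string and the slow-query record can still be written.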

Fixes #5843

Closes #7293

(cherry picked from commit 0afa738a8f)
2020-10-04 18:05:00 +03:00
Tomasz Grabiec
4ca2576c98 Merge "evictable_reader: validate buffer on reader recreation" from Botond
This series backports the evictable reader validation patchset (merged
as 97c99ea9f to master) to 4.1.

I only had to do changes to the tests.

Tests: unit(dev), some exception safety tests are failing with or
without my patchset

Fixes: #7208

* https://github.com/denesb/scylla.git denesb/evictable-reader-validate-buffer/backport-4.1:
  mutation_reader_test: add unit test for evictable reader self-validation
  evictable_reader: validate buffer after recreation the underlying
  evictable_reader: update_next_position(): only use peek'd position on partition boundary
  mutation_reader_test: add unit test for evictable reader range tombstone trimming
  evictable_reader: trim range tombstones to the read clustering range
  position_in_partition_view: add position_in_partition_view before_key() overload
  flat_mutation_reader: add buffer() accessor

(cherry picked from commit 7f3ffbc1c8)
2020-10-02 11:52:57 +02:00
Tomasz Grabiec
e99a0c7b89 schema: Fix race in schema version recalculation leading to stale schema version in gossip
Migration manager installs several feature change listeners:

    if (this_shard_id() == 0) {
        _feature_listeners.push_back(_feat.cluster_supports_view_virtual_columns().when_enabled(update_schema));
        _feature_listeners.push_back(_feat.cluster_supports_digest_insensitive_to_expiry().when_enabled(update_schema));
        _feature_listeners.push_back(_feat.cluster_supports_cdc().when_enabled(update_schema));
        _feature_listeners.push_back(_feat.cluster_supports_per_table_partitioners().when_enabled(update_schema));
    }

They will call update_schema_version_and_announce() when features are enabled, which does this:

    return update_schema_version(proxy, features).then([] (utils::UUID uuid) {
        return announce_schema_version(uuid);
    });

So it first updates the schema version and then publishes it via
gossip in announce_schema_version(). It is possible that the
announce_schema_version() part of the first schema change will be
deferred and will execute after the other four calls to
update_schema_version_and_announce(). It will install the old schema
version in gossip instead of the more recent one.

The fix is to serialize schema digest calculation and publishing.

Fixes #7200

(cherry picked from commit 1a57d641d1)
2020-10-01 18:18:53 +02:00
Yaron Kaikov
f8c7605657 release: prepare for 4.0.10 2020-09-28 20:33:24 +03:00
Avi Kivity
7b9e33dcd4 Update seastar submodule
* seastar e87ce4941c...065a40b34a (1):
  > lz4_fragmented_compressor: Fix buffer requirements

Fixes #6925.
2020-09-23 12:07:11 +03:00
Yaron Kaikov
d86a31097a release: prepare for 4.0.9 2020-09-17 14:24:32 +03:00
Nadav Har'El
bd9d6f8e45 alternator: fix corruption of PutItem operation in case of contention
This patch fixes a bug noted in issue #7218 - where PutItem operations
sometimes lose part of the item's data - some attributes were lost,
and the name of other attributes replaced by empty strings. The problem
happened when the write-isolation policy was LWT and there was contention
of writes to the same partition (not necessarily the same item).

To use CAS (a.k.a. LWT), Alternator builds an alternator::rmw_operation
object with an apply() function which takes the old contents of the item
(if needed) and a timestamp, and builds a mutation that the CAS should
apply. In the case of the PutItem operation, we wrongly assumed that apply()
will be called only once - so as an optimization the strings saved in the
put_item_operation were moved into the returned mutation. But this
optimization is wrong - when there is contention, apply() may be called
again when the change proposed by the previous one was not accepted by
the Paxos protocol.

The fix is to change the one place where put_item_operation *moved* strings
out of the saved operations into the mutations, to be a copy. But to prevent
this sort of bug from recurring in future code, this patch enlists the
compiler to help us verify that it can't happen: The apply() function is
marked "const" - it can use the information in the operation to build the
mutation, but it can never modify this information or move things out of it,
so it will be fine to call this function twice.

The single output field that apply() does write (_return_attributes) is
marked "mutable" to allow the const apply() to write to it anyway. Because
apply() might be called twice, it is important that if some apply()
implementation sometimes sets _return_attributes, then it must always
set it (even if to the default, empty, value) on every call to apply().
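
A minimal sketch of why a const apply() is safe to call twice, using a
hypothetical, simplified `put_operation` (not the real
alternator::rmw_operation). Inside a const member function the saved string
is seen as const, so it can only be copied; the static_assert shows that a
const rvalue cannot bind to a sink that takes only `std::string&&`:

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical, simplified operation whose apply() may run more than once
// when there is contention.
class put_operation {
    std::string _attribute;                  // saved input, must survive every call
    mutable std::string _return_attributes;  // the one output field apply() may write
public:
    explicit put_operation(std::string attribute) : _attribute(std::move(attribute)) {}

    // const: _attribute is a `const std::string` here, so it is copied.
    std::string apply() const {
        _return_attributes.clear();          // always (re)set the output
        return _attribute;                   // copy, never a move
    }
};

// std::move() of a const member yields `const std::string&&`, which cannot
// bind to a parameter of type `std::string&&`.
static_assert(!std::is_constructible<std::string&&, const std::string&&>::value,
              "a const rvalue must not bind to a non-const rvalue reference");
```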

The const apply() means that the compiler verifies for us that I didn't
forget to fix additional wrong std::move()s. Additionally, a test I wrote
to easily reproduce issue #7218 (which I will submit as a dtest later)
passes after this fix.

Fixes #7218.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200916064906.333420-1-nyh@scylladb.com>
(cherry picked from commit 5e8bdf6877)
2020-09-16 23:05:23 +03:00
Benny Halevy
11ef23e97a test: cql_query_test: test_cache_bypass: use table stats
test is currently flaky since system reads can happen
in the background and disturb the global row cache stats.

Use the table's row_cache stats instead.

Fixes #6773

Test: cql_query_test.test_cache_bypass(dev, debug)

Credit-to: Botond Dénes <bdenes@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200811140521.421813-1-bhalevy@scylladb.com>
(cherry picked from commit 6deba1d0b4)
2020-09-16 18:20:30 +03:00
Asias He
2c0eac09ae migration_manager: Make sync_schema return error when node is down
sync_schema is supposed to make sure that this node knows about all
schema changes known by "nodes" that were made prior to this call.

Currently, when a node is down, the sync is silently skipped.

To fix, add a flag to migration_task::run_may_throw to indicate that it
should fail if a node is down.

Fixes #4791

(cherry picked from commit 7ba821cbc0)
2020-09-16 16:01:44 +03:00
Dejan Mircevski
713a7269d0 cql3: Fix NULL reference in get_column_defs_for_filtering
There was a typo in get_column_defs_for_filtering(): it checked the
wrong pointer before dereferencing.  Add a test exposing the NULL
dereference and fix the typo.

Tests: unit (dev)

Fixes #7198.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
(cherry picked from commit 9d02f10c71)
2020-09-16 15:47:09 +03:00
Avi Kivity
1724301d4d reconcilable_result_builder: don't aggrevate out-of-memory condition during recovery
Consider an unpaged query that consumes all of available memory, despite
fea5067dfa which limits them (perhaps the
user raised the limit, or this is a system query). Eventually we will see a
bad_alloc which will abort the query and destroy this reconcilable_result_builder.

During destruction, we first destroy _memory_accounter, and then _result.
Destroying _memory_accounter resumes some continuations which can then
allocate memory synchronously when increasing the task queue to accommodate
them. We will then crash. Had we not crashed, we would immediately afterwards
release _result, freeing all the memory that we would ever need.

Fix by making _result the last member, so it is freed first.
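
The fix relies on the C++ rule that non-static data members are destroyed in
reverse order of declaration. A minimal sketch with hypothetical `tracer` and
`builder` types (illustration only, not the real reconcilable_result_builder):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Records its name when destroyed.
struct tracer {
    std::string name;
    std::vector<std::string>* log;
    ~tracer() { log->push_back(name); }
};

// Hypothetical builder: `result` is declared last, so it is destroyed first.
struct builder {
    tracer memory_accounter;
    tracer result;
    explicit builder(std::vector<std::string>& log)
        : memory_accounter{"memory_accounter", &log}
        , result{"result", &log} {}
};

// Returns the order in which the members of a builder were destroyed.
inline std::vector<std::string> destruction_order() {
    std::vector<std::string> log;
    {
        builder b(log);
    }
    return log;
}
```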

Fixes #7240.

(cherry picked from commit 9421cfded4)
2020-09-16 15:41:10 +03:00
Avi Kivity
9971f2f5db Merge "Fix repair stalls in get_sync_boundary and apply_rows_on_master_in_thread" from Asias
"
This patch set fixes stalls in repair that are caused by std::list merge and clear operations during test_latency_read_with_nemesis test.

Fixes #6940
Fixes #6975
Fixes #6976
"

* 'fix_repair_list_stall_merge_clear_v2' of github.com:asias/scylla:
  repair: Fix stall in apply_rows_on_master_in_thread and apply_rows_on_follower
  repair: Use clear_gently in get_sync_boundary to avoid stall
  utils: Add clear_gently
  repair: Use merge_to_gently to merge two lists
  utils: Add merge_to_gently

(cherry picked from commit 4547949420)
2020-09-10 13:15:01 +03:00
Avi Kivity
ee328c22ca repair: apply_rows_on_follower(): remove copy of repair_rows list
We copy a list, which was reported to generate a 15ms stall.

This is easily fixed by moving it instead, which is safe since this is
the last use of the variable.
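
A minimal sketch of the difference, assuming a hypothetical `row` type that
counts copies and a hypothetical `apply_rows` consumer that takes the list by
value. Passing an lvalue copies every element; passing `std::move(rows)` only
transfers ownership of the list nodes:

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <utility>

// Hypothetical element type that counts how often it is copied.
struct row {
    static int copies;
    row() = default;
    row(const row&) { ++copies; }
    row(row&&) noexcept = default;
};
int row::copies = 0;

// Hypothetical consumer taking the list by value.
inline std::size_t apply_rows(std::list<row> rows) {
    return rows.size();
}
```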

Fixes #7115.

(cherry picked from commit 6ff12b7f79)
2020-09-10 11:53:55 +03:00
Avi Kivity
3a9c9a8a12 Update seastar submodule
* seastar 861b7edd61...e87ce4941c (1):
  > core/reactor: complete_timers(): restore previous scheduling group

Fixes #7184.
2020-09-07 11:28:55 +03:00
Raphael S. Carvalho
c03445871a compaction: Prevent non-regular compaction from picking compacting SSTables
After 8014c7124, cleanup can potentially pick a compacting SSTable.
Upgrade and scrub can also pick a compacting SSTable.
The problem is that table::candidates_for_compaction() was badly named.
It misleads the user into thinking that the SSTables returned are perfect
candidates for compaction, but the manager still needs to filter out the
compacting SSTables from the returned set. So it's being renamed.

When the same SSTable is compacted in parallel, the strategy invariant
can be broken (for example, overlapping SSTables can be introduced in LCS),
and deletions can fail because more than one compaction process would try
to delete the same files.

Let's fix scrub, cleanup and upgrade by calling the manager function
which gets the correct candidates for compaction.

Fixes #6938.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200811200135.25421-1-raphaelsc@scylladb.com>
(cherry picked from commit 11df96718a)
2020-09-06 18:41:12 +03:00
Takuya ASADA
565ac1b092 aws: update enhanced networking supported instance list
Sync enhanced networking supported instance list to latest one.

Reference: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/enhanced-networking.html

Fixes #6991

(cherry picked from commit 7cccb018b8)
2020-09-06 18:21:46 +03:00
Yaron Kaikov
7d1180b98f release: prepare for 4.0.8 2020-08-30 09:42:34 +03:00
Piotr Sarna
f258e6f6ee Merge 'counters: Fix filtering of counters' from Juliusz
Queries with `ALLOW FILTERING` and constraints on counter
values used to be rejected as "unimplemented". The reason
was a missing tri-comparator, which is added in this patch.

Fixes #5635

* jul-stas-5635-filtering-on-counters:
  cql/tests: Added test for filtering on counter columns
  counters: add comparator and remove `unimplemented` from restrictions

(cherry picked from commit c32faee657)
2020-08-27 18:42:30 +03:00
Avi Kivity
2708b0d664 Merge "repair: row_level: prevent deadlocks when repairing homogenous nodes" from Botond
"
This series backports the series "repair: row_level: prevent deadlocks
when repairing homogenous nodes" (merged as a9c7a1a86) to branch-4.1.
"

Fixes #6272

* 'repair-row-level-evictable-local-reader/branch-4.1' of https://github.com/denesb/scylla:
  repair: row_level: destroy reader on EOS or error
  repair: row_level: use evictable_reader for local reads
  mutation_reader: expose evictable_reader
  mutation_reader: evictable_reader: add auto_pause flag
  mutation_reader: make evictable_reader a flat_mutation_reader
  mutation_reader: s/inactive_shard_read/inactive_evictable_reader/
  mutation_reader: move inactive_shard_reader code up
  mutation_reader: fix indentation
  mutation_reader: shard_reader: extract remote_reader as evictable_reader
  mutation_reader: reader_lifecycle_policy: make semaphore() available early

(cherry picked from commit 59aa1834a7)
2020-08-27 17:44:27 +03:00
Asias He
e31ffbf2e6 compaction_manager: Avoid stall in perform_cleanup
The following stall was seen during a cleanup operation:

scylla: Reactor stalled for 16262 ms on shard 4.

| std::_MakeUniq<locator::tokens_iterator_impl>::__single_object std::make_unique<locator::tokens_iterator_impl, locator::tokens_iterator_impl&>(locator::tokens_iterator_impl&) at /usr/include/fmt/format.h:1158
|  (inlined by) locator::token_metadata::tokens_iterator::tokens_iterator(locator::token_metadata::tokens_iterator const&) at ./locator/token_metadata.cc:1602
| locator::simple_strategy::calculate_natural_endpoints(dht::token const&, locator::token_metadata&) const at simple_strategy.cc:?
|  (inlined by) locator::simple_strategy::calculate_natural_endpoints(dht::token const&, locator::token_metadata&) const at ./locator/simple_strategy.cc:56
| locator::abstract_replication_strategy::get_ranges(gms::inet_address, locator::token_metadata&) const at /usr/include/fmt/format.h:1158
| locator::abstract_replication_strategy::get_ranges(gms::inet_address) const at /usr/include/fmt/format.h:1158
| service::storage_service::get_ranges_for_endpoint(seastar::basic_sstring<char, unsigned int, 15u, true> const&, gms::inet_address const&) const at /usr/include/fmt/format.h:1158
| service::storage_service::get_local_ranges(seastar::basic_sstring<char, unsigned int, 15u, true> const&) const at /usr/include/fmt/format.h:1158
|  (inlined by) operator() at ./sstables/compaction_manager.cc:691
|  (inlined by) _M_invoke at /usr/include/c++/9/bits/std_function.h:286
| std::function<std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable> > > (table const&)>::operator()(table const&) const at /usr/include/fmt/format.h:1158
|  (inlined by) compaction_manager::rewrite_sstables(table*, sstables::compaction_options, std::function<std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable> > > (table const&)>) at ./sstables/compaction_manager.cc:604
| compaction_manager::perform_cleanup(table*) at /usr/include/fmt/format.h:1158

To fix, we futurize the function to get local ranges and sstables.

In addition, this patch removes the dependency on the global storage_service object.

Fixes #6662

(cherry picked from commit 07e253542d)
2020-08-27 13:11:39 +03:00
Raphael S. Carvalho
801994e299 sstables: optimize procedure that checks if a sstable needs cleanup
needs_cleanup() returns true if a sstable needs cleanup.

Turns out it's very slow because it iterates through all the local
ranges for all sstables in the set, making its complexity:
	O(num_sstables * local_ranges)

We can optimize it by taking into account that abstract_replication_strategy
documents that get_ranges() will return a list of ranges that is sorted
and non-overlapping. Compaction for cleanup already takes advantage of that
when checking if a given partition can be actually purged.

So needs_cleanup() can be optimized into O(num_sstables * log(local_ranges)).

With num_sstables=1000, RF=3, then local_ranges=256(num_tokens)*3, it means
the max # of checks performed will go from 768000 to ~9584.
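
A minimal sketch of the binary-search idea, assuming simplified closed integer
token ranges; `token_range` and this `needs_cleanup` signature are
hypothetical and only illustrate the lookup. It relies on the owned ranges
being sorted and non-overlapping, as stated above:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical simplified closed token range [start, end].
struct token_range {
    std::int64_t start;
    std::int64_t end;
};

// An sstable covering [first, last] needs cleanup unless it is fully
// contained in one owned range. `owned` must be sorted and non-overlapping.
inline bool needs_cleanup(std::int64_t first, std::int64_t last,
                          const std::vector<token_range>& owned) {
    // First range whose end is >= first: the only one that can contain `first`.
    auto it = std::lower_bound(owned.begin(), owned.end(), first,
        [](const token_range& r, std::int64_t token) { return r.end < token; });
    if (it == owned.end()) {
        return true;
    }
    return !(it->start <= first && last <= it->end);
}
```

Each sstable costs one O(log(local_ranges)) lookup instead of a scan over all
ranges.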

Fixes #6730.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200629171355.45118-2-raphaelsc@scylladb.com>
(cherry picked from commit cf352e7c14)
2020-08-27 13:11:37 +03:00
Raphael S. Carvalho
3b932078bf sstables: export needs_cleanup()
May be needed elsewhere, like in a unit test.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200629171355.45118-1-raphaelsc@scylladb.com>
(cherry picked from commit a9eebdc778)
2020-08-27 13:11:24 +03:00
Asias He
608f62a0e9 abstract_replication_strategy: Add get_ranges_in_thread
Add a version that runs inside a seastar thread. The benefit is that
get_ranges can yield to avoid stalls.

Refs #6662

(cherry picked from commit 94995acedb)
2020-08-27 13:10:32 +03:00
Asias He
d8619d3320 abstract_replication_strategy: Add get_ranges which takes token_metadata
It is useful when the caller wants to calculate ranges using a
custom token_metadata.

It will be used soon in do_rebuild_replace_with_repair for replace
operation.

Refs: #5482
(cherry picked from commit b640614aa6)
2020-08-27 13:10:26 +03:00
Asias He
4f0c99a187 gossip: Fix race between shutdown message handler and apply_state_locally
1. The node1 is shutdown
2. The node1 sends shutdown message to node2
3. The node2 receives gossip shutdown message but the handler yields
4. The node1 is restarted
5. The node1 sends new gossip endpoint_state to node2, node2 applies the state
   in apply_state_locally and calls gossiper::handle_major_state_change
   and then calls gossiper::mark_alive
6. The shutdown message handler in step 3 resumes and sets status of node1 to SHUTDOWN
7. The gossiper::mark_alive fiber in step 5 resumes and calls gossiper::real_mark_alive,
   node2 will skip to mark node1 as alive because the status of node1 is
   SHUTDOWN. As a result, node1 is alive but it is not marked as UP by node2.

To fix, we serialize the two operations.

Fixes #7032

(cherry picked from commit e6ceec1685)
2020-08-27 11:16:10 +03:00
Nadav Har'El
ada79df082 alternator test: configurable temporary directory
The test/alternator/run script creates a temporary directory for the Scylla
database in /tmp. The assumption was that this is the fastest disk (usually
even a ramdisk) on the test machine, and we didn't need anything else from
it.

But it turns out that on some systems, /tmp is actually a slow disk, so
this patch adds a way to configure the temporary directory - if the TMPDIR
environment variable exists, it is used instead of /tmp. As before this
patch, a temporary subdirectory is created in $TMPDIR, and this subdirectory
is automatically deleted when the test ends.

The test.py script already passes an appropriate TMPDIR (testlog/$mode),
which after this patch the Alternator test will use instead of /tmp.

Fixes #6750

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200713193023.788634-1-nyh@scylladb.com>
(cherry picked from commit 8e3be5e7d6)
2020-08-26 19:48:45 +03:00
Nadav Har'El
1935f2b480 alternator: fix order conditions on binary attributes
We implemented the order operators (LT, GT, LE, GE, BETWEEN) incorrectly
for binary attributes: DynamoDB requires that the bytes be treated as
unsigned for the purpose of order (so byte 128 is higher than 127), but
our implementation uses Scylla's "bytes" type which has signed bytes.

The solution is simple - we can continue to use the "bytes" type, but
we need to use its compare_unsigned() function, not its "<" operator.

This bug affected conditional operations ("Expected" and
"ConditionExpression") and also filters ("QueryFilter", "ScanFilter",
"FilterExpression"). The bug did *not* affect Query's key conditions
("KeyConditions", "KeyConditionExpression") because those already
used Scylla's key comparison functions - which correctly compare binary
blobs as unsigned bytes (in fact, this is why we have the
compare_unsigned() function).
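
A minimal sketch of the signed-versus-unsigned difference, assuming a
hypothetical `bytes` alias over signed chars and a simplified
`compare_unsigned` (Scylla's real types and function differ in detail). With
signed elements, byte 128 is stored as -128 and sorts below 127 under "<",
while an unsigned comparison orders it above:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a byte string with signed elements.
using bytes = std::vector<signed char>;

// Returns <0, 0 or >0, comparing bytes as unsigned values; on a common
// prefix the shorter string sorts first.
inline int compare_unsigned(const bytes& a, const bytes& b) {
    std::size_t n = std::min(a.size(), b.size());
    int r = n ? std::memcmp(a.data(), b.data(), n) : 0; // memcmp uses unsigned char
    if (r != 0) {
        return r;
    }
    return a.size() < b.size() ? -1 : (a.size() > b.size() ? 1 : 0);
}
```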

The patch also adds tests that reproduce the bugs in conditional
operations, and show that the bug did not exist in key conditions.

Fixes #6573

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200603084257.394136-1-nyh@scylladb.com>
(cherry picked from commit f6b1f45d69)
Manually removed tests in test_key_conditions.py which didn't exist in this branch.
2020-08-26 19:28:47 +03:00
Avi Kivity
44a76ed231 Merge "Unregister RPC verbs on stop" from Pavel E
"
There are 5 services, that register their RPC handlers in messaging
service, but only a few of them unregister them on stop.

Unregistering is somewhat critical, not just because it makes the
code look clean, but also because unregistration does wait for the
message processing to complete, thus avoiding use-after-free's in
the handlers.

In particular, several handlers call service::get_schema_for_write()
which, in turn, may end up in service::maybe_sync() calling for
the local migration manager instance. All those handlers' processing
must be waited for before stopping the migration manager.

The set brings the RPC handlers unregistration in sync with the
registration part.

tests: unit (dev)
       dtest (dev: simple_boot_shutdown, repair)
       start-stop by hands (dev)
fixes: #6904
"

* 'br-rpc-unregister-verbs' of https://github.com/xemul/scylla:
  main: Add missing calls to unregister RPC hanlers
  messaging: Add missing per-service unregistering methods
  messaging: Add missing handlers unregistration helpers
  streaming: Do not use db->invoke_on_all in vain
  storage_proxy: Detach rpc unregistration from stop
  main: Shorten call to storage_proxy::init_messaging_service

(cherry picked from commit 01b838e291)
2020-08-26 14:42:40 +03:00
Raphael S. Carvalho
aeb49f4915 cql3/statements: verify that counter column cannot be added into non-counter table
A check, to validate that counter column cannot be added into non-counter table,
is missing for alter table statement. Validation is performed when building new
schema, but it's limited to checking that a schema will not contain both counter
and non-counter columns.

Due to lack of validation, the added counter column could be incorrectly
persisted to the schema, but this results in a crash when setting the new
schema to its table. On restart, it can be confirmed that the schema change
was indeed persisted when describing the table.
This problem is fixed by doing proper validation for the alter table statement,
which consists of making sure a new counter column cannot be added to a
non-counter table.

The test cdc_disallow_cdc_for_counters_test is adjusted because one of its tests
was built on the assumption that counter column can be added into a non-counter
table.

Fixes #7065.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200824155709.34743-1-raphaelsc@scylladb.com>
(cherry picked from commit 1c29f0a43d)
2020-08-25 18:46:01 +03:00
Gleb Natapov
8d6b35ad20 lwt: fix possible leak of "prune" counter
If get_schema_for_read() fails, the "prune" counter will not be decremented.
The patch fixes it by creating the RAII object earlier. It also restores the
release of the mutation in release_mutation(), which was dropped by mistake.

Fixes #6124

Message-Id: <20200405080233.GA22509@scylladb.com>
(cherry picked from commit e5f7ccc4c8)
2020-08-23 19:29:06 +03:00
Takuya ASADA
b123700ebe dist/debian: disable debuginfo compression on .deb
Since older binutils on some distributions is not able to handle
compressed debuginfo generated on Fedora, we need to disable it.
However, the debian packager forces debuginfo compression since
debian/compat = 9, so we have to uncompress it after it is compressed automatically.

Fixes #6982

(cherry picked from commit 75c2362c95)
2020-08-23 19:03:13 +03:00
Botond Dénes
6786b521f9 scylla-gdb.py: find_db(): don't return current shard's database for shard=0
The `shard` parameter of `find_db()` is optional and is defaulted to
`None`. When missing, the current shard's database instance is returned.
The problem is that the if condition checking this uses `not shard`,
which also evaluates to `True` if `shard == 0`, resulting in returning
the current shard's database instance for shard 0. Change the condition
to `shard is None` to avoid this.

Fixes: #7016
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200812091546.1704016-1-bdenes@scylladb.com>
(cherry picked from commit 4cfab59eb1)
2020-08-23 18:56:39 +03:00
Botond Dénes
fda0d1ae8e table: get_sstables_by_partition_key(): don't make a copy of selected sstables
Currently we assign the reference to the vector of selected sstables to
`auto sst`. This makes a copy and we pass this local variable to
`do_for_each()`, which will result in a use-after-free if the latter
defers.
Fix by not making a copy and instead just keep the reference.
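
A minimal sketch of the deduction rule behind this bug, using a hypothetical
`table_like` type: plain `auto` drops the reference and makes a local copy,
while `const auto&` keeps referring to the vector owned by the table, which
outlives the local scope:

```cpp
#include <cassert>
#include <vector>

// Hypothetical owner of a vector, returning it by const reference.
struct table_like {
    std::vector<int> _sstables{1, 2, 3};
    const std::vector<int>& select() const { return _sstables; }
};

// `auto` deduces std::vector<int>: a copy with its own storage that dies
// with the enclosing scope.
inline bool copies_with_auto(const table_like& t) {
    auto sst = t.select();
    return sst.data() != t.select().data();
}

// `const auto&` binds to the table's own vector, so no copy is made.
inline bool aliases_with_auto_ref(const table_like& t) {
    const auto& sst = t.select();
    return &sst == &t.select();
}
```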

Fixes: #7060

Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200818091241.2341332-1-bdenes@scylladb.com>
(cherry picked from commit 78f94ba36a)
2020-08-19 00:02:22 +03:00
64 changed files with 2301 additions and 544 deletions

@@ -1,7 +1,7 @@
 #!/bin/sh
 PRODUCT=scylla
-VERSION=4.0.7
+VERSION=4.0.11
 if test -f version
 then

@@ -365,31 +365,35 @@ bool check_compare(const rjson::value* v1, const rjson::value& v2, const Compara
 struct cmp_lt {
 template <typename T> bool operator()(const T& lhs, const T& rhs) const { return lhs < rhs; }
+// We cannot use the normal comparison operators like "<" on the bytes
+// type, because they treat individual bytes as signed but we need to
+// compare them as *unsigned*. So we need a specialization for bytes.
+bool operator()(const bytes& lhs, const bytes& rhs) const { return compare_unsigned(lhs, rhs) < 0; }
 static constexpr const char* diagnostic = "LT operator";
 };
 struct cmp_le {
-// bytes only has <, so we cannot use <=.
-template <typename T> bool operator()(const T& lhs, const T& rhs) const { return lhs < rhs || lhs == rhs; }
+template <typename T> bool operator()(const T& lhs, const T& rhs) const { return lhs <= rhs; }
+bool operator()(const bytes& lhs, const bytes& rhs) const { return compare_unsigned(lhs, rhs) <= 0; }
 static constexpr const char* diagnostic = "LE operator";
 };
 struct cmp_ge {
-// bytes only has <, so we cannot use >=.
-template <typename T> bool operator()(const T& lhs, const T& rhs) const { return rhs < lhs || lhs == rhs; }
+template <typename T> bool operator()(const T& lhs, const T& rhs) const { return lhs >= rhs; }
+bool operator()(const bytes& lhs, const bytes& rhs) const { return compare_unsigned(lhs, rhs) >= 0; }
 static constexpr const char* diagnostic = "GE operator";
 };
 struct cmp_gt {
-// bytes only has <, so we cannot use >.
-template <typename T> bool operator()(const T& lhs, const T& rhs) const { return rhs < lhs; }
+template <typename T> bool operator()(const T& lhs, const T& rhs) const { return lhs > rhs; }
+bool operator()(const bytes& lhs, const bytes& rhs) const { return compare_unsigned(lhs, rhs) > 0; }
 static constexpr const char* diagnostic = "GT operator";
 };
 // True if v is between lb and ub, inclusive. Throws if lb > ub.
 template <typename T>
 bool check_BETWEEN(const T& v, const T& lb, const T& ub) {
-if (ub < lb) {
+if (cmp_lt()(ub, lb)) {
 throw api_error("ValidationException",
 format("BETWEEN operator requires lower_bound <= upper_bound, but {} > {}", lb, ub));
 }

@@ -887,15 +887,24 @@ class attribute_collector {
 void add(bytes&& name, atomic_cell&& cell) {
 collected.emplace(std::move(name), std::move(cell));
 }
+void add(const bytes& name, atomic_cell&& cell) {
+collected.emplace(name, std::move(cell));
+}
 public:
 attribute_collector() : collected(attrs_type()->get_keys_type()->as_less_comparator()) { }
-void put(bytes&& name, bytes&& val, api::timestamp_type ts) {
-add(std::move(name), atomic_cell::make_live(*bytes_type, ts, std::move(val), atomic_cell::collection_member::yes));
+void put(bytes&& name, const bytes& val, api::timestamp_type ts) {
+add(std::move(name), atomic_cell::make_live(*bytes_type, ts, val, atomic_cell::collection_member::yes));
 }
+void put(const bytes& name, const bytes& val, api::timestamp_type ts) {
+add(name, atomic_cell::make_live(*bytes_type, ts, val, atomic_cell::collection_member::yes));
+}
 void del(bytes&& name, api::timestamp_type ts) {
 add(std::move(name), atomic_cell::make_dead(ts, gc_clock::now()));
 }
+void del(const bytes& name, api::timestamp_type ts) {
+add(name, atomic_cell::make_dead(ts, gc_clock::now()));
+}
 collection_mutation_description to_mut() {
 collection_mutation_description ret;
 for (auto&& e : collected) {
@@ -975,7 +984,7 @@ public:
put_or_delete_item(const rjson::value& item, schema_ptr schema, put_item);
// put_or_delete_item doesn't keep a reference to schema (so it can be
// moved between shards for LWT) so it needs to be given again to build():
mutation build(schema_ptr schema, api::timestamp_type ts);
mutation build(schema_ptr schema, api::timestamp_type ts) const;
const partition_key& pk() const { return _pk; }
const clustering_key& ck() const { return _ck; }
};
@@ -1004,7 +1013,7 @@ put_or_delete_item::put_or_delete_item(const rjson::value& item, schema_ptr sche
}
}
mutation put_or_delete_item::build(schema_ptr schema, api::timestamp_type ts) {
mutation put_or_delete_item::build(schema_ptr schema, api::timestamp_type ts) const {
mutation m(schema, _pk);
// If there's no clustering key, a tombstone should be created directly
// on a partition, not on a clustering row - otherwise it will look like
@@ -1026,7 +1035,7 @@ mutation put_or_delete_item::build(schema_ptr schema, api::timestamp_type ts) {
for (auto& c : *_cells) {
const column_definition* cdef = schema->get_column_definition(c.column_name);
if (!cdef) {
attrs_collector.put(std::move(c.column_name), std::move(c.value), ts);
attrs_collector.put(c.column_name, c.value, ts);
} else {
row.cells().apply(*cdef, atomic_cell::make_live(*cdef->type, ts, std::move(c.value)));
}
@@ -1326,7 +1335,7 @@ public:
check_needs_read_before_write(_condition_expression) ||
_returnvalues == returnvalues::ALL_OLD;
}
virtual std::optional<mutation> apply(std::unique_ptr<rjson::value> previous_item, api::timestamp_type ts) override {
virtual std::optional<mutation> apply(std::unique_ptr<rjson::value> previous_item, api::timestamp_type ts) const override {
std::unordered_set<std::string> used_attribute_values;
std::unordered_set<std::string> used_attribute_names;
if (!verify_expected(_request, previous_item) ||
@@ -1338,6 +1347,7 @@ public:
// efficient than throwing an exception.
return {};
}
_return_attributes = {};
if (_returnvalues == returnvalues::ALL_OLD && previous_item) {
// previous_item is supposed to have been created with
// describe_item(), so has the "Item" attribute:
@@ -1404,7 +1414,7 @@ public:
check_needs_read_before_write(_condition_expression) ||
_returnvalues == returnvalues::ALL_OLD;
}
virtual std::optional<mutation> apply(std::unique_ptr<rjson::value> previous_item, api::timestamp_type ts) override {
virtual std::optional<mutation> apply(std::unique_ptr<rjson::value> previous_item, api::timestamp_type ts) const override {
std::unordered_set<std::string> used_attribute_values;
std::unordered_set<std::string> used_attribute_names;
if (!verify_expected(_request, previous_item) ||
@@ -1416,6 +1426,7 @@ public:
// efficient than throwing an exception.
return {};
}
_return_attributes = {};
if (_returnvalues == returnvalues::ALL_OLD && previous_item) {
rjson::value* item = rjson::find(*previous_item, "Item");
if (item) {
@@ -1499,7 +1510,7 @@ public:
virtual ~put_or_delete_item_cas_request() = default;
virtual std::optional<mutation> apply(query::result& qr, const query::partition_slice& slice, api::timestamp_type ts) override {
std::optional<mutation> ret;
for (put_or_delete_item& mutation_builder : _mutation_builders) {
for (const put_or_delete_item& mutation_builder : _mutation_builders) {
// We assume all these builders have the same partition.
if (ret) {
ret->apply(mutation_builder.build(schema, ts));
@@ -2324,7 +2335,7 @@ public:
update_item_operation(service::storage_proxy& proxy, rjson::value&& request);
virtual ~update_item_operation() = default;
virtual std::optional<mutation> apply(std::unique_ptr<rjson::value> previous_item, api::timestamp_type ts) override;
virtual std::optional<mutation> apply(std::unique_ptr<rjson::value> previous_item, api::timestamp_type ts) const override;
bool needs_read_before_write() const;
};
@@ -2388,7 +2399,7 @@ update_item_operation::needs_read_before_write() const {
}
std::optional<mutation>
update_item_operation::apply(std::unique_ptr<rjson::value> previous_item, api::timestamp_type ts) {
update_item_operation::apply(std::unique_ptr<rjson::value> previous_item, api::timestamp_type ts) const {
std::unordered_set<std::string> used_attribute_values;
std::unordered_set<std::string> used_attribute_names;
if (!verify_expected(_request, previous_item) ||


@@ -83,7 +83,11 @@ protected:
// When _returnvalues != NONE, apply() should store here, in JSON form,
// the values which are to be returned in the "Attributes" field.
// The default null JSON means do not return an Attributes field at all.
rjson::value _return_attributes;
// This field is marked "mutable" so that the const apply() can modify
// it (see explanation below), but note that because apply() may be
// called more than once, if apply() will sometimes set this field it
// must set it (even if just to the default empty value) every time.
mutable rjson::value _return_attributes;
public:
// The constructor of a rmw_operation subclass should parse the request
// and try to discover as many input errors as it can before really
@@ -96,7 +100,12 @@ public:
// conditional expression, apply() should return an empty optional.
// apply() may throw if it encounters input errors not discovered during
// the constructor.
virtual std::optional<mutation> apply(std::unique_ptr<rjson::value> previous_item, api::timestamp_type ts) = 0;
// apply() may be called more than once in case of contention, so it must
// not change the state saved in the object (issue #7218 was caused by
// violating this). We mark apply() "const" to let the compiler validate
// this for us. The output-only field _return_attributes is marked
// "mutable" above so that apply() can still write to it.
virtual std::optional<mutation> apply(std::unique_ptr<rjson::value> previous_item, api::timestamp_type ts) const = 0;
// Convert the above apply() into the signature needed by cas_request:
virtual std::optional<mutation> apply(query::result& qr, const query::partition_slice& slice, api::timestamp_type ts) override;
virtual ~rmw_operation() = default;
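
Note on the hunk above: the comments describe a `const` `apply()` that may run more than once, paired with a `mutable` output field that must be reset on every call. The pattern can be sketched as below, using a hypothetical `operation_sketch` class with plain strings in place of `rjson::value` and `mutation`; none of these names come from the source.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>

// Hypothetical operation type illustrating the pattern; not Scylla's rmw_operation.
class operation_sketch {
    std::string _request;                    // parsed input, never modified by apply()
    mutable std::string _return_attributes;  // output-only, rewritten on every apply()
public:
    explicit operation_sketch(std::string request) : _request(std::move(request)) {}

    // const: the compiler rejects accidental writes to _request, so a retry
    // after contention sees the same input as the first attempt.
    std::optional<std::string> apply(const std::optional<std::string>& previous_item) const {
        _return_attributes.clear();  // reset even when nothing will be returned
        if (previous_item) {
            _return_attributes = *previous_item;
        }
        return _request;  // the result is derived from unchanged state only
    }

    const std::string& return_attributes() const { return _return_attributes; }
};
```

Without the unconditional reset, a second call with no previous item would leave the output of the first call in place, which is the kind of stale state the comment in the hunk warns about.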


@@ -252,8 +252,8 @@ void set_storage_service(http_context& ctx, routes& r) {
for (auto cf : column_families) {
column_families_vec.push_back(&db.find_column_family(keyspace, cf));
}
return parallel_for_each(column_families_vec, [&cm] (column_family* cf) {
return cm.perform_cleanup(cf);
return parallel_for_each(column_families_vec, [&cm, &db] (column_family* cf) {
return cm.perform_cleanup(db, cf);
});
}).then([]{
return make_ready_future<json::json_return_type>(0);


@@ -381,6 +381,7 @@ scylla_tests = set([
'test/boost/view_schema_ckey_test',
'test/boost/vint_serialization_test',
'test/boost/virtual_reader_test',
'test/boost/stall_free_test',
'test/manual/ec2_snitch_test',
'test/manual/gce_snitch_test',
'test/manual/gossip',


@@ -417,7 +417,7 @@ std::vector<const column_definition*> statement_restrictions::get_column_defs_fo
_clustering_columns_restrictions->num_prefix_columns_that_need_not_be_filtered();
for (auto&& cdef : _clustering_columns_restrictions->get_column_defs()) {
::shared_ptr<single_column_restriction> restr;
if (single_pk_restrs) {
if (single_ck_restrs) {
auto it = single_ck_restrs->restrictions().find(cdef);
if (it != single_ck_restrs->restrictions().end()) {
restr = dynamic_pointer_cast<single_column_restriction>(it->second);
@@ -624,9 +624,6 @@ bool single_column_restriction::EQ::is_satisfied_by(const schema& schema,
const row& cells,
const query_options& options,
gc_clock::time_point now) const {
if (_column_def.type->is_counter()) {
fail(unimplemented::cause::COUNTERS);
}
auto operand = value(options);
if (operand) {
auto cell_value = get_value(schema, key, ckey, cells, now);
@@ -641,9 +638,6 @@ bool single_column_restriction::EQ::is_satisfied_by(const schema& schema,
}
bool single_column_restriction::EQ::is_satisfied_by(bytes_view data, const query_options& options) const {
if (_column_def.type->is_counter()) {
fail(unimplemented::cause::COUNTERS);
}
auto operand = value(options);
return operand && _column_def.type->compare(*operand, data) == 0;
}
@@ -654,9 +648,6 @@ bool single_column_restriction::IN::is_satisfied_by(const schema& schema,
const row& cells,
const query_options& options,
gc_clock::time_point now) const {
if (_column_def.type->is_counter()) {
fail(unimplemented::cause::COUNTERS);
}
auto cell_value = get_value(schema, key, ckey, cells, now);
if (!cell_value) {
return false;
@@ -670,9 +661,6 @@ bool single_column_restriction::IN::is_satisfied_by(const schema& schema,
}
bool single_column_restriction::IN::is_satisfied_by(bytes_view data, const query_options& options) const {
if (_column_def.type->is_counter()) {
fail(unimplemented::cause::COUNTERS);
}
auto operands = values(options);
return boost::algorithm::any_of(operands, [this, &data] (const bytes_opt& operand) {
return operand && _column_def.type->compare(*operand, data) == 0;
@@ -722,9 +710,6 @@ bool single_column_restriction::slice::is_satisfied_by(const schema& schema,
}
bool single_column_restriction::slice::is_satisfied_by(bytes_view data, const query_options& options) const {
if (_column_def.type->is_counter()) {
fail(unimplemented::cause::COUNTERS);
}
return contains_without_wraparound(to_range(_slice, options),
data, _column_def.type->underlying_type()->as_tri_comparator());
}
@@ -735,9 +720,6 @@ bool single_column_restriction::contains::is_satisfied_by(const schema& schema,
const row& cells,
const query_options& options,
gc_clock::time_point now) const {
if (_column_def.type->is_counter()) {
fail(unimplemented::cause::COUNTERS);
}
if (!_column_def.type->is_collection()) {
return false;
}


@@ -207,6 +207,9 @@ void alter_table_statement::add_column(const schema& schema, const table& cf, sc
"because a collection with the same name and a different type has already been used in the past", column_name));
}
}
if (type->is_counter() && !schema.is_counter()) {
throw exceptions::configuration_exception(format("Cannot add a counter column ({}) in a non counter column family", column_name));
}
cfm.with_column(column_name.name(), type, is_static ? column_kind::static_column : column_kind::regular_column);
@@ -222,7 +225,7 @@ void alter_table_statement::add_column(const schema& schema, const table& cf, sc
schema_builder builder(view);
if (view->view_info()->include_all_columns()) {
builder.with_column(column_name.name(), type);
} else if (view->view_info()->base_non_pk_columns_in_view_pk().empty()) {
} else if (!view->view_info()->has_base_non_pk_columns_in_view_pk()) {
db::view::create_virtual_column(builder, column_name.name(), type);
}
view_updates.push_back(view_ptr(builder.build()));


@@ -2009,9 +2009,10 @@ flat_mutation_reader make_multishard_streaming_reader(distributed<database>& db,
reader_concurrency_semaphore* semaphore;
};
distributed<database>& _db;
utils::UUID _table_id;
std::vector<reader_context> _contexts;
public:
explicit streaming_reader_lifecycle_policy(distributed<database>& db) : _db(db), _contexts(smp::count) {
streaming_reader_lifecycle_policy(distributed<database>& db, utils::UUID table_id) : _db(db), _table_id(table_id), _contexts(smp::count) {
}
virtual flat_mutation_reader create_reader(
schema_ptr schema,
@@ -2040,7 +2041,12 @@ flat_mutation_reader make_multishard_streaming_reader(distributed<database>& db,
});
}
virtual reader_concurrency_semaphore& semaphore() override {
return *_contexts[engine().cpu_id()].semaphore;
const auto shard = engine().cpu_id();
if (!_contexts[shard].semaphore) {
auto& cf = _db.local().find_column_family(_table_id);
_contexts[shard].semaphore = &cf.streaming_read_concurrency_semaphore();
}
return *_contexts[shard].semaphore;
}
};
auto ms = mutation_source([&db] (schema_ptr s,
@@ -2051,7 +2057,8 @@ flat_mutation_reader make_multishard_streaming_reader(distributed<database>& db,
tracing::trace_state_ptr trace_state,
streamed_mutation::forwarding,
mutation_reader::forwarding fwd_mr) {
return make_multishard_combining_reader(make_shared<streaming_reader_lifecycle_policy>(db), std::move(s), pr, ps, pc,
auto table_id = s->id();
return make_multishard_combining_reader(make_shared<streaming_reader_lifecycle_policy>(db, table_id), std::move(s), pr, ps, pc,
std::move(trace_state), fwd_mr);
});
auto&& full_slice = schema->full_slice();
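
Note on the hunks above: `semaphore()` now appears to resolve the per-shard semaphore pointer on first use, looking the table up by the id passed to the constructor. A minimal sketch of that lazy-lookup shape follows. All types here (`semaphore_sketch`, `shard_db_sketch`, `lifecycle_policy_sketch`) are hypothetical stand-ins, and the shard index is passed explicitly instead of being read from `engine().cpu_id()`.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for reader_concurrency_semaphore.
struct semaphore_sketch {
    int units;
};

// Hypothetical per-shard registry standing in for a table lookup by id.
struct shard_db_sketch {
    std::unordered_map<int, semaphore_sketch> semaphores_by_table;
    semaphore_sketch& find(int table_id) { return semaphores_by_table.at(table_id); }
};

class lifecycle_policy_sketch {
    shard_db_sketch& _db;
    int _table_id;
    std::vector<semaphore_sketch*> _contexts;  // one slot per shard, null until resolved
public:
    lifecycle_policy_sketch(shard_db_sketch& db, int table_id, std::size_t shards)
        : _db(db), _table_id(table_id), _contexts(shards, nullptr) {}

    // Resolve the pointer on first use so a slot that was never populated
    // by create_reader() is not dereferenced while still null.
    semaphore_sketch& semaphore(std::size_t shard) {
        if (!_contexts[shard]) {
            _contexts[shard] = &_db.find(_table_id);
        }
        return *_contexts[shard];
    }
};
```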


@@ -55,6 +55,7 @@
#include <limits>
#include <cstddef>
#include "schema_fwd.hh"
#include "db/view/view.hh"
#include "db/schema_features.hh"
#include "gms/feature.hh"
#include "timestamp.hh"
@@ -885,7 +886,7 @@ public:
lw_shared_ptr<const sstable_list> get_sstables_including_compacted_undeleted() const;
const std::vector<sstables::shared_sstable>& compacted_undeleted_sstables() const;
std::vector<sstables::shared_sstable> select_sstables(const dht::partition_range& range) const;
std::vector<sstables::shared_sstable> candidates_for_compaction() const;
std::vector<sstables::shared_sstable> non_staging_sstables() const;
std::vector<sstables::shared_sstable> sstables_need_rewrite() const;
size_t sstables_count() const;
std::vector<uint64_t> sstable_count_per_level() const;
@@ -981,8 +982,9 @@ public:
return *_config.sstables_manager;
}
// Reader's schema must be the same as the base schema of each of the views.
future<> populate_views(
std::vector<view_ptr>,
std::vector<db::view::view_and_base>,
dht::token base_token,
flat_mutation_reader&&);
@@ -998,7 +1000,7 @@ private:
future<row_locker::lock_holder> do_push_view_replica_updates(const schema_ptr& s, mutation&& m, db::timeout_clock::time_point timeout, mutation_source&& source, const io_priority_class& io_priority) const;
std::vector<view_ptr> affected_views(const schema_ptr& base, const mutation& update) const;
future<> generate_and_propagate_view_updates(const schema_ptr& base,
std::vector<view_ptr>&& views,
std::vector<db::view::view_and_base>&& views,
mutation&& m,
flat_mutation_reader_opt existings) const;


@@ -822,6 +822,14 @@ future<> merge_schema(distributed<service::storage_proxy>& proxy, gms::feature_s
});
}
future<> recalculate_schema_version(distributed<service::storage_proxy>& proxy, gms::feature_service& feat) {
return merge_lock().then([&proxy, &feat] {
return update_schema_version_and_announce(proxy, feat.cluster_schema_features());
}).finally([] {
return merge_unlock();
});
}
future<> merge_schema(distributed<service::storage_proxy>& proxy, std::vector<mutation> mutations, bool do_flush)
{
return merge_lock().then([&proxy, mutations = std::move(mutations), do_flush] () mutable {


@@ -170,6 +170,13 @@ future<> merge_schema(distributed<service::storage_proxy>& proxy, gms::feature_s
future<> merge_schema(distributed<service::storage_proxy>& proxy, std::vector<mutation> mutations, bool do_flush);
// Recalculates the local schema version and publishes it in gossip.
//
// It is safe to call concurrently with recalculate_schema_version() and merge_schema(), in which case
// the schema version we end up with after all calls is guaranteed to reflect the most recent state
// of feature_service and schema tables.
future<> recalculate_schema_version(distributed<service::storage_proxy>& proxy, gms::feature_service& feat);
future<std::set<sstring>> merge_keyspaces(distributed<service::storage_proxy>& proxy, schema_result&& before, schema_result&& after);
std::vector<mutation> make_create_keyspace_mutations(lw_shared_ptr<keyspace_metadata> keyspace, api::timestamp_type timestamp, bool with_tables_and_types_and_functions = true);


@@ -130,17 +130,26 @@ const column_definition* view_info::view_column(const column_definition& base_de
return _schema.get_column_definition(base_def.name());
}
const std::vector<column_id>& view_info::base_non_pk_columns_in_view_pk() const {
return _base_non_pk_columns_in_view_pk;
void view_info::set_base_info(db::view::base_info_ptr base_info) {
_base_info = std::move(base_info);
}
void view_info::initialize_base_dependent_fields(const schema& base) {
db::view::base_info_ptr view_info::make_base_dependent_view_info(const schema& base) const {
std::vector<column_id> base_non_pk_columns_in_view_pk;
for (auto&& view_col : boost::range::join(_schema.partition_key_columns(), _schema.clustering_key_columns())) {
auto* base_col = base.get_column_definition(view_col.name());
if (base_col && !base_col->is_primary_key()) {
_base_non_pk_columns_in_view_pk.push_back(base_col->id);
base_non_pk_columns_in_view_pk.push_back(base_col->id);
}
}
return make_lw_shared<db::view::base_dependent_view_info>({
.base_schema = base.shared_from_this(),
.base_non_pk_columns_in_view_pk = std::move(base_non_pk_columns_in_view_pk)
});
}
bool view_info::has_base_non_pk_columns_in_view_pk() const {
return !_base_info->base_non_pk_columns_in_view_pk.empty();
}
namespace db {
@@ -188,11 +197,11 @@ bool may_be_affected_by(const schema& base, const view_info& view, const dht::de
}
static bool update_requires_read_before_write(const schema& base,
const std::vector<view_ptr>& views,
const std::vector<view_and_base>& views,
const dht::decorated_key& key,
const rows_entry& update) {
for (auto&& v : views) {
view_info& vf = *v->view_info();
view_info& vf = *v.view->view_info();
if (may_be_affected_by(base, vf, key, update)) {
return true;
}
@@ -239,12 +248,14 @@ class view_updates final {
view_ptr _view;
const view_info& _view_info;
schema_ptr _base;
base_info_ptr _base_info;
std::unordered_map<partition_key, mutation_partition, partition_key::hashing, partition_key::equality> _updates;
public:
explicit view_updates(view_ptr view, schema_ptr base)
: _view(std::move(view))
explicit view_updates(view_and_base vab)
: _view(std::move(vab.view))
, _view_info(*_view->view_info())
, _base(std::move(base))
, _base(vab.base->base_schema)
, _base_info(vab.base)
, _updates(8, partition_key::hashing(*_view), partition_key::equality(*_view)) {
}
@@ -306,7 +317,7 @@ row_marker view_updates::compute_row_marker(const clustering_row& base_row) cons
// they share liveness information. It's true especially in the only case currently allowed by CQL,
// which assumes there's up to one non-pk column in the view key. It's also true in alternator,
// which does not carry TTL information.
const auto& col_ids = _view_info.base_non_pk_columns_in_view_pk();
const auto& col_ids = _base_info->base_non_pk_columns_in_view_pk;
if (!col_ids.empty()) {
auto& def = _base->regular_column_at(col_ids[0]);
// Note: multi-cell columns can't be part of the primary key.
@@ -537,7 +548,7 @@ void view_updates::delete_old_entry(const partition_key& base_key, const cluster
void view_updates::do_delete_old_entry(const partition_key& base_key, const clustering_row& existing, const clustering_row& update, gc_clock::time_point now) {
auto& r = get_view_row(base_key, existing);
const auto& col_ids = _view_info.base_non_pk_columns_in_view_pk();
const auto& col_ids = _base_info->base_non_pk_columns_in_view_pk;
if (!col_ids.empty()) {
// We delete the old row using a shadowable row tombstone, making sure that
// the tombstone deletes everything in the row (or it might still show up).
@@ -678,7 +689,7 @@ void view_updates::generate_update(
return;
}
const auto& col_ids = _view_info.base_non_pk_columns_in_view_pk();
const auto& col_ids = _base_info->base_non_pk_columns_in_view_pk;
if (col_ids.empty()) {
// The view key is necessarily the same pre and post update.
if (existing && existing->is_live(*_base)) {
@@ -932,11 +943,16 @@ future<stop_iteration> view_update_builder::on_results() {
future<std::vector<frozen_mutation_and_schema>> generate_view_updates(
const schema_ptr& base,
std::vector<view_ptr>&& views_to_update,
std::vector<view_and_base>&& views_to_update,
flat_mutation_reader&& updates,
flat_mutation_reader_opt&& existings) {
auto vs = boost::copy_range<std::vector<view_updates>>(views_to_update | boost::adaptors::transformed([&] (auto&& v) {
return view_updates(std::move(v), base);
auto vs = boost::copy_range<std::vector<view_updates>>(views_to_update | boost::adaptors::transformed([&] (view_and_base v) {
if (base->version() != v.base->base_schema->version()) {
on_internal_error(vlogger, format("Schema version used for view updates ({}) does not match the current"
" base schema version of the view ({}) for view {}.{} of {}.{}",
base->version(), v.base->base_schema->version(), v.view->ks_name(), v.view->cf_name(), base->ks_name(), base->cf_name()));
}
return view_updates(std::move(v));
}));
auto builder = std::make_unique<view_update_builder>(base, std::move(vs), std::move(updates), std::move(existings));
auto f = builder->build();
@@ -946,18 +962,18 @@ future<std::vector<frozen_mutation_and_schema>> generate_view_updates(
query::clustering_row_ranges calculate_affected_clustering_ranges(const schema& base,
const dht::decorated_key& key,
const mutation_partition& mp,
const std::vector<view_ptr>& views) {
const std::vector<view_and_base>& views) {
std::vector<nonwrapping_range<clustering_key_prefix_view>> row_ranges;
std::vector<nonwrapping_range<clustering_key_prefix_view>> view_row_ranges;
clustering_key_prefix_view::tri_compare cmp(base);
if (mp.partition_tombstone() || !mp.row_tombstones().empty()) {
for (auto&& v : views) {
// FIXME: #2371
if (v->view_info()->select_statement().get_restrictions()->has_unrestricted_clustering_columns()) {
if (v.view->view_info()->select_statement().get_restrictions()->has_unrestricted_clustering_columns()) {
view_row_ranges.push_back(nonwrapping_range<clustering_key_prefix_view>::make_open_ended_both_sides());
break;
}
for (auto&& r : v->view_info()->partition_slice().default_row_ranges()) {
for (auto&& r : v.view->view_info()->partition_slice().default_row_ranges()) {
view_row_ranges.push_back(r.transform(std::mem_fn(&clustering_key_prefix::view)));
}
}
@@ -1717,7 +1733,7 @@ public:
return stop_iteration::yes;
}
_fragments_memory_usage += cr.memory_usage(*_step.base->schema());
_fragments_memory_usage += cr.memory_usage(*_step.reader.schema());
_fragments.push_back(std::move(cr));
if (_fragments_memory_usage > batch_memory_max) {
// Although we have not yet completed the batch of base rows that
@@ -1737,10 +1753,14 @@ public:
_builder._as.check();
if (!_fragments.empty()) {
_fragments.push_front(partition_start(_step.current_key, tombstone()));
auto base_schema = _step.base->schema();
auto views = with_base_info_snapshot(_views_to_build);
auto reader = make_flat_mutation_reader_from_fragments(_step.reader.schema(), std::move(_fragments));
reader.upgrade_schema(base_schema);
_step.base->populate_views(
_views_to_build,
std::move(views),
_step.current_token(),
make_flat_mutation_reader_from_fragments(_step.base->schema(), std::move(_fragments))).get();
std::move(reader)).get();
_fragments.clear();
_fragments_memory_usage = 0;
}
@@ -1887,5 +1907,11 @@ future<bool> check_needs_view_update_path(db::system_distributed_keyspace& sys_d
});
}
std::vector<db::view::view_and_base> with_base_info_snapshot(std::vector<view_ptr> vs) {
return boost::copy_range<std::vector<db::view::view_and_base>>(vs | boost::adaptors::transformed([] (const view_ptr& v) {
return db::view::view_and_base{v, v->view_info()->base_info()};
}));
}
} // namespace view
} // namespace db


@@ -43,6 +43,27 @@ namespace db {
namespace view {
// Part of the view description which depends on the base schema version.
//
// This structure may change even though the view schema doesn't change, so
// it needs to live outside view_ptr.
struct base_dependent_view_info {
schema_ptr base_schema;
// Ids of the regular base table columns included in the view's PK, if any.
// Scylla views allow only one such column; alternator can have up to two.
std::vector<column_id> base_non_pk_columns_in_view_pk;
};
// Immutable snapshot of view's base-schema-dependent part.
using base_info_ptr = lw_shared_ptr<const base_dependent_view_info>;
// Snapshot of the view schema and its base-schema-dependent part.
struct view_and_base {
view_ptr view;
base_info_ptr base;
};
/**
* Whether the view filter considers the specified partition key.
*
@@ -92,7 +113,7 @@ bool clustering_prefix_matches(const schema& base, const partition_key& key, con
future<std::vector<frozen_mutation_and_schema>> generate_view_updates(
const schema_ptr& base,
std::vector<view_ptr>&& views_to_update,
std::vector<view_and_base>&& views_to_update,
flat_mutation_reader&& updates,
flat_mutation_reader_opt&& existings);
@@ -100,7 +121,7 @@ query::clustering_row_ranges calculate_affected_clustering_ranges(
const schema& base,
const dht::decorated_key& key,
const mutation_partition& mp,
const std::vector<view_ptr>& views);
const std::vector<view_and_base>& views);
struct wait_for_all_updates_tag {};
using wait_for_all_updates = bool_class<wait_for_all_updates_tag>;
@@ -128,6 +149,13 @@ future<> mutate_MV(
*/
void create_virtual_column(schema_builder& builder, const bytes& name, const data_type& type);
/**
* Converts a collection of view schema snapshots into a collection of
* view_and_base objects, which are snapshots of both the view schema
* and the base-schema-dependent part of view description.
*/
std::vector<view_and_base> with_base_info_snapshot(std::vector<view_ptr>);
}
}
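
Note on the hunks above: `view_and_base` pairs a view schema with an immutable snapshot of its base-dependent info, so later schema changes do not alter what an in-flight update sees. A minimal sketch of the snapshot idea follows, using `std::shared_ptr` and hypothetical `*_sketch` types rather than `lw_shared_ptr`, `view_ptr`, and `schema`.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical simplified types; the real schema and view_ptr are not reproduced.
struct base_info_sketch {
    int base_version;
    std::vector<int> base_non_pk_columns_in_view_pk;
};
using base_info_ptr_sketch = std::shared_ptr<const base_info_sketch>;

struct view_sketch {
    std::string name;
    base_info_ptr_sketch base_info;  // replaced wholesale when the base schema changes
};

struct view_and_base_sketch {
    std::shared_ptr<const view_sketch> view;
    base_info_ptr_sketch base;  // pinned at snapshot time
};

// Captures the current base info of every view. Later calls that swap
// view_sketch::base_info do not affect snapshots already taken.
inline std::vector<view_and_base_sketch> with_base_info_snapshot_sketch(
        const std::vector<std::shared_ptr<view_sketch>>& views) {
    std::vector<view_and_base_sketch> out;
    out.reserve(views.size());
    for (const auto& v : views) {
        out.push_back(view_and_base_sketch{v, v->base_info});
    }
    return out;
}
```

Because the snapshot holds a pointer to a `const` object that is replaced rather than mutated, the column ids in it stay consistent with the base schema version they were computed from.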


@@ -182,7 +182,7 @@ class aws_instance:
instance_size = self.instance_size()
if instance_class in ['c3', 'c4', 'd2', 'i2', 'r3']:
return 'ixgbevf'
if instance_class in ['a1', 'c5', 'c5d', 'f1', 'g3', 'g4', 'h1', 'i3', 'i3en', 'inf1', 'm5', 'm5a', 'm5ad', 'm5d', 'm5dn', 'm5n', 'm6g', 'p2', 'p3', 'r4', 'r5', 'r5a', 'r5ad', 'r5d', 'r5dn', 'r5n', 't3', 't3a', 'u-6tb1', 'u-9tb1', 'u-12tb1', 'u-18tn1', 'u-24tb1', 'x1', 'x1e', 'z1d']:
if instance_class in ['a1', 'c5', 'c5a', 'c5d', 'c5n', 'c6g', 'c6gd', 'f1', 'g3', 'g4', 'h1', 'i3', 'i3en', 'inf1', 'm5', 'm5a', 'm5ad', 'm5d', 'm5dn', 'm5n', 'm6g', 'm6gd', 'p2', 'p3', 'r4', 'r5', 'r5a', 'r5ad', 'r5d', 'r5dn', 'r5n', 't3', 't3a', 'u-6tb1', 'u-9tb1', 'u-12tb1', 'u-18tn1', 'u-24tb1', 'x1', 'x1e', 'z1d']:
return 'ena'
if instance_class == 'm4':
if instance_size == '16xlarge':


@@ -37,6 +37,7 @@ override_dh_strip:
# The binaries (ethtool...patchelf) don't pass dh_strip after going through patchelf. Since they are
# already stripped, nothing is lost if we exclude them, so that's what we do.
dh_strip -Xlibprotobuf.so.15 -Xld.so -Xethtool -Xgawk -Xgzip -Xhwloc-calc -Xhwloc-distrib -Xifconfig -Xlscpu -Xnetstat -Xpatchelf --dbg-package={{product}}-server-dbg
find $(CURDIR)/debian/{{product}}-server-dbg/usr/lib/debug/.build-id/ -name "*.debug" -exec objcopy --decompress-debug-sections {} \;
override_dh_makeshlibs:


@@ -487,6 +487,9 @@ public:
size_t buffer_size() const {
return _impl->buffer_size();
}
const circular_buffer<mutation_fragment>& buffer() const {
return _impl->buffer();
}
// Detach the internal buffer of the reader.
// Roughly equivalent to depleting it by calling pop_mutation_fragment()
// until is_buffer_empty() returns true.


@@ -428,6 +428,7 @@ future<> gossiper::handle_shutdown_msg(inet_address from) {
return make_ready_future<>();
}
return seastar::async([this, from] {
auto permit = this->lock_endpoint(from).get0();
this->mark_as_shutdown(from);
});
}


@@ -126,6 +126,7 @@ relocate_python3() {
cp "$script" "$relocateddir"
cat > "$install"<<EOF
#!/usr/bin/env bash
export LC_ALL=en_US.UTF-8
x="\$(readlink -f "\$0")"
b="\$(basename "\$x")"
d="\$(dirname "\$x")"


@@ -144,10 +144,33 @@ insert_token_range_to_sorted_container_while_unwrapping(
dht::token_range_vector
abstract_replication_strategy::get_ranges(inet_address ep) const {
return do_get_ranges(ep, _token_metadata, false);
}
dht::token_range_vector
abstract_replication_strategy::get_ranges_in_thread(inet_address ep) const {
return do_get_ranges(ep, _token_metadata, true);
}
dht::token_range_vector
abstract_replication_strategy::get_ranges(inet_address ep, token_metadata& tm) const {
return do_get_ranges(ep, tm, false);
}
dht::token_range_vector
abstract_replication_strategy::get_ranges_in_thread(inet_address ep, token_metadata& tm) const {
return do_get_ranges(ep, tm, true);
}
dht::token_range_vector
abstract_replication_strategy::do_get_ranges(inet_address ep, token_metadata& tm, bool can_yield) const {
dht::token_range_vector ret;
auto prev_tok = _token_metadata.sorted_tokens().back();
for (auto tok : _token_metadata.sorted_tokens()) {
for (inet_address a : calculate_natural_endpoints(tok, _token_metadata)) {
auto prev_tok = tm.sorted_tokens().back();
for (auto tok : tm.sorted_tokens()) {
for (inet_address a : calculate_natural_endpoints(tok, tm)) {
if (can_yield) {
seastar::thread::maybe_yield();
}
if (a == ep) {
insert_token_range_to_sorted_container_while_unwrapping(prev_tok, tok, ret);
break;
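
Note on the hunk above: `do_get_ranges` walks the sorted tokens of the `token_metadata` it is given, starting with the last token as the previous one so that the first range wraps around the ring. A heavily simplified sketch of that walk follows. It assumes a hypothetical ring where each token has a single owner (the real code asks the replication strategy for a list of endpoints), and it neither unwraps the wrapping range nor yields.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical ring model: each token is owned by exactly one node id.
struct ring_entry_sketch {
    std::int64_t token;
    int owner;
};

// Returns the (prev, tok] ranges owned by `node`. `ring` must be sorted by
// token and non-empty. The ring is passed in explicitly, so the caller
// decides which view of the topology is used.
inline std::vector<std::pair<std::int64_t, std::int64_t>>
ranges_for_sketch(const std::vector<ring_entry_sketch>& ring, int node) {
    std::vector<std::pair<std::int64_t, std::int64_t>> ret;
    std::int64_t prev = ring.back().token;  // first range wraps around the ring
    for (const auto& e : ring) {
        if (e.owner == node) {
            ret.emplace_back(prev, e.token);
        }
        prev = e.token;
    }
    return ret;
}
```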


@@ -106,6 +106,15 @@ public:
// It is the analogue of Origin's getAddressRanges().get(endpoint).
// This function is not efficient, and not meant for the fast path.
dht::token_range_vector get_ranges(inet_address ep) const;
dht::token_range_vector get_ranges_in_thread(inet_address ep) const;
// Use the token_metadata provided by the caller instead of _token_metadata
dht::token_range_vector get_ranges(inet_address ep, token_metadata& tm) const;
dht::token_range_vector get_ranges_in_thread(inet_address ep, token_metadata& tm) const;
private:
dht::token_range_vector do_get_ranges(inet_address ep, token_metadata& tm, bool can_yield) const;
public:
// get_primary_ranges() returns the list of "primary ranges" for the given
// endpoint. "Primary ranges" are the ranges that the node is responsible
// for storing replica primarily, which means this is the first node

main.cc

@@ -947,12 +947,16 @@ int main(int ac, char** av) {
mm.init_messaging_service();
}).get();
supervisor::notify("initializing storage proxy RPC verbs");
proxy.invoke_on_all([] (service::storage_proxy& p) {
p.init_messaging_service();
}).get();
proxy.invoke_on_all(&service::storage_proxy::init_messaging_service).get();
auto stop_proxy_handlers = defer_verbose_shutdown("storage proxy RPC verbs", [&proxy] {
proxy.invoke_on_all(&service::storage_proxy::uninit_messaging_service).get();
});
supervisor::notify("starting streaming service");
streaming::stream_session::init_streaming_service(db, sys_dist_ks, view_update_generator).get();
auto stop_streaming_service = defer_verbose_shutdown("streaming service", [] {
streaming::stream_session::uninit_streaming_service().get();
});
api::set_server_stream_manager(ctx).get();
supervisor::notify("starting hinted handoff manager");
@@ -985,6 +989,9 @@ int main(int ac, char** av) {
rs.stop().get();
});
repair_init_messaging_service_handler(rs, sys_dist_ks, view_update_generator).get();
auto stop_repair_messages = defer_verbose_shutdown("repair message handlers", [] {
repair_uninit_messaging_service_handler().get();
});
supervisor::notify("starting storage service", true);
auto& ss = service::get_local_storage_service();
ss.init_messaging_service_part().get();
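
Note on the hunks above: each newly registered set of RPC handlers is paired with a deferred action that unregisters it, so teardown runs in reverse order of startup. A minimal sketch of that scope-guard pattern follows. `deferred_sketch` and `run_startup_sketch` are hypothetical names, and the real `defer_verbose_shutdown` helper is not reproduced here.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical scope guard that runs a callback when it goes out of scope.
class deferred_sketch {
    std::function<void()> _fn;
public:
    explicit deferred_sketch(std::function<void()> fn) : _fn(std::move(fn)) {}
    deferred_sketch(const deferred_sketch&) = delete;
    deferred_sketch& operator=(const deferred_sketch&) = delete;
    ~deferred_sketch() {
        if (_fn) {
            _fn();
        }
    }
};

// Starts two pretend services and returns the full startup/teardown log.
inline std::vector<std::string> run_startup_sketch() {
    std::vector<std::string> log;
    {
        log.push_back("init proxy verbs");
        deferred_sketch stop_proxy([&log] { log.push_back("uninit proxy verbs"); });
        log.push_back("init streaming");
        deferred_sketch stop_streaming([&log] { log.push_back("uninit streaming"); });
    }  // guards fire here in reverse declaration order
    return log;
}
```

Declaring the guard immediately after the matching init call means an early exit between the two services still unregisters whatever was already registered.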


@@ -718,6 +718,10 @@ void messaging_service::register_stream_mutation_fragments(std::function<future<
register_handler(this, messaging_verb::STREAM_MUTATION_FRAGMENTS, std::move(func));
}
future<> messaging_service::unregister_stream_mutation_fragments() {
return unregister_handler(messaging_verb::STREAM_MUTATION_FRAGMENTS);
}
template<class SinkType, class SourceType>
future<rpc::sink<SinkType>, rpc::source<SourceType>>
do_make_sink_source(messaging_verb verb, uint32_t repair_meta_id, shared_ptr<messaging_service::rpc_protocol_client_wrapper> rpc_client, std::unique_ptr<messaging_service::rpc_protocol_wrapper>& rpc) {
@@ -749,6 +753,9 @@ rpc::sink<repair_row_on_wire_with_cmd> messaging_service::make_sink_for_repair_g
void messaging_service::register_repair_get_row_diff_with_rpc_stream(std::function<future<rpc::sink<repair_row_on_wire_with_cmd>> (const rpc::client_info& cinfo, uint32_t repair_meta_id, rpc::source<repair_hash_with_cmd> source)>&& func) {
register_handler(this, messaging_verb::REPAIR_GET_ROW_DIFF_WITH_RPC_STREAM, std::move(func));
}
future<> messaging_service::unregister_repair_get_row_diff_with_rpc_stream() {
return unregister_handler(messaging_verb::REPAIR_GET_ROW_DIFF_WITH_RPC_STREAM);
}
// Wrapper for REPAIR_PUT_ROW_DIFF_WITH_RPC_STREAM
future<rpc::sink<repair_row_on_wire_with_cmd>, rpc::source<repair_stream_cmd>>
@@ -768,6 +775,9 @@ rpc::sink<repair_stream_cmd> messaging_service::make_sink_for_repair_put_row_dif
void messaging_service::register_repair_put_row_diff_with_rpc_stream(std::function<future<rpc::sink<repair_stream_cmd>> (const rpc::client_info& cinfo, uint32_t repair_meta_id, rpc::source<repair_row_on_wire_with_cmd> source)>&& func) {
register_handler(this, messaging_verb::REPAIR_PUT_ROW_DIFF_WITH_RPC_STREAM, std::move(func));
}
future<> messaging_service::unregister_repair_put_row_diff_with_rpc_stream() {
return unregister_handler(messaging_verb::REPAIR_PUT_ROW_DIFF_WITH_RPC_STREAM);
}
// Wrapper for REPAIR_GET_FULL_ROW_HASHES_WITH_RPC_STREAM
future<rpc::sink<repair_stream_cmd>, rpc::source<repair_hash_with_cmd>>
@@ -787,6 +797,9 @@ rpc::sink<repair_hash_with_cmd> messaging_service::make_sink_for_repair_get_full
void messaging_service::register_repair_get_full_row_hashes_with_rpc_stream(std::function<future<rpc::sink<repair_hash_with_cmd>> (const rpc::client_info& cinfo, uint32_t repair_meta_id, rpc::source<repair_stream_cmd> source)>&& func) {
register_handler(this, messaging_verb::REPAIR_GET_FULL_ROW_HASHES_WITH_RPC_STREAM, std::move(func));
}
future<> messaging_service::unregister_repair_get_full_row_hashes_with_rpc_stream() {
return unregister_handler(messaging_verb::REPAIR_GET_FULL_ROW_HASHES_WITH_RPC_STREAM);
}
// Send a message for verb
template <typename MsgIn, typename... MsgOut>
@@ -870,6 +883,9 @@ future<streaming::prepare_message> messaging_service::send_prepare_message(msg_a
return send_message<streaming::prepare_message>(this, messaging_verb::PREPARE_MESSAGE, id,
std::move(msg), plan_id, std::move(description), reason);
}
future<> messaging_service::unregister_prepare_message() {
return unregister_handler(messaging_verb::PREPARE_MESSAGE);
}
// PREPARE_DONE_MESSAGE
void messaging_service::register_prepare_done_message(std::function<future<> (const rpc::client_info& cinfo, UUID plan_id, unsigned dst_cpu_id)>&& func) {
@@ -879,6 +895,9 @@ future<> messaging_service::send_prepare_done_message(msg_addr id, UUID plan_id,
return send_message<void>(this, messaging_verb::PREPARE_DONE_MESSAGE, id,
plan_id, dst_cpu_id);
}
future<> messaging_service::unregister_prepare_done_message() {
return unregister_handler(messaging_verb::PREPARE_DONE_MESSAGE);
}
// STREAM_MUTATION
void messaging_service::register_stream_mutation(std::function<future<> (const rpc::client_info& cinfo, UUID plan_id, frozen_mutation fm, unsigned dst_cpu_id, rpc::optional<bool> fragmented, rpc::optional<streaming::stream_reason> reason)>&& func) {
@@ -903,6 +922,9 @@ future<> messaging_service::send_stream_mutation_done(msg_addr id, UUID plan_id,
return send_message<void>(this, messaging_verb::STREAM_MUTATION_DONE, id,
plan_id, std::move(ranges), cf_id, dst_cpu_id);
}
future<> messaging_service::unregister_stream_mutation_done() {
return unregister_handler(messaging_verb::STREAM_MUTATION_DONE);
}
// COMPLETE_MESSAGE
void messaging_service::register_complete_message(std::function<future<> (const rpc::client_info& cinfo, UUID plan_id, unsigned dst_cpu_id, rpc::optional<bool> failed)>&& func) {
@@ -912,6 +934,9 @@ future<> messaging_service::send_complete_message(msg_addr id, UUID plan_id, uns
return send_message<void>(this, messaging_verb::COMPLETE_MESSAGE, id,
plan_id, dst_cpu_id, failed);
}
future<> messaging_service::unregister_complete_message() {
return unregister_handler(messaging_verb::COMPLETE_MESSAGE);
}
void messaging_service::register_gossip_echo(std::function<future<> ()>&& func) {
register_handler(this, messaging_verb::GOSSIP_ECHO, std::move(func));

View File

@@ -275,10 +275,12 @@ public:
streaming::prepare_message msg, UUID plan_id, sstring description, rpc::optional<streaming::stream_reason> reason)>&& func);
future<streaming::prepare_message> send_prepare_message(msg_addr id, streaming::prepare_message msg, UUID plan_id,
sstring description, streaming::stream_reason);
future<> unregister_prepare_message();
// Wrapper for PREPARE_DONE_MESSAGE verb
void register_prepare_done_message(std::function<future<> (const rpc::client_info& cinfo, UUID plan_id, unsigned dst_cpu_id)>&& func);
future<> send_prepare_done_message(msg_addr id, UUID plan_id, unsigned dst_cpu_id);
future<> unregister_prepare_done_message();
// Wrapper for STREAM_MUTATION verb
void register_stream_mutation(std::function<future<> (const rpc::client_info& cinfo, UUID plan_id, frozen_mutation fm, unsigned dst_cpu_id, rpc::optional<bool>, rpc::optional<streaming::stream_reason>)>&& func);
@@ -287,6 +289,7 @@ public:
// Wrapper for STREAM_MUTATION_FRAGMENTS
// The receiver of STREAM_MUTATION_FRAGMENTS sends a status code to the sender to report any error on the receiver side. The status code is of type int32_t: 0 means success, -1 means error, and all other values are reserved for future use.
void register_stream_mutation_fragments(std::function<future<rpc::sink<int32_t>> (const rpc::client_info& cinfo, UUID plan_id, UUID schema_id, UUID cf_id, uint64_t estimated_partitions, rpc::optional<streaming::stream_reason> reason_opt, rpc::source<frozen_mutation_fragment, rpc::optional<streaming::stream_mutation_fragments_cmd>> source)>&& func);
future<> unregister_stream_mutation_fragments();
rpc::sink<int32_t> make_sink_for_stream_mutation_fragments(rpc::source<frozen_mutation_fragment, rpc::optional<streaming::stream_mutation_fragments_cmd>>& source);
future<rpc::sink<frozen_mutation_fragment, streaming::stream_mutation_fragments_cmd>, rpc::source<int32_t>> make_sink_and_source_for_stream_mutation_fragments(utils::UUID schema_id, utils::UUID plan_id, utils::UUID cf_id, uint64_t estimated_partitions, streaming::stream_reason reason, msg_addr id);
@@ -294,22 +297,27 @@ public:
future<rpc::sink<repair_hash_with_cmd>, rpc::source<repair_row_on_wire_with_cmd>> make_sink_and_source_for_repair_get_row_diff_with_rpc_stream(uint32_t repair_meta_id, msg_addr id);
rpc::sink<repair_row_on_wire_with_cmd> make_sink_for_repair_get_row_diff_with_rpc_stream(rpc::source<repair_hash_with_cmd>& source);
void register_repair_get_row_diff_with_rpc_stream(std::function<future<rpc::sink<repair_row_on_wire_with_cmd>> (const rpc::client_info& cinfo, uint32_t repair_meta_id, rpc::source<repair_hash_with_cmd> source)>&& func);
future<> unregister_repair_get_row_diff_with_rpc_stream();
// Wrapper for REPAIR_PUT_ROW_DIFF_WITH_RPC_STREAM
future<rpc::sink<repair_row_on_wire_with_cmd>, rpc::source<repair_stream_cmd>> make_sink_and_source_for_repair_put_row_diff_with_rpc_stream(uint32_t repair_meta_id, msg_addr id);
rpc::sink<repair_stream_cmd> make_sink_for_repair_put_row_diff_with_rpc_stream(rpc::source<repair_row_on_wire_with_cmd>& source);
void register_repair_put_row_diff_with_rpc_stream(std::function<future<rpc::sink<repair_stream_cmd>> (const rpc::client_info& cinfo, uint32_t repair_meta_id, rpc::source<repair_row_on_wire_with_cmd> source)>&& func);
future<> unregister_repair_put_row_diff_with_rpc_stream();
// Wrapper for REPAIR_GET_FULL_ROW_HASHES_WITH_RPC_STREAM
future<rpc::sink<repair_stream_cmd>, rpc::source<repair_hash_with_cmd>> make_sink_and_source_for_repair_get_full_row_hashes_with_rpc_stream(uint32_t repair_meta_id, msg_addr id);
rpc::sink<repair_hash_with_cmd> make_sink_for_repair_get_full_row_hashes_with_rpc_stream(rpc::source<repair_stream_cmd>& source);
void register_repair_get_full_row_hashes_with_rpc_stream(std::function<future<rpc::sink<repair_hash_with_cmd>> (const rpc::client_info& cinfo, uint32_t repair_meta_id, rpc::source<repair_stream_cmd> source)>&& func);
future<> unregister_repair_get_full_row_hashes_with_rpc_stream();
void register_stream_mutation_done(std::function<future<> (const rpc::client_info& cinfo, UUID plan_id, dht::token_range_vector ranges, UUID cf_id, unsigned dst_cpu_id)>&& func);
future<> send_stream_mutation_done(msg_addr id, UUID plan_id, dht::token_range_vector ranges, UUID cf_id, unsigned dst_cpu_id);
future<> unregister_stream_mutation_done();
void register_complete_message(std::function<future<> (const rpc::client_info& cinfo, UUID plan_id, unsigned dst_cpu_id, rpc::optional<bool> failed)>&& func);
future<> send_complete_message(msg_addr id, UUID plan_id, unsigned dst_cpu_id, bool failed = false);
future<> unregister_complete_message();
// Wrapper for REPAIR_CHECKSUM_RANGE verb
void register_repair_checksum_range(std::function<future<partition_checksum> (sstring keyspace, sstring cf, dht::token_range range, rpc::optional<repair_checksum> hash_version)>&& func);

View File

@@ -195,6 +195,7 @@ class read_context : public reader_lifecycle_policy {
// One for each shard. Index is shard id.
std::vector<reader_meta> _readers;
std::vector<reader_concurrency_semaphore*> _semaphores;
gate _dismantling_gate;
@@ -211,7 +212,8 @@ public:
, _schema(std::move(s))
, _cmd(cmd)
, _ranges(ranges)
, _trace_state(std::move(trace_state)) {
, _trace_state(std::move(trace_state))
, _semaphores(smp::count, nullptr) {
_readers.resize(smp::count);
}
@@ -236,7 +238,12 @@ public:
virtual void destroy_reader(shard_id shard, future<stopped_reader> reader_fut) noexcept override;
virtual reader_concurrency_semaphore& semaphore() override {
return _readers[engine().cpu_id()].rparts->semaphore;
const auto shard = engine().cpu_id();
if (!_semaphores[shard]) {
auto& table = _db.local().find_column_family(_schema);
_semaphores[shard] = &table.read_concurrency_semaphore();
}
return *_semaphores[shard];
}
future<> lookup_readers();
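
The `semaphore()` change above replaces a lookup through `_readers[...]` with a per-shard pointer cache that is filled on first use. A minimal standard-library sketch of that lazy-cache pattern follows. The names `fake_semaphore` and `lazy_lookup` are hypothetical stand-ins, not part of the code base, and the sketch ignores sharding and threading entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a per-shard semaphore object.
struct fake_semaphore { int id; };

class lazy_lookup {
    std::vector<fake_semaphore>& _tables;  // stand-in for the per-shard table lookup
    std::vector<fake_semaphore*> _cache;   // one slot per shard, nullptr until first use
    std::size_t _lookups = 0;
public:
    explicit lazy_lookup(std::vector<fake_semaphore>& tables)
        : _tables(tables), _cache(tables.size(), nullptr) {}

    // Resolves the semaphore for `shard` once and reuses the pointer afterwards.
    fake_semaphore& get(std::size_t shard) {
        if (!_cache[shard]) {
            ++_lookups;
            _cache[shard] = &_tables[shard];
        }
        return *_cache[shard];
    }

    std::size_t lookups() const { return _lookups; }
};
```

Calling `get(1)` twice on the same object performs the underlying lookup only once; the second call returns the cached pointer.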

View File

@@ -1721,7 +1721,7 @@ void row::apply_monotonically(const schema& s, column_kind kind, row&& other) {
// we erase the live cells according to the shadowable_tombstone rules.
static bool dead_marker_shadows_row(const schema& s, column_kind kind, const row_marker& marker) {
return s.is_view()
&& !s.view_info()->base_non_pk_columns_in_view_pk().empty()
&& s.view_info()->has_base_non_pk_columns_in_view_pk()
&& !marker.is_live()
&& kind == column_kind::regular_column; // not applicable to static rows
}

View File

@@ -113,9 +113,6 @@ class reconcilable_result_builder {
const schema& _schema;
const query::partition_slice& _slice;
utils::chunked_vector<partition> _result;
uint32_t _live_rows{};
bool _return_static_content_on_partition_with_no_rows{};
bool _static_row_is_alive{};
uint32_t _total_live_rows = 0;
@@ -123,6 +120,10 @@ class reconcilable_result_builder {
stop_iteration _stop;
bool _short_read_allowed;
std::optional<streamed_mutation_freezer> _mutation_consumer;
uint32_t _live_rows{};
// make this the last member so it is destroyed first. #7240
utils::chunked_vector<partition> _result;
public:
reconcilable_result_builder(const schema& s, const query::partition_slice& slice,
query::result_memory_accounter&& accounter)
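
The comment "make this the last member so it is destroyed first" relies on the C++ rule that non-static data members are destroyed in reverse order of declaration. The sketch below only demonstrates that rule. `tracer` and `builder` are hypothetical types, and the member names merely echo the hunk above.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Records its name in a shared log when destroyed.
struct tracer {
    std::string name;
    std::vector<std::string>* log;
    ~tracer() { log->push_back(name); }
};

// Hypothetical stand-in for a builder: `result` is declared last,
// so it is destroyed before `accounter`.
struct builder {
    tracer accounter;
    tracer result;
};

// Returns the order in which the members of `builder` were destroyed.
std::vector<std::string> destruction_order() {
    std::vector<std::string> log;
    {
        builder b{{"accounter", &log}, {"result", &log}};
        (void)b;
    }
    return log;
}
```

`destruction_order()` yields `result` followed by `accounter`, which is why moving a member to the end of the class changes when it is torn down relative to its siblings.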

File diff suppressed because it is too large

View File

@@ -372,6 +372,64 @@ flat_mutation_reader make_foreign_reader(schema_ptr schema,
foreign_ptr<std::unique_ptr<flat_mutation_reader>> reader,
streamed_mutation::forwarding fwd_sm = streamed_mutation::forwarding::no);
/// Make an auto-paused evictable reader.
///
/// The reader is paused after each use, that is after each call to any of its
/// members that cause actual reading to be done (`fill_buffer()` and
/// `fast_forward_to()`). When paused, the reader is made evictable, that is,
/// it is registered with the reader concurrency semaphore as an inactive read.
/// The reader is resumed automatically on the next use. If it was evicted, it
/// will be recreated at the position it left off reading. This is all
/// transparent to its user.
/// Parameters passed by reference have to be kept alive while the reader is
/// alive.
flat_mutation_reader make_auto_paused_evictable_reader(
mutation_source ms,
schema_ptr schema,
reader_concurrency_semaphore& semaphore,
const dht::partition_range& pr,
const query::partition_slice& ps,
const io_priority_class& pc,
tracing::trace_state_ptr trace_state,
mutation_reader::forwarding fwd_mr);
class evictable_reader;
class evictable_reader_handle {
friend std::pair<flat_mutation_reader, evictable_reader_handle> make_manually_paused_evictable_reader(mutation_source, schema_ptr, reader_concurrency_semaphore&,
const dht::partition_range&, const query::partition_slice&, const io_priority_class&, tracing::trace_state_ptr, mutation_reader::forwarding);
private:
evictable_reader* _r;
private:
explicit evictable_reader_handle(evictable_reader& r);
public:
void pause();
};
/// Make a manually-paused evictable reader.
///
/// The reader can be paused via the evictable reader handle when desired. The
/// intended usage is subsequent reads done in bursts, after which the reader is
/// not used for some time. When paused, the reader is made evictable, that is,
/// it is registered with reader concurrency semaphore as an inactive read.
/// The reader is resumed automatically on the next use. If it was evicted, it
/// will be recreated at the position it left off reading. This is all
/// transparent to its user.
/// Parameters passed by reference have to be kept alive while the reader is
/// alive.
std::pair<flat_mutation_reader, evictable_reader_handle> make_manually_paused_evictable_reader(
mutation_source ms,
schema_ptr schema,
reader_concurrency_semaphore& semaphore,
const dht::partition_range& pr,
const query::partition_slice& ps,
const io_priority_class& pc,
tracing::trace_state_ptr trace_state,
mutation_reader::forwarding fwd_mr);
/// Reader lifecycle policy for the multishard combining reader.
///
/// This policy is expected to make sure any additional resource the readers

View File

@@ -163,6 +163,11 @@ public:
return {partition_region::clustered, bound_weight::before_all_prefixed, &ck};
}
// Returns a view to before_key(pos._ck) if pos.is_clustering_row() else returns pos as-is.
static position_in_partition_view before_key(position_in_partition_view pos) {
return {partition_region::clustered, pos._bound_weight == bound_weight::equal ? bound_weight::before_all_prefixed : pos._bound_weight, pos._ck};
}
partition_region region() const { return _type; }
bound_weight get_bound_weight() const { return _bound_weight; }
bool is_partition_start() const { return _type == partition_region::partition_start; }
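
The `before_key()` helper above maps a position that sits exactly on a key to the position just before that key, and leaves other bound weights unchanged. A simplified sketch of the idea, assuming integer keys and a three-valued weight, is shown below. The types `weight` and `position` and the comparator `less` are hypothetical and much simpler than `position_in_partition_view`; the sketch also omits the partition region, which the real helper always sets to `clustered`.

```cpp
#include <cassert>

// Hypothetical, simplified bound weight: positions with the same key are
// ordered before < equal < after.
enum class weight { before = -1, equal = 0, after = 1 };

struct position {
    int key;
    weight w;
};

// If `p` sits exactly on a key, return the position just before that key;
// otherwise return `p` unchanged.
position before_key(position p) {
    return {p.key, p.w == weight::equal ? weight::before : p.w};
}

// Orders positions by key first, then by weight.
bool less(position a, position b) {
    if (a.key != b.key) {
        return a.key < b.key;
    }
    return static_cast<int>(a.w) < static_cast<int>(b.w);
}
```

With these definitions, `before_key({5, weight::equal})` sorts strictly before `{5, weight::equal}`, while `before_key({5, weight::after})` is returned as-is.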

View File

@@ -27,6 +27,7 @@
reader_permit::impl::impl(reader_concurrency_semaphore& semaphore, reader_resources base_cost) : semaphore(semaphore), base_cost(base_cost) {
semaphore.consume(base_cost);
}
reader_permit::impl::~impl() {
@@ -88,7 +89,6 @@ void reader_concurrency_semaphore::signal(const resources& r) noexcept {
_resources += r;
while (!_wait_list.empty() && has_available_units(_wait_list.front().res)) {
auto& x = _wait_list.front();
_resources -= x.res;
try {
x.pr.set_value(reader_permit(*this, x.res));
} catch (...) {
@@ -160,7 +160,6 @@ future<reader_permit> reader_concurrency_semaphore::wait_admission(size_t memory
--_inactive_read_stats.population;
}
if (may_proceed(r)) {
_resources -= r;
return make_ready_future<reader_permit>(reader_permit(*this, r));
}
promise<reader_permit> pr;
@@ -170,7 +169,6 @@ future<reader_permit> reader_concurrency_semaphore::wait_admission(size_t memory
}
reader_permit reader_concurrency_semaphore::consume_resources(resources r) {
_resources -= r;
return reader_permit(*this, r);
}

View File

@@ -128,6 +128,10 @@ private:
return has_available_units(r) && _wait_list.empty();
}
void consume(resources r) {
_resources -= r;
}
void consume_memory(size_t memory) {
_resources.memory -= memory;
}
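
The two hunks above move the `_resources -= r` bookkeeping out of the call sites and into the permit's constructor, via the new `consume()` helper. The underlying pattern is a RAII guard that takes its units on construction and returns them on destruction. A minimal sketch under simplified assumptions: `simple_semaphore` and `simple_permit` are hypothetical, use a plain integer counter, and have no waiters.

```cpp
#include <cassert>
#include <utility>

// Hypothetical, simplified stand-in for a semaphore's unit counter.
struct simple_semaphore {
    int available;
    void consume(int n) { available -= n; }
    void signal(int n) { available += n; }
};

// RAII permit: takes its units in the constructor, returns them in the destructor.
class simple_permit {
    simple_semaphore* _sem;
    int _units;
public:
    simple_permit(simple_semaphore& sem, int units) : _sem(&sem), _units(units) {
        _sem->consume(_units);
    }
    // A moved-from permit no longer owns any units.
    simple_permit(simple_permit&& o) noexcept
        : _sem(std::exchange(o._sem, nullptr)), _units(o._units) {}
    simple_permit(const simple_permit&) = delete;
    simple_permit& operator=(const simple_permit&) = delete;
    ~simple_permit() {
        if (_sem) {
            _sem->signal(_units);
        }
    }
};

// Returns the number of units still available while a permit is held.
int available_while_held(simple_semaphore& sem, int units) {
    simple_permit p(sem, units);
    return sem.available;
}
```

Because consumption and release live in the same class, a call site can no longer deduct units and then fail before the guard exists.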

View File

@@ -47,6 +47,7 @@
#include "gms/gossiper.hh"
#include "repair/row_level.hh"
#include "mutation_source_metadata.hh"
#include "utils/stall_free.hh"
extern logging::logger rlogger;
@@ -373,6 +374,7 @@ private:
std::optional<utils::phased_barrier::operation> _local_read_op;
// Local reader or multishard reader to read the range
flat_mutation_reader _reader;
std::optional<evictable_reader_handle> _reader_handle;
// Current partition read from disk
lw_shared_ptr<const decorated_key_with_hash> _current_dk;
@@ -392,32 +394,49 @@ public:
, _sharder(remote_partitioner, range, remote_shard)
, _seed(seed)
, _local_read_op(local_reader ? std::optional(cf.read_in_progress()) : std::nullopt)
, _reader(make_reader(db, cf, local_reader)) {
}
private:
flat_mutation_reader
make_reader(seastar::sharded<database>& db,
column_family& cf,
is_local_reader local_reader) {
, _reader(nullptr) {
if (local_reader) {
return cf.make_streaming_reader(_schema, _range);
auto ms = mutation_source([&cf] (
schema_ptr s,
reader_permit,
const dht::partition_range& pr,
const query::partition_slice& ps,
const io_priority_class& pc,
tracing::trace_state_ptr,
streamed_mutation::forwarding,
mutation_reader::forwarding fwd_mr) {
return cf.make_streaming_reader(std::move(s), pr, ps, fwd_mr);
});
std::tie(_reader, _reader_handle) = make_manually_paused_evictable_reader(
std::move(ms),
_schema,
cf.streaming_read_concurrency_semaphore(),
_range,
_schema->full_slice(),
service::get_local_streaming_read_priority(),
{},
mutation_reader::forwarding::no);
} else {
_reader = make_multishard_streaming_reader(db, _schema, [this] {
auto shard_range = _sharder.next();
if (shard_range) {
return std::optional<dht::partition_range>(dht::to_partition_range(*shard_range));
}
return std::optional<dht::partition_range>();
});
}
return make_multishard_streaming_reader(db, _schema, [this] {
auto shard_range = _sharder.next();
if (shard_range) {
return std::optional<dht::partition_range>(dht::to_partition_range(*shard_range));
}
return std::optional<dht::partition_range>();
});
}
public:
future<mutation_fragment_opt>
read_mutation_fragment() {
return _reader(db::no_timeout);
}
void on_end_of_stream() {
_reader = make_empty_flat_reader(_schema);
_reader_handle.reset();
}
lw_shared_ptr<const decorated_key_with_hash>& get_current_dk() {
return _current_dk;
}
@@ -436,6 +455,11 @@ public:
}
}
void pause() {
if (_reader_handle) {
_reader_handle->pause();
}
}
};
class repair_writer {
@@ -1019,11 +1043,7 @@ private:
return repair_hash(h.finalize_uint64());
}
stop_iteration handle_mutation_fragment(mutation_fragment_opt mfopt, size_t& cur_size, size_t& new_rows_size, std::list<repair_row>& cur_rows) {
if (!mfopt) {
return stop_iteration::yes;
}
mutation_fragment& mf = *mfopt;
stop_iteration handle_mutation_fragment(mutation_fragment& mf, size_t& cur_size, size_t& new_rows_size, std::list<repair_row>& cur_rows) {
if (mf.is_partition_start()) {
auto& start = mf.as_partition_start();
_repair_reader.set_current_dk(start.key());
@@ -1058,32 +1078,49 @@ private:
}
_gate.check();
return _repair_reader.read_mutation_fragment().then([this, &cur_size, &new_rows_size, &cur_rows] (mutation_fragment_opt mfopt) mutable {
return handle_mutation_fragment(std::move(mfopt), cur_size, new_rows_size, cur_rows);
if (!mfopt) {
_repair_reader.on_end_of_stream();
return stop_iteration::yes;
}
return handle_mutation_fragment(*mfopt, cur_size, new_rows_size, cur_rows);
});
}).then([&cur_rows, &new_rows_size] () mutable {
}).then_wrapped([this, &cur_rows, &new_rows_size] (future<> fut) mutable {
if (fut.failed()) {
_repair_reader.on_end_of_stream();
return make_exception_future<std::list<repair_row>, size_t>(fut.get_exception());
}
_repair_reader.pause();
return make_ready_future<std::list<repair_row>, size_t>(std::move(cur_rows), new_rows_size);
});
});
}
future<> clear_row_buf() {
return utils::clear_gently(_row_buf);
}
future<> clear_working_row_buf() {
return utils::clear_gently(_working_row_buf).then([this] {
_working_row_buf_combined_hash.clear();
});
}
// Read rows from disk until _max_row_buf_size of rows are filled into _row_buf.
// Calculate the combined checksum of the rows
// Calculate the total size of the rows in _row_buf
future<get_sync_boundary_response>
get_sync_boundary(std::optional<repair_sync_boundary> skipped_sync_boundary) {
auto f = make_ready_future<>();
if (skipped_sync_boundary) {
_current_sync_boundary = skipped_sync_boundary;
_row_buf.clear();
_working_row_buf.clear();
_working_row_buf_combined_hash.clear();
} else {
_working_row_buf.clear();
_working_row_buf_combined_hash.clear();
f = clear_row_buf();
}
// Here is the place we update _last_sync_boundary
rlogger.trace("SET _last_sync_boundary from {} to {}", _last_sync_boundary, _current_sync_boundary);
_last_sync_boundary = _current_sync_boundary;
return row_buf_size().then([this, sb = std::move(skipped_sync_boundary)] (size_t cur_size) {
return f.then([this, sb = std::move(skipped_sync_boundary)] () mutable {
return clear_working_row_buf().then([this, sb = sb] () mutable {
return row_buf_size().then([this, sb = std::move(sb)] (size_t cur_size) {
return read_rows_from_disk(cur_size).then([this, sb = std::move(sb)] (std::list<repair_row> new_rows, size_t new_rows_size) mutable {
size_t new_rows_nr = new_rows.size();
_row_buf.splice(_row_buf.end(), new_rows);
@@ -1100,6 +1137,8 @@ private:
});
});
});
});
});
}
future<> move_row_buf_to_working_row_buf() {
@@ -1203,19 +1242,28 @@ private:
}
}
future<> do_apply_rows(std::list<repair_row>& row_diff, unsigned node_idx, update_working_row_buf update_buf) {
return with_semaphore(_repair_writer.sem(), 1, [this, node_idx, update_buf, &row_diff] {
_repair_writer.create_writer(_db, node_idx);
return do_for_each(row_diff, [this, node_idx, update_buf] (repair_row& r) {
if (update_buf) {
_working_row_buf_combined_hash.add(r.hash());
}
// The repair_row here is supposed to have
// mutation_fragment attached because we have stored it in
// to_repair_rows_list above where the repair_row is created.
mutation_fragment mf = std::move(r.get_mutation_fragment());
auto dk_with_hash = r.get_dk_with_hash();
return _repair_writer.do_write(node_idx, std::move(dk_with_hash), std::move(mf));
future<> do_apply_rows(std::list<repair_row>&& row_diff, unsigned node_idx, update_working_row_buf update_buf) {
return do_with(std::move(row_diff), [this, node_idx, update_buf] (std::list<repair_row>& row_diff) {
return with_semaphore(_repair_writer.sem(), 1, [this, node_idx, update_buf, &row_diff] {
_repair_writer.create_writer(_db, node_idx);
return repeat([this, node_idx, update_buf, &row_diff] () mutable {
if (row_diff.empty()) {
return make_ready_future<stop_iteration>(stop_iteration::yes);
}
repair_row& r = row_diff.front();
if (update_buf) {
_working_row_buf_combined_hash.add(r.hash());
}
// The repair_row here is supposed to have
// mutation_fragment attached because we have stored it in
// to_repair_rows_list above where the repair_row is created.
mutation_fragment mf = std::move(r.get_mutation_fragment());
auto dk_with_hash = r.get_dk_with_hash();
return _repair_writer.do_write(node_idx, std::move(dk_with_hash), std::move(mf)).then([&row_diff] {
row_diff.pop_front();
return make_ready_future<stop_iteration>(stop_iteration::no);
});
});
});
});
}
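
The rewritten `do_apply_rows()` takes ownership of the row list and pops each row from the front once it has been written, so the list shrinks as the write progresses. A synchronous sketch of that drain-from-the-front loop follows. It uses only the standard library, and the name `drain_front` and the `sink` vector are hypothetical stand-ins for the writer.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <utility>
#include <vector>

// Takes ownership of `rows`, moves each row into `sink`, and frees the list
// node right after the row has been written. Returns the number of rows written.
std::size_t drain_front(std::list<std::string>&& rows, std::vector<std::string>& sink) {
    std::list<std::string> owned = std::move(rows);
    std::size_t written = 0;
    while (!owned.empty()) {
        sink.push_back(std::move(owned.front()));
        owned.pop_front();  // release this row before handling the next one
        ++written;
    }
    return written;
}
```

In the asynchronous original, the `pop_front()` runs in the continuation of `do_write()`, so a row is only dropped after its write has completed.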
@@ -1233,19 +1281,17 @@ private:
stats().rx_row_nr += row_diff.size();
stats().rx_row_nr_peer[from] += row_diff.size();
if (update_buf) {
std::list<repair_row> tmp;
tmp.swap(_working_row_buf);
// Both row_diff and _working_row_buf are ordered; merge the
// two sorted lists so that the combination of row_diff
// and _working_row_buf stays ordered.
std::merge(tmp.begin(), tmp.end(), row_diff.begin(), row_diff.end(), std::back_inserter(_working_row_buf),
[this] (const repair_row& x, const repair_row& y) { thread::maybe_yield(); return _cmp(x.boundary(), y.boundary()) < 0; });
utils::merge_to_gently(_working_row_buf, row_diff,
[this] (const repair_row& x, const repair_row& y) { return _cmp(x.boundary(), y.boundary()) < 0; });
}
if (update_hash_set) {
_peer_row_hash_sets[node_idx] = boost::copy_range<repair_hash_set>(row_diff |
boost::adaptors::transformed([] (repair_row& r) { thread::maybe_yield(); return r.hash(); }));
}
do_apply_rows(row_diff, node_idx, update_buf).get();
do_apply_rows(std::move(row_diff), node_idx, update_buf).get();
}
future<>
@@ -1253,11 +1299,9 @@ private:
if (rows.empty()) {
return make_ready_future<>();
}
return to_repair_rows_list(rows).then([this] (std::list<repair_row> row_diff) {
return do_with(std::move(row_diff), [this] (std::list<repair_row>& row_diff) {
unsigned node_idx = 0;
return do_apply_rows(row_diff, node_idx, update_working_row_buf::no);
});
return to_repair_rows_list(std::move(rows)).then([this] (std::list<repair_row> row_diff) {
unsigned node_idx = 0;
return do_apply_rows(std::move(row_diff), node_idx, update_working_row_buf::no);
});
}
@@ -2168,6 +2212,25 @@ future<> repair_init_messaging_service_handler(repair_service& rs, distributed<d
});
}
future<> repair_uninit_messaging_service_handler() {
return netw::get_messaging_service().invoke_on_all([] (auto& ms) {
return when_all_succeed(
ms.unregister_repair_get_row_diff_with_rpc_stream(),
ms.unregister_repair_put_row_diff_with_rpc_stream(),
ms.unregister_repair_get_full_row_hashes_with_rpc_stream(),
ms.unregister_repair_get_full_row_hashes(),
ms.unregister_repair_get_combined_row_hash(),
ms.unregister_repair_get_sync_boundary(),
ms.unregister_repair_get_row_diff(),
ms.unregister_repair_put_row_diff(),
ms.unregister_repair_row_level_start(),
ms.unregister_repair_row_level_stop(),
ms.unregister_repair_get_estimated_partitions(),
ms.unregister_repair_set_estimated_partitions(),
ms.unregister_repair_get_diff_algorithms()).discard_result();
});
}
class row_level_repair {
repair_info& _ri;
sstring _cf_name;

View File

@@ -45,6 +45,7 @@ private:
};
future<> repair_init_messaging_service_handler(repair_service& rs, distributed<db::system_distributed_keyspace>& sys_dist_ks, distributed<db::view::view_update_generator>& view_update_generator);
future<> repair_uninit_messaging_service_handler();
class repair_info;

View File

@@ -42,6 +42,8 @@
constexpr int32_t schema::NAME_LENGTH;
extern logging::logger dblog;
sstring to_sstring(column_kind k) {
switch (k) {
case column_kind::partition_key: return "PARTITION_KEY";
@@ -575,11 +577,15 @@ schema::get_column_definition(const bytes& name) const {
const column_definition&
schema::column_at(column_kind kind, column_id id) const {
return _raw._columns.at(column_offset(kind) + id);
return column_at(static_cast<ordinal_column_id>(column_offset(kind) + id));
}
const column_definition&
schema::column_at(ordinal_column_id ordinal_id) const {
if (size_t(ordinal_id) >= _raw._columns.size()) {
on_internal_error(dblog, format("{}.{}@{}: column id {:d} >= {:d}",
ks_name(), cf_name(), version(), size_t(ordinal_id), _raw._columns.size()));
}
return _raw._columns.at(static_cast<column_count_type>(ordinal_id));
}
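
The new `column_at()` checks the index itself and reports the schema name, version and sizes before it touches the container, instead of relying on the generic failure from `at()`. The sketch below shows the same check-then-report shape. It substitutes a plain `std::out_of_range` for `on_internal_error`, and the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Returns the column at `id`, or throws with a message that names the table
// and both sides of the failed comparison.
const std::string& column_at(const std::vector<std::string>& columns,
                             std::size_t id,
                             const std::string& table) {
    if (id >= columns.size()) {
        throw std::out_of_range(table + ": column id " + std::to_string(id) +
                                " >= " + std::to_string(columns.size()));
    }
    return columns[id];
}

// Returns the column name on success, or the error text on failure.
std::string describe_lookup(const std::vector<std::string>& columns,
                            std::size_t id,
                            const std::string& table) {
    try {
        return column_at(columns, id, table);
    } catch (const std::out_of_range& e) {
        return e.what();
    }
}
```

An out-of-range lookup now produces a message that identifies which table and which index were involved, which is the diagnostic benefit the hunk is after.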

View File

@@ -596,7 +596,7 @@ def current_shard():
def find_db(shard=None):
if not shard:
if shard is None:
shard = current_shard()
return gdb.parse_and_eval('::debug::db')['_instances']['_M_impl']['_M_start'][shard]['service']['_p']

Submodule seastar updated: 861b7edd61...748428930a

View File

@@ -92,7 +92,7 @@ void migration_manager::init_messaging_service()
//FIXME: future discarded.
(void)with_gate(_background_tasks, [this] {
mlogger.debug("features changed, recalculating schema version");
return update_schema_version_and_announce(get_storage_proxy(), _feat.cluster_schema_features());
return db::schema_tables::recalculate_schema_version(get_storage_proxy(), _feat);
});
};
@@ -277,9 +277,9 @@ future<> migration_manager::maybe_schedule_schema_pull(const utils::UUID& their_
}).finally([me = shared_from_this()] {});
}
future<> migration_manager::submit_migration_task(const gms::inet_address& endpoint)
future<> migration_manager::submit_migration_task(const gms::inet_address& endpoint, bool can_ignore_down_node)
{
return service::migration_task::run_may_throw(endpoint);
return service::migration_task::run_may_throw(endpoint, can_ignore_down_node);
}
future<> migration_manager::do_merge_schema_from(netw::messaging_service::msg_addr id)
@@ -1132,7 +1132,8 @@ future<> migration_manager::sync_schema(const database& db, const std::vector<gm
}).then([this, &schema_map] {
return parallel_for_each(schema_map, [this] (auto& x) {
mlogger.debug("Pulling schema {} from {}", x.first, x.second.front());
return submit_migration_task(x.second.front());
bool can_ignore_down_node = false;
return submit_migration_task(x.second.front(), can_ignore_down_node);
});
});
});

View File

@@ -82,7 +82,7 @@ public:
future<> maybe_schedule_schema_pull(const utils::UUID& their_version, const gms::inet_address& endpoint);
future<> submit_migration_task(const gms::inet_address& endpoint);
future<> submit_migration_task(const gms::inet_address& endpoint, bool can_ignore_down_node = true);
// Makes sure that this node knows about all schema changes known by "nodes" that were made prior to this call.
future<> sync_schema(const database& db, const std::vector<gms::inet_address>& nodes);

View File

@@ -51,11 +51,12 @@ namespace service {
static logging::logger mlogger("migration_task");
future<> migration_task::run_may_throw(const gms::inet_address& endpoint)
future<> migration_task::run_may_throw(const gms::inet_address& endpoint, bool can_ignore_down_node)
{
if (!gms::get_local_gossiper().is_alive(endpoint)) {
mlogger.warn("Can't send migration request: node {} is down.", endpoint);
return make_ready_future<>();
auto msg = format("Can't send migration request: node {} is down.", endpoint);
mlogger.warn("{}", msg);
return can_ignore_down_node ? make_ready_future<>() : make_exception_future<>(std::runtime_error(msg));
}
netw::messaging_service::msg_addr id{endpoint, 0};
return service::get_local_migration_manager().merge_schema_from(id).handle_exception([](std::exception_ptr e) {
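
The `can_ignore_down_node` flag above decides whether an unreachable endpoint is silently skipped or reported as a failure to the caller. A minimal synchronous sketch of that decision follows. It returns an optional error string instead of a future, and `try_pull_schema` with its `node_alive` parameter is a hypothetical helper.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Returns std::nullopt when the pull may proceed or the down node may be
// ignored; returns an error message when the down node must be reported.
std::optional<std::string> try_pull_schema(bool node_alive,
                                           bool can_ignore_down_node,
                                           const std::string& endpoint) {
    if (!node_alive) {
        std::string msg = "Can't send migration request: node " + endpoint + " is down.";
        if (can_ignore_down_node) {
            return std::nullopt;
        }
        return msg;
    }
    return std::nullopt;
}
```

In the diff, `sync_schema()` passes `false` so that a down node surfaces as an error, while other callers keep the default of `true` declared in the header.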

View File

@@ -47,7 +47,7 @@ namespace service {
class migration_task {
public:
static future<> run_may_throw(const gms::inet_address& endpoint);
static future<> run_may_throw(const gms::inet_address& endpoint, bool can_ignore_down_node);
};
}

View File

@@ -334,6 +334,7 @@ public:
if (_handler) {
_handler->prune(_proposal->ballot);
}
_proposal.release();
}
};
@@ -5027,14 +5028,15 @@ void storage_proxy::init_messaging_service() {
}
pruning++;
auto d = defer([] { pruning--; });
return get_schema_for_read(schema_id, src_addr).then([this, key = std::move(key), ballot,
timeout, tr_state = std::move(tr_state), src_ip] (schema_ptr schema) mutable {
timeout, tr_state = std::move(tr_state), src_ip, d = std::move(d)] (schema_ptr schema) mutable {
dht::token token = dht::get_token(*schema, key);
unsigned shard = dht::shard_of(*schema, token);
bool local = shard == engine().cpu_id();
get_stats().replica_cross_shard_ops += !local;
return smp::submit_to(shard, _write_smp_service_group, [gs = global_schema_ptr(schema), gt = tracing::global_trace_state_ptr(std::move(tr_state)),
local, key = std::move(key), ballot, timeout, src_ip, d = defer([] { pruning--; })] () {
local, key = std::move(key), ballot, timeout, src_ip, d = std::move(d)] () {
tracing::trace_state_ptr tr_state = gt;
return paxos::paxos_state::prune(gs, key, ballot, *timeout, tr_state).then([src_ip, tr_state] () {
tracing::trace(tr_state, "paxos_prune: handling is done, sending a response to /{}", src_ip);
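
The hunk above replaces a second `defer(...)` in the inner lambda with `std::move(d)`, so a single deferred action travels with the continuations and runs once. The sketch below illustrates that ownership transfer with a hand-written guard. `deferred` is a hypothetical simplification, not Seastar's `defer`, and the "continuation" is just a local lambda.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Minimal move-only scope guard: runs its action on destruction unless it
// has been moved from.
class deferred {
    std::function<void()> _fn;
public:
    explicit deferred(std::function<void()> fn) : _fn(std::move(fn)) {}
    deferred(deferred&& o) : _fn(std::exchange(o._fn, nullptr)) {}
    deferred(const deferred&) = delete;
    deferred& operator=(const deferred&) = delete;
    ~deferred() {
        if (_fn) {
            _fn();
        }
    }
};

// Returns the counter value (first) while the continuation still holds the
// guard and (second) after the continuation has been destroyed.
std::pair<int, int> run_with_guard() {
    int in_flight = 0;
    ++in_flight;
    deferred d([&in_flight] { --in_flight; });
    int while_pending = 0;
    {
        // The continuation owns the guard now; `d` is left empty.
        auto continuation = [g = std::move(d)] {};
        continuation();
        while_pending = in_flight;
    }
    return {while_pending, in_flight};
}
```

The counter stays at 1 for as long as the continuation is alive and drops to 0 exactly once when it is destroyed; the moved-from `d` does nothing at scope exit.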
@@ -5048,18 +5050,22 @@ void storage_proxy::init_messaging_service() {
future<> storage_proxy::uninit_messaging_service() {
auto& ms = netw::get_local_messaging_service();
return when_all_succeed(
ms.unregister_counter_mutation(),
ms.unregister_mutation(),
ms.unregister_hint_mutation(),
ms.unregister_mutation_done(),
ms.unregister_mutation_failed(),
ms.unregister_read_data(),
ms.unregister_read_mutation_data(),
ms.unregister_read_digest(),
ms.unregister_truncate(),
ms.unregister_get_schema_version(),
ms.unregister_paxos_prepare(),
ms.unregister_paxos_accept(),
ms.unregister_paxos_learn(),
ms.unregister_paxos_prune()
);
}
future<rpc::tuple<foreign_ptr<lw_shared_ptr<reconcilable_result>>, cache_temperature>>
@@ -5152,8 +5158,7 @@ future<> storage_proxy::drain_on_shutdown() {
future<>
storage_proxy::stop() {
// FIXME: hints manager should be stopped here but it seems like this function is never called
return uninit_messaging_service();
return make_ready_future<>();
}
}


@@ -298,7 +298,6 @@ private:
cdc::cdc_service* _cdc = nullptr;
cdc_stats _cdc_stats;
private:
future<> uninit_messaging_service();
future<coordinator_query_result> query_singular(lw_shared_ptr<query::read_command> cmd,
dht::partition_range_vector&& partition_ranges,
db::consistency_level cl,
@@ -452,6 +451,7 @@ public:
return next;
}
void init_messaging_service();
future<> uninit_messaging_service();
// Applies mutation on this node.
// Resolves with timed_out_error when timeout is reached.


@@ -218,7 +218,7 @@ std::vector<sstables::shared_sstable> compaction_manager::get_candidates(const c
auto& cs = cf.get_compaction_strategy();
// Filter out sstables that are being compacted.
for (auto& sst : cf.candidates_for_compaction()) {
for (auto& sst : cf.non_staging_sstables()) {
if (_compacting_sstables.count(sst)) {
continue;
}
@@ -646,8 +646,8 @@ future<> compaction_manager::rewrite_sstables(column_family* cf, sstables::compa
return task->compaction_done.get_future().then([task] {});
}
static bool needs_cleanup(const sstables::shared_sstable& sst,
const dht::token_range_vector& owned_ranges,
bool needs_cleanup(const sstables::shared_sstable& sst,
const dht::token_range_vector& sorted_owned_ranges,
schema_ptr s) {
auto first = sst->get_first_partition_key();
auto last = sst->get_last_partition_key();
@@ -655,29 +655,40 @@ static bool needs_cleanup(const sstables::shared_sstable& sst,
auto last_token = dht::get_token(*s, last);
dht::token_range sst_token_range = dht::token_range::make(first_token, last_token);
auto r = std::lower_bound(sorted_owned_ranges.begin(), sorted_owned_ranges.end(), first_token,
[] (const range<dht::token>& a, const dht::token& b) {
// check that range a is before token b.
return a.after(b, dht::token_comparator());
});
// return true iff sst partition range isn't fully contained in any of the owned ranges.
for (auto& r : owned_ranges) {
if (r.contains(sst_token_range, dht::token_comparator())) {
if (r != sorted_owned_ranges.end()) {
if (r->contains(sst_token_range, dht::token_comparator())) {
return false;
}
}
return true;
}
future<> compaction_manager::perform_cleanup(column_family* cf) {
future<> compaction_manager::perform_cleanup(database& db, column_family* cf) {
if (check_for_cleanup(cf)) {
throw std::runtime_error(format("cleanup request failed: there is an ongoing cleanup on {}.{}",
cf->schema()->ks_name(), cf->schema()->cf_name()));
}
return rewrite_sstables(cf, sstables::compaction_options::make_cleanup(), [this] (const table& table) {
auto schema = table.schema();
auto owned_ranges = service::get_local_storage_service().get_local_ranges(schema->ks_name());
return seastar::async([this, cf, &db] {
auto schema = cf->schema();
auto& rs = db.find_keyspace(schema->ks_name()).get_replication_strategy();
auto sorted_owned_ranges = rs.get_ranges_in_thread(utils::fb_utilities::get_broadcast_address());
auto sstables = std::vector<sstables::shared_sstable>{};
const auto candidates = table.candidates_for_compaction();
std::copy_if(candidates.begin(), candidates.end(), std::back_inserter(sstables), [&owned_ranges, schema] (const sstables::shared_sstable& sst) {
return owned_ranges.empty() || needs_cleanup(sst, owned_ranges, schema);
const auto candidates = get_candidates(*cf);
std::copy_if(candidates.begin(), candidates.end(), std::back_inserter(sstables), [&sorted_owned_ranges, schema] (const sstables::shared_sstable& sst) {
seastar::thread::maybe_yield();
return sorted_owned_ranges.empty() || needs_cleanup(sst, sorted_owned_ranges, schema);
});
return sstables;
}).then([this, cf] (std::vector<sstables::shared_sstable> sstables) {
return rewrite_sstables(cf, sstables::compaction_options::make_cleanup(),
[sstables = std::move(sstables)] (const table&) { return sstables; });
});
}
@@ -692,7 +703,7 @@ future<> compaction_manager::perform_sstable_upgrade(column_family* cf, bool exc
return cf->run_with_compaction_disabled([this, cf, &tables, exclude_current_version] {
auto last_version = cf->get_sstables_manager().get_highest_supported_format();
for (auto& sst : cf->candidates_for_compaction()) {
for (auto& sst : get_candidates(*cf)) {
// if we are a "normal" upgrade, we only care about
// tables with older versions, but potentially
// we are to actually rewrite everything. (-a)
@@ -717,8 +728,8 @@ future<> compaction_manager::perform_sstable_upgrade(column_family* cf, bool exc
// Submit a column family to be scrubbed and wait for its termination.
future<> compaction_manager::perform_sstable_scrub(column_family* cf, bool skip_corrupted) {
return rewrite_sstables(cf, sstables::compaction_options::make_scrub(skip_corrupted), [] (const table& cf) {
return cf.candidates_for_compaction();
return rewrite_sstables(cf, sstables::compaction_options::make_scrub(skip_corrupted), [this] (const table& cf) {
return get_candidates(cf);
});
}
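
Note on the `needs_cleanup` change in this file: the linear scan over all owned ranges is replaced by a single `std::lower_bound` lookup on `sorted_owned_ranges`, followed by one containment check. The idea can be sketched in Python as below. This is only an illustration under stated assumptions: tokens are plain integers, ranges are closed `(start, end)` pairs that are sorted and non-overlapping, and the Python function name merely mirrors the C++ one; the real code works on `dht::token_range` with `dht::token_comparator`.

```python
from bisect import bisect_left


def needs_cleanup(first, last, sorted_owned_ranges):
    """Return True unless [first, last] lies fully inside one owned range.

    Assumption of this sketch: sorted_owned_ranges is a list of closed
    (start, end) integer intervals, sorted and non-overlapping.
    """
    # A real caller would compute this list once, not on every call.
    ends = [end for _, end in sorted_owned_ranges]
    # Index of the first range whose end is not before `first`.
    i = bisect_left(ends, first)
    if i < len(sorted_owned_ranges):
        start, end = sorted_owned_ranges[i]
        if start <= first and last <= end:
            return False
    return True


owned = [(0, 10), (20, 30), (40, 50)]
inside = needs_cleanup(22, 28, owned)      # fully inside (20, 30)
straddling = needs_cleanup(8, 22, owned)   # spans two owned ranges
in_gap = needs_cleanup(12, 15, owned)      # falls between owned ranges
```

Because the ranges are sorted and disjoint, only the first range that ends at or after the sstable's first token can contain the sstable's whole token range, so one binary search is enough.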


@@ -175,7 +175,7 @@ public:
// Cleanup is about discarding keys that are no longer relevant for a
// given sstable, e.g. after node loses part of its token range because
// of a newly added node.
future<> perform_cleanup(column_family* cf);
future<> perform_cleanup(database& db, column_family* cf);
// Submit a column family to be upgraded and wait for its termination.
future<> perform_sstable_upgrade(column_family* cf, bool exclude_current_version);
@@ -243,3 +243,5 @@ public:
friend class compaction_weight_registration;
};
bool needs_cleanup(const sstables::shared_sstable& sst, const dht::token_range_vector& owned_ranges, schema_ptr s);


@@ -319,6 +319,15 @@ void stream_session::init_messaging_service_handler() {
});
}
future<> stream_session::uninit_messaging_service_handler() {
return when_all_succeed(
ms().unregister_prepare_message(),
ms().unregister_prepare_done_message(),
ms().unregister_stream_mutation_fragments(),
ms().unregister_stream_mutation_done(),
ms().unregister_complete_message()).discard_result();
}
distributed<database>* stream_session::_db;
distributed<db::system_distributed_keyspace>* stream_session::_sys_dist_ks;
distributed<db::view::view_update_generator>* stream_session::_view_update_generator;
@@ -342,9 +351,13 @@ future<> stream_session::init_streaming_service(distributed<database>& db, distr
// });
return get_stream_manager().start().then([] {
gms::get_local_gossiper().register_(get_local_stream_manager().shared_from_this());
return _db->invoke_on_all([] (auto& db) {
init_messaging_service_handler();
});
return smp::invoke_on_all([] { init_messaging_service_handler(); });
});
}
future<> stream_session::uninit_streaming_service() {
return smp::invoke_on_all([] {
return uninit_messaging_service_handler();
});
}


@@ -142,6 +142,7 @@ private:
using token = dht::token;
using ring_position = dht::ring_position;
static void init_messaging_service_handler();
static future<> uninit_messaging_service_handler();
static distributed<database>* _db;
static distributed<db::system_distributed_keyspace>* _sys_dist_ks;
static distributed<db::view::view_update_generator>* _view_update_generator;
@@ -152,6 +153,7 @@ public:
static database& get_local_db() { return _db->local(); }
static distributed<database>& get_db() { return *_db; };
static future<> init_streaming_service(distributed<database>& db, distributed<db::system_distributed_keyspace>& sys_dist_ks, distributed<db::view::view_update_generator>& view_update_generator);
static future<> uninit_streaming_service();
public:
/**
* Streaming endpoint.


@@ -1437,7 +1437,7 @@ future<std::unordered_set<sstring>> table::get_sstables_by_partition_key(const s
[this] (std::unordered_set<sstring>& filenames, lw_shared_ptr<sstables::sstable_set::incremental_selector>& sel, partition_key& pk) {
return do_with(dht::decorated_key(dht::decorate_key(*_schema, pk)),
[this, &filenames, &sel, &pk](dht::decorated_key& dk) mutable {
auto sst = sel->select(dk).sstables;
const auto& sst = sel->select(dk).sstables;
auto hk = sstables::sstable::make_hashed_key(*_schema, dk.key());
return do_for_each(sst, [this, &filenames, &dk, hk = std::move(hk)] (std::vector<sstables::shared_sstable>::const_iterator::reference s) mutable {
@@ -1466,7 +1466,7 @@ std::vector<sstables::shared_sstable> table::select_sstables(const dht::partitio
return _sstables->select(range);
}
std::vector<sstables::shared_sstable> table::candidates_for_compaction() const {
std::vector<sstables::shared_sstable> table::non_staging_sstables() const {
return boost::copy_range<std::vector<sstables::shared_sstable>>(*get_sstables()
| boost::adaptors::filtered([this] (auto& sst) {
return !_sstables_need_rewrite.count(sst->generation()) && !_sstables_staging.count(sst->generation());
@@ -2034,6 +2034,11 @@ void table::set_schema(schema_ptr s) {
}
_schema = std::move(s);
for (auto&& v : _views) {
v->view_info()->set_base_info(
v->view_info()->make_base_dependent_view_info(*_schema));
}
set_compaction_strategy(_schema->compaction_strategy());
trigger_compaction();
}
@@ -2045,7 +2050,8 @@ static std::vector<view_ptr>::iterator find_view(std::vector<view_ptr>& views, c
}
void table::add_or_update_view(view_ptr v) {
v->view_info()->initialize_base_dependent_fields(*schema());
v->view_info()->set_base_info(
v->view_info()->make_base_dependent_view_info(*_schema));
auto existing = find_view(_views, v);
if (existing != _views.end()) {
*existing = std::move(v);
@@ -2098,7 +2104,7 @@ static size_t memory_usage_of(const std::vector<frozen_mutation_and_schema>& ms)
* @return a future resolving to the mutations to apply to the views, which can be empty.
*/
future<> table::generate_and_propagate_view_updates(const schema_ptr& base,
std::vector<view_ptr>&& views,
std::vector<db::view::view_and_base>&& views,
mutation&& m,
flat_mutation_reader_opt existings) const {
auto base_token = m.token();
@@ -2206,7 +2212,7 @@ table::local_base_lock(
* @return a future that resolves when the updates have been acknowledged by the view replicas
*/
future<> table::populate_views(
std::vector<view_ptr> views,
std::vector<db::view::view_and_base> views,
dht::token base_token,
flat_mutation_reader&& reader) {
auto& schema = reader.schema();
@@ -2527,7 +2533,7 @@ future<row_locker::lock_holder> table::do_push_view_replica_updates(const schema
}
auto& base = schema();
m.upgrade(base);
auto views = affected_views(base, m);
auto views = db::view::with_base_info_snapshot(affected_views(base, m));
if (views.empty()) {
return make_ready_future<row_locker::lock_holder>();
}


@@ -28,8 +28,8 @@ fi
SCYLLA_IP=127.1.$(($$ >> 8 & 255)).$(($$ & 255))
echo "Running Scylla on $SCYLLA_IP"
tmp_dir=/tmp/alternator-test-$$
mkdir $tmp_dir
tmp_dir="$(readlink -e ${TMPDIR-/tmp})"/alternator-test-$$
mkdir "$tmp_dir"
# We run the cleanup() function on exit for any reason - successful finish
# of the script, an error (since we have "set -e"), or a signal.


@@ -1351,3 +1351,37 @@ def test_condition_expression_with_forbidden_rmw(scylla_only, dynamodb, test_tab
assert test_table_s.get_item(Key={'p': s}, ConsistentRead=True)['Item'] == {'p': s, 'regular': 'write'}
test_table_s.update_item(Key={'p': s}, AttributeUpdates={'write': {'Value': 'regular', 'Action': 'PUT'}})
assert test_table_s.get_item(Key={'p': s}, ConsistentRead=True)['Item'] == {'p': s, 'regular': 'write', 'write': 'regular'}
# Reproducer for issue #6573: binary strings should be ordered as unsigned
# bytes, i.e., byte 128 comes after 127, not before as with signed bytes.
# Test the five ordering operators: <, <=, >, >=, between
def test_condition_expression_unsigned_bytes(test_table_s):
p = random_string()
test_table_s.put_item(Item={'p': p, 'b': bytearray([127])})
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression='b < :oldval',
ExpressionAttributeValues={':newval': 1, ':oldval': bytearray([128])})
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item']['z'] == 1
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression='b <= :oldval',
ExpressionAttributeValues={':newval': 2, ':oldval': bytearray([128])})
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item']['z'] == 2
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression='b between :oldval1 and :oldval2',
ExpressionAttributeValues={':newval': 3, ':oldval1': bytearray([126]), ':oldval2': bytearray([128])})
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item']['z'] == 3
test_table_s.put_item(Item={'p': p, 'b': bytearray([128])})
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression='b > :oldval',
ExpressionAttributeValues={':newval': 4, ':oldval': bytearray([127])})
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item']['z'] == 4
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression='b >= :oldval',
ExpressionAttributeValues={':newval': 5, ':oldval': bytearray([127])})
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item']['z'] == 5
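
Note on the reproducer above: the comment's distinction between signed and unsigned byte ordering can be shown with a small standalone sketch. It is not part of the test suite. The `signed_key` helper is hypothetical and only emulates a comparison that treats each byte as a signed 8-bit value, while Python's built-in `bytes` comparison is unsigned.

```python
import struct


def signed_key(b):
    # Reinterpret each byte as a signed 8-bit value (-128..127).
    return list(struct.unpack(f'{len(b)}b', b))


lo, hi = bytes([127]), bytes([128])

# Python compares bytes objects as sequences of unsigned values.
unsigned_order = lo < hi

# With a signed interpretation, byte 128 becomes -128 and sorts first.
signed_order = signed_key(lo) < signed_key(hi)
```

Here `unsigned_order` is true and `signed_order` is false, which is the mismatch the test guards against for the five ordering operators.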


@@ -1077,3 +1077,42 @@ def test_put_item_expected(test_table_s):
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item'] == {'p': p, 'a': 2}
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.put_item(Item={'p': p, 'a': 3}, Expected={'a': {'Value': 1}})
# Reproducer for issue #6573: binary strings should be ordered as unsigned
# bytes, i.e., byte 128 comes after 127, not before as with signed bytes.
# Test the five ordering operators: LT, LE, GT, GE, BETWEEN
def test_update_expected_unsigned_bytes(test_table_s):
p = random_string()
test_table_s.put_item(Item={'p': p, 'b': bytearray([127])})
test_table_s.update_item(Key={'p': p},
AttributeUpdates={'z': {'Value': 1, 'Action': 'PUT'}},
Expected={'b': {'ComparisonOperator': 'LT',
'AttributeValueList': [bytearray([128])]}}
)
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item']['z'] == 1
test_table_s.update_item(Key={'p': p},
AttributeUpdates={'z': {'Value': 2, 'Action': 'PUT'}},
Expected={'b': {'ComparisonOperator': 'LE',
'AttributeValueList': [bytearray([128])]}}
)
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item']['z'] == 2
test_table_s.update_item(Key={'p': p},
AttributeUpdates={'z': {'Value': 3, 'Action': 'PUT'}},
Expected={'b': {'ComparisonOperator': 'BETWEEN',
'AttributeValueList': [bytearray([126]), bytearray([128])]}}
)
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item']['z'] == 3
test_table_s.put_item(Item={'p': p, 'b': bytearray([128])})
test_table_s.update_item(Key={'p': p},
AttributeUpdates={'z': {'Value': 4, 'Action': 'PUT'}},
Expected={'b': {'ComparisonOperator': 'GT',
'AttributeValueList': [bytearray([127])]}}
)
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item']['z'] == 4
test_table_s.update_item(Key={'p': p},
AttributeUpdates={'z': {'Value': 5, 'Action': 'PUT'}},
Expected={'b': {'ComparisonOperator': 'GE',
'AttributeValueList': [bytearray([127])]}}
)
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item']['z'] == 5


@@ -520,6 +520,42 @@ def test_key_condition_expression_and_conditions(test_table_sn_with_sorted_parti
'ComparisonOperator': 'GT'}}
)
# Demonstrate that issue #6573 was not a bug for KeyConditionExpression:
# binary strings are ordered as unsigned bytes, i.e., byte 128 comes after
# 127, not as signed bytes.
# Test the five ordering operators: <, <=, >, >=, between
def test_key_condition_expression_unsigned_bytes(test_table_sb):
p = random_string()
items = [{'p': p, 'c': bytearray([i])} for i in range(126,129)]
with test_table_sb.batch_writer() as batch:
for item in items:
batch.put_item(item)
got_items = full_query(test_table_sb,
KeyConditionExpression='p=:p AND c<:c',
ExpressionAttributeValues={':p': p, ':c': bytearray([127])})
expected_items = [item for item in items if item['c'] < bytearray([127])]
assert(got_items == expected_items)
got_items = full_query(test_table_sb,
KeyConditionExpression='p=:p AND c<=:c',
ExpressionAttributeValues={':p': p, ':c': bytearray([127])})
expected_items = [item for item in items if item['c'] <= bytearray([127])]
assert(got_items == expected_items)
got_items = full_query(test_table_sb,
KeyConditionExpression='p=:p AND c>:c',
ExpressionAttributeValues={':p': p, ':c': bytearray([127])})
expected_items = [item for item in items if item['c'] > bytearray([127])]
assert(got_items == expected_items)
got_items = full_query(test_table_sb,
KeyConditionExpression='p=:p AND c>=:c',
ExpressionAttributeValues={':p': p, ':c': bytearray([127])})
expected_items = [item for item in items if item['c'] >= bytearray([127])]
assert(got_items == expected_items)
got_items = full_query(test_table_sb,
KeyConditionExpression='p=:p AND c BETWEEN :c1 AND :c2',
ExpressionAttributeValues={':p': p, ':c1': bytearray([127]), ':c2': bytearray([128])})
expected_items = [item for item in items if item['c'] >= bytearray([127]) and item['c'] <= bytearray([128])]
assert(got_items == expected_items)
# The following is an older test we had, which tests one arbitrary use case
# for KeyConditionExpression. It uses filled_test_table (the one we also
# use in test_scan.py) instead of the fixtures defined in this file.


@@ -3443,10 +3443,13 @@ SEASTAR_TEST_CASE(test_select_with_mixed_order_table) {
}
uint64_t
run_and_examine_cache_stat_change(cql_test_env& e, uint64_t cache_tracker::stats::*metric, std::function<void (cql_test_env& e)> func) {
run_and_examine_cache_read_stats_change(cql_test_env& e, std::string_view cf_name, std::function<void (cql_test_env& e)> func) {
auto read_stat = [&] {
auto local_read_metric = [metric] (database& db) { return db.row_cache_tracker().get_stats().*metric; };
return e.db().map_reduce0(local_read_metric, uint64_t(0), std::plus<uint64_t>()).get0();
return e.db().map_reduce0([&cf_name] (const database& db) {
auto& t = db.find_column_family("ks", sstring(cf_name));
auto& stats = t.get_row_cache().stats();
return stats.reads_with_misses.count() + stats.reads_with_no_misses.count();
}, uint64_t(0), std::plus<uint64_t>()).get0();
};
auto before = read_stat();
func(e);
@@ -3457,11 +3460,11 @@ run_and_examine_cache_stat_change(cql_test_env& e, uint64_t cache_tracker::stats
SEASTAR_TEST_CASE(test_cache_bypass) {
return do_with_cql_env_thread([] (cql_test_env& e) {
e.execute_cql("CREATE TABLE t (k int PRIMARY KEY)").get();
auto with_cache = run_and_examine_cache_stat_change(e, &cache_tracker::stats::reads, [] (cql_test_env& e) {
auto with_cache = run_and_examine_cache_read_stats_change(e, "t", [] (cql_test_env& e) {
e.execute_cql("SELECT * FROM t").get();
});
BOOST_REQUIRE(with_cache >= smp::count); // scan may make multiple passes per shard
auto without_cache = run_and_examine_cache_stat_change(e, &cache_tracker::stats::reads, [] (cql_test_env& e) {
auto without_cache = run_and_examine_cache_read_stats_change(e, "t", [] (cql_test_env& e) {
e.execute_cql("SELECT * FROM t BYPASS CACHE").get();
});
BOOST_REQUIRE_EQUAL(without_cache, 0);
@@ -4523,3 +4526,12 @@ SEASTAR_TEST_CASE(test_impossible_where) {
require_rows(e, "SELECT * FROM t2 WHERE c>=10 AND c<=0 ALLOW FILTERING", {});
});
}
SEASTAR_TEST_CASE(test_counter_column_added_into_non_counter_table) {
return do_with_cql_env_thread([] (cql_test_env& e) {
cquery_nofail(e, "CREATE TABLE t (pk int, ck int, PRIMARY KEY(pk, ck))");
BOOST_REQUIRE_THROW(e.execute_cql("ALTER TABLE t ADD \"c\" counter;").get(),
exceptions::configuration_exception);
});
}


@@ -1134,6 +1134,9 @@ SEASTAR_TEST_CASE(test_filtering) {
{ int32_type->decompose(8), int32_type->decompose(3) },
{ int32_type->decompose(9), int32_type->decompose(3) },
});
require_rows(e, "SELECT k FROM cf WHERE k=12 AND (m,n)>=(4,0) ALLOW FILTERING;", {
{ int32_type->decompose(12), int32_type->decompose(4), int32_type->decompose(5)},
});
}
// test filtering on clustering keys


@@ -43,6 +43,7 @@
#include "test/lib/make_random_string.hh"
#include "test/lib/dummy_partitioner.hh"
#include "test/lib/reader_lifecycle_policy.hh"
#include "test/lib/random_utils.hh"
#include "dht/sharder.hh"
#include "mutation_reader.hh"
@@ -2737,3 +2738,597 @@ SEASTAR_THREAD_TEST_CASE(test_compacting_reader_next_partition) {
}
reader_assertions.produces_end_of_stream();
}
SEASTAR_THREAD_TEST_CASE(test_auto_paused_evictable_reader_is_mutation_source) {
auto make_populate = [] (schema_ptr s, const std::vector<mutation>& mutations, gc_clock::time_point query_time) {
auto mt = make_lw_shared<memtable>(s);
for (auto& mut : mutations) {
mt->apply(mut);
}
auto sem = make_lw_shared<reader_concurrency_semaphore>(reader_concurrency_semaphore::no_limits());
return mutation_source([=] (
schema_ptr s,
reader_permit permit,
const dht::partition_range& range,
const query::partition_slice& slice,
const io_priority_class& pc,
tracing::trace_state_ptr trace_state,
streamed_mutation::forwarding fwd_sm,
mutation_reader::forwarding fwd_mr) mutable {
auto mr = make_auto_paused_evictable_reader(mt->as_data_source(), std::move(s), *sem, range, slice, pc, std::move(trace_state), fwd_mr);
if (fwd_sm == streamed_mutation::forwarding::yes) {
return make_forwardable(std::move(mr));
}
return mr;
});
};
run_mutation_source_tests(make_populate);
}
SEASTAR_THREAD_TEST_CASE(test_manual_paused_evictable_reader_is_mutation_source) {
class maybe_pausing_reader : public flat_mutation_reader::impl {
flat_mutation_reader _reader;
std::optional<evictable_reader_handle> _handle;
private:
void maybe_pause() {
if (!tests::random::get_int(0, 4)) {
_handle->pause();
}
}
public:
maybe_pausing_reader(
memtable& mt,
reader_concurrency_semaphore& semaphore,
const dht::partition_range& pr,
const query::partition_slice& ps,
const io_priority_class& pc,
tracing::trace_state_ptr trace_state,
mutation_reader::forwarding fwd_mr)
: impl(mt.schema()), _reader(nullptr) {
std::tie(_reader, _handle) = make_manually_paused_evictable_reader(mt.as_data_source(), mt.schema(), semaphore, pr, ps, pc,
std::move(trace_state), fwd_mr);
}
virtual future<> fill_buffer(db::timeout_clock::time_point timeout) override {
return _reader.fill_buffer(timeout).then([this] {
_end_of_stream = _reader.is_end_of_stream();
_reader.move_buffer_content_to(*this);
}).then([this] {
maybe_pause();
});
}
virtual void next_partition() override {
clear_buffer_to_next_partition();
if (!is_buffer_empty()) {
return;
}
_end_of_stream = false;
_reader.next_partition();
}
virtual future<> fast_forward_to(const dht::partition_range& pr, db::timeout_clock::time_point timeout) override {
clear_buffer();
_end_of_stream = false;
return _reader.fast_forward_to(pr, timeout).then([this] {
maybe_pause();
});
}
virtual future<> fast_forward_to(position_range pr, db::timeout_clock::time_point timeout) override {
throw_with_backtrace<std::bad_function_call>();
}
virtual size_t buffer_size() const override {
return flat_mutation_reader::impl::buffer_size() + _reader.buffer_size();
}
};
auto make_populate = [this] (schema_ptr s, const std::vector<mutation>& mutations, gc_clock::time_point query_time) {
auto mt = make_lw_shared<memtable>(s);
for (auto& mut : mutations) {
mt->apply(mut);
}
auto sem = make_lw_shared<reader_concurrency_semaphore>(reader_concurrency_semaphore::no_limits());
return mutation_source([=] (
schema_ptr s,
reader_permit permit,
const dht::partition_range& range,
const query::partition_slice& slice,
const io_priority_class& pc,
tracing::trace_state_ptr trace_state,
streamed_mutation::forwarding fwd_sm,
mutation_reader::forwarding fwd_mr) mutable {
auto mr = make_flat_mutation_reader<maybe_pausing_reader>(*mt, *sem, range, slice, pc, std::move(trace_state), fwd_mr);
if (fwd_sm == streamed_mutation::forwarding::yes) {
return make_forwardable(std::move(mr));
}
return mr;
});
};
run_mutation_source_tests(make_populate);
}
namespace {
std::deque<mutation_fragment> copy_fragments(const schema& s, const std::deque<mutation_fragment>& o) {
std::deque<mutation_fragment> buf;
for (const auto& mf : o) {
buf.emplace_back(s, mf);
}
return buf;
}
flat_mutation_reader create_evictable_reader_and_evict_after_first_buffer(
schema_ptr schema,
reader_concurrency_semaphore& semaphore,
const dht::partition_range& prange,
const query::partition_slice& slice,
std::deque<mutation_fragment> first_buffer,
position_in_partition_view last_fragment_position,
std::deque<mutation_fragment> second_buffer,
size_t max_buffer_size) {
class factory {
schema_ptr _schema;
std::optional<std::deque<mutation_fragment>> _first_buffer;
std::optional<std::deque<mutation_fragment>> _second_buffer;
size_t _max_buffer_size;
private:
std::optional<std::deque<mutation_fragment>> copy_buffer(const std::optional<std::deque<mutation_fragment>>& o) {
if (!o) {
return {};
}
return copy_fragments(*_schema, *o);
}
public:
factory(schema_ptr schema, std::deque<mutation_fragment> first_buffer, std::deque<mutation_fragment> second_buffer, size_t max_buffer_size)
: _schema(std::move(schema)), _first_buffer(std::move(first_buffer)), _second_buffer(std::move(second_buffer)), _max_buffer_size(max_buffer_size) {
}
factory(const factory& o)
: _schema(o._schema)
, _first_buffer(copy_buffer(o._first_buffer))
, _second_buffer(copy_buffer(o._second_buffer)) {
}
factory(factory&& o) = default;
flat_mutation_reader operator()(
schema_ptr s,
reader_permit permit,
const dht::partition_range& range,
const query::partition_slice& slice,
const io_priority_class& pc,
tracing::trace_state_ptr trace_state,
streamed_mutation::forwarding fwd_sm,
mutation_reader::forwarding fwd_mr) {
BOOST_REQUIRE(s == _schema);
if (_first_buffer) {
auto buf = *std::exchange(_first_buffer, {});
auto rd = make_flat_mutation_reader_from_fragments(_schema, std::move(buf));
rd.set_max_buffer_size(_max_buffer_size);
return rd;
}
if (_second_buffer) {
auto buf = *std::exchange(_second_buffer, {});
auto rd = make_flat_mutation_reader_from_fragments(_schema, std::move(buf));
rd.set_max_buffer_size(_max_buffer_size);
return rd;
}
return make_empty_flat_reader(_schema);
}
};
auto ms = mutation_source(factory(schema, std::move(first_buffer), std::move(second_buffer), max_buffer_size));
auto [rd, handle] = make_manually_paused_evictable_reader(
std::move(ms),
schema,
semaphore,
prange,
slice,
seastar::default_priority_class(),
nullptr,
mutation_reader::forwarding::yes);
rd.set_max_buffer_size(max_buffer_size);
rd.fill_buffer(db::no_timeout).get0();
const auto eq_cmp = position_in_partition::equal_compare(*schema);
BOOST_REQUIRE(rd.is_buffer_full());
BOOST_REQUIRE(eq_cmp(rd.buffer().back().position(), last_fragment_position));
BOOST_REQUIRE(!rd.is_end_of_stream());
rd.detach_buffer();
handle.pause();
while(semaphore.try_evict_one_inactive_read());
return std::move(rd);
}
}
SEASTAR_THREAD_TEST_CASE(test_evictable_reader_trim_range_tombstones) {
reader_concurrency_semaphore semaphore(reader_concurrency_semaphore::no_limits{});
simple_schema s;
const auto pkey = s.make_pkey();
size_t max_buffer_size = 512;
const int first_ck = 100;
const int second_buffer_ck = first_ck + 100;
size_t mem_usage = 0;
std::deque<mutation_fragment> first_buffer;
first_buffer.emplace_back(partition_start{pkey, {}});
mem_usage = first_buffer.back().memory_usage(*s.schema());
for (int i = 0; i < second_buffer_ck; ++i) {
first_buffer.emplace_back(s.make_row(s.make_ckey(i++), "v"));
mem_usage += first_buffer.back().memory_usage(*s.schema());
}
const auto last_fragment_position = position_in_partition(first_buffer.back().position());
max_buffer_size = mem_usage;
first_buffer.emplace_back(s.make_row(s.make_ckey(second_buffer_ck), "v"));
std::deque<mutation_fragment> second_buffer;
second_buffer.emplace_back(partition_start{pkey, {}});
mem_usage = second_buffer.back().memory_usage(*s.schema());
second_buffer.emplace_back(s.make_range_tombstone(query::clustering_range::make_ending_with(s.make_ckey(second_buffer_ck + 10))));
int ckey = second_buffer_ck;
while (mem_usage <= max_buffer_size) {
second_buffer.emplace_back(s.make_row(s.make_ckey(ckey++), "v"));
mem_usage += second_buffer.back().memory_usage(*s.schema());
}
second_buffer.emplace_back(partition_end{});
auto rd = create_evictable_reader_and_evict_after_first_buffer(s.schema(), semaphore, query::full_partition_range,
s.schema()->full_slice(), std::move(first_buffer), last_fragment_position, std::move(second_buffer), max_buffer_size);
rd.fill_buffer(db::no_timeout).get();
const auto tri_cmp = position_in_partition::tri_compare(*s.schema());
BOOST_REQUIRE(tri_cmp(last_fragment_position, rd.peek_buffer().position()) < 0);
}
namespace {
void check_evictable_reader_validation_is_triggered(
std::string_view test_name,
std::string_view error_prefix, // empty str if no exception is expected
schema_ptr schema,
reader_concurrency_semaphore& semaphore,
const dht::partition_range& prange,
const query::partition_slice& slice,
std::deque<mutation_fragment> first_buffer,
position_in_partition_view last_fragment_position,
std::deque<mutation_fragment> second_buffer,
size_t max_buffer_size) {
testlog.info("check_evictable_reader_validation_is_triggered(): checking {} test case: {}", error_prefix.empty() ? "positive" : "negative", test_name);
auto rd = create_evictable_reader_and_evict_after_first_buffer(std::move(schema), semaphore, prange, slice, std::move(first_buffer),
last_fragment_position, std::move(second_buffer), max_buffer_size);
const bool fail = !error_prefix.empty();
try {
rd.fill_buffer(db::no_timeout).get0();
} catch (std::runtime_error& e) {
if (fail) {
if (error_prefix == std::string_view(e.what(), error_prefix.size())) {
testlog.trace("Expected exception caught: {}", std::current_exception());
return;
} else {
BOOST_FAIL(fmt::format("Exception with unexpected message caught: {}", std::current_exception()));
}
} else {
BOOST_FAIL(fmt::format("Unexpected exception caught: {}", std::current_exception()));
}
}
if (fail) {
BOOST_FAIL(fmt::format("Expected exception not thrown"));
}
}
}
SEASTAR_THREAD_TEST_CASE(test_evictable_reader_self_validation) {
set_abort_on_internal_error(false);
auto reset_on_internal_abort = defer([] {
set_abort_on_internal_error(true);
});
reader_concurrency_semaphore semaphore(reader_concurrency_semaphore::no_limits{});
simple_schema s;
auto pkeys = s.make_pkeys(4);
boost::sort(pkeys, dht::decorated_key::less_comparator(s.schema()));
size_t max_buffer_size = 512;
const int first_ck = 100;
const int second_buffer_ck = first_ck + 100;
const int last_ck = second_buffer_ck + 100;
static const char partition_error_prefix[] = "maybe_validate_partition_start(): validation failed";
static const char position_in_partition_error_prefix[] = "validate_position_in_partition(): validation failed";
static const char trim_range_tombstones_error_prefix[] = "maybe_trim_range_tombstone(): validation failed";
const auto prange = dht::partition_range::make(
dht::partition_range::bound(pkeys[1], true),
dht::partition_range::bound(pkeys[2], true));
const auto ckrange = query::clustering_range::make(
query::clustering_range::bound(s.make_ckey(first_ck), true),
query::clustering_range::bound(s.make_ckey(last_ck), true));
const auto slice = partition_slice_builder(*s.schema()).with_range(ckrange).build();
std::deque<mutation_fragment> first_buffer;
first_buffer.emplace_back(partition_start{pkeys[1], {}});
size_t mem_usage = first_buffer.back().memory_usage(*s.schema());
for (int i = 0; i < second_buffer_ck; ++i) {
first_buffer.emplace_back(s.make_row(s.make_ckey(i++), "v"));
mem_usage += first_buffer.back().memory_usage(*s.schema());
}
max_buffer_size = mem_usage;
auto last_fragment_position = position_in_partition(first_buffer.back().position());
first_buffer.emplace_back(s.make_row(s.make_ckey(second_buffer_ck), "v"));
auto make_second_buffer = [&s, &max_buffer_size, second_buffer_ck] (dht::decorated_key pkey, std::optional<int> first_ckey = {},
bool inject_range_tombstone = false) mutable {
auto ckey = first_ckey ? *first_ckey : second_buffer_ck;
std::deque<mutation_fragment> second_buffer;
second_buffer.emplace_back(partition_start{std::move(pkey), {}});
size_t mem_usage = second_buffer.back().memory_usage(*s.schema());
if (inject_range_tombstone) {
second_buffer.emplace_back(s.make_range_tombstone(query::clustering_range::make_ending_with(s.make_ckey(last_ck))));
}
while (mem_usage <= max_buffer_size) {
second_buffer.emplace_back(s.make_row(s.make_ckey(ckey++), "v"));
mem_usage += second_buffer.back().memory_usage(*s.schema());
}
second_buffer.emplace_back(partition_end{});
return second_buffer;
};
//
// Continuing the same partition
//
check_evictable_reader_validation_is_triggered(
"pkey < _last_pkey; pkey ∉ prange",
partition_error_prefix,
s.schema(),
semaphore,
prange,
slice,
copy_fragments(*s.schema(), first_buffer),
last_fragment_position,
make_second_buffer(pkeys[0]),
max_buffer_size);
check_evictable_reader_validation_is_triggered(
"pkey == _last_pkey",
"",
s.schema(),
semaphore,
prange,
slice,
copy_fragments(*s.schema(), first_buffer),
last_fragment_position,
make_second_buffer(pkeys[1]),
max_buffer_size);
check_evictable_reader_validation_is_triggered(
"pkey == _last_pkey; position_in_partition ∉ ckrange (<)",
position_in_partition_error_prefix,
s.schema(),
semaphore,
prange,
slice,
copy_fragments(*s.schema(), first_buffer),
last_fragment_position,
make_second_buffer(pkeys[1], first_ck - 10),
max_buffer_size);
check_evictable_reader_validation_is_triggered(
"pkey == _last_pkey; position_in_partition ∉ ckrange (<); start with trimmable range-tombstone",
position_in_partition_error_prefix,
s.schema(),
semaphore,
prange,
slice,
copy_fragments(*s.schema(), first_buffer),
last_fragment_position,
make_second_buffer(pkeys[1], first_ck - 10, true),
max_buffer_size);
check_evictable_reader_validation_is_triggered(
"pkey == _last_pkey; position_in_partition ∉ ckrange; position_in_partition < _next_position_in_partition",
position_in_partition_error_prefix,
s.schema(),
semaphore,
prange,
slice,
copy_fragments(*s.schema(), first_buffer),
last_fragment_position,
make_second_buffer(pkeys[1], second_buffer_ck - 2),
max_buffer_size);
check_evictable_reader_validation_is_triggered(
"pkey == _last_pkey; position_in_partition ∉ ckrange; position_in_partition < _next_position_in_partition; start with trimmable range-tombstone",
position_in_partition_error_prefix,
s.schema(),
semaphore,
prange,
slice,
copy_fragments(*s.schema(), first_buffer),
last_fragment_position,
make_second_buffer(pkeys[1], second_buffer_ck - 2, true),
max_buffer_size);
{
auto second_buffer = make_second_buffer(pkeys[1], second_buffer_ck);
second_buffer[1] = s.make_range_tombstone(query::clustering_range::make_ending_with(s.make_ckey(second_buffer_ck - 10)));
check_evictable_reader_validation_is_triggered(
"pkey == _last_pkey; end(range_tombstone) < _next_position_in_partition",
trim_range_tombstones_error_prefix,
s.schema(),
semaphore,
prange,
slice,
copy_fragments(*s.schema(), first_buffer),
last_fragment_position,
std::move(second_buffer),
max_buffer_size);
}
{
auto second_buffer = make_second_buffer(pkeys[1], second_buffer_ck);
second_buffer[1] = s.make_range_tombstone(query::clustering_range::make_ending_with(s.make_ckey(second_buffer_ck + 10)));
check_evictable_reader_validation_is_triggered(
"pkey == _last_pkey; end(range_tombstone) > _next_position_in_partition",
"",
s.schema(),
semaphore,
prange,
slice,
copy_fragments(*s.schema(), first_buffer),
last_fragment_position,
std::move(second_buffer),
max_buffer_size);
}
{
auto second_buffer = make_second_buffer(pkeys[1], second_buffer_ck);
second_buffer[1] = s.make_range_tombstone(query::clustering_range::make_starting_with(s.make_ckey(last_ck + 10)));
check_evictable_reader_validation_is_triggered(
"pkey == _last_pkey; start(range_tombstone) ∉ ckrange (>)",
position_in_partition_error_prefix,
s.schema(),
semaphore,
prange,
slice,
copy_fragments(*s.schema(), first_buffer),
last_fragment_position,
std::move(second_buffer),
max_buffer_size);
}
check_evictable_reader_validation_is_triggered(
"pkey == _last_pkey; position_in_partition ∈ ckrange",
"",
s.schema(),
semaphore,
prange,
slice,
copy_fragments(*s.schema(), first_buffer),
last_fragment_position,
make_second_buffer(pkeys[1], second_buffer_ck),
max_buffer_size);
check_evictable_reader_validation_is_triggered(
"pkey == _last_pkey; position_in_partition ∉ ckrange (>)",
position_in_partition_error_prefix,
s.schema(),
semaphore,
prange,
slice,
copy_fragments(*s.schema(), first_buffer),
last_fragment_position,
make_second_buffer(pkeys[1], last_ck + 10),
max_buffer_size);
check_evictable_reader_validation_is_triggered(
"pkey > _last_pkey; pkey ∈ pkrange",
partition_error_prefix,
s.schema(),
semaphore,
prange,
slice,
copy_fragments(*s.schema(), first_buffer),
last_fragment_position,
make_second_buffer(pkeys[2]),
max_buffer_size);
check_evictable_reader_validation_is_triggered(
"pkey > _last_pkey; pkey ∉ pkrange",
partition_error_prefix,
s.schema(),
semaphore,
prange,
slice,
copy_fragments(*s.schema(), first_buffer),
last_fragment_position,
make_second_buffer(pkeys[3]),
max_buffer_size);
//
// Continuing from next partition
//
first_buffer.clear();
first_buffer.emplace_back(partition_start{pkeys[1], {}});
mem_usage = first_buffer.back().memory_usage(*s.schema());
for (int i = 0; i < second_buffer_ck; ++i) {
first_buffer.emplace_back(s.make_row(s.make_ckey(i++), "v"));
mem_usage += first_buffer.back().memory_usage(*s.schema());
}
first_buffer.emplace_back(partition_end{});
mem_usage += first_buffer.back().memory_usage(*s.schema());
last_fragment_position = position_in_partition(first_buffer.back().position());
max_buffer_size = mem_usage;
first_buffer.emplace_back(partition_start{pkeys[2], {}});
check_evictable_reader_validation_is_triggered(
"pkey < _last_pkey; pkey ∉ pkrange",
partition_error_prefix,
s.schema(),
semaphore,
prange,
slice,
copy_fragments(*s.schema(), first_buffer),
last_fragment_position,
make_second_buffer(pkeys[0]),
max_buffer_size);
check_evictable_reader_validation_is_triggered(
"pkey == _last_pkey",
partition_error_prefix,
s.schema(),
semaphore,
prange,
slice,
copy_fragments(*s.schema(), first_buffer),
last_fragment_position,
make_second_buffer(pkeys[1]),
max_buffer_size);
check_evictable_reader_validation_is_triggered(
"pkey > _last_pkey; pkey ∈ pkrange",
"",
s.schema(),
semaphore,
prange,
slice,
copy_fragments(*s.schema(), first_buffer),
last_fragment_position,
make_second_buffer(pkeys[2]),
max_buffer_size);
check_evictable_reader_validation_is_triggered(
"pkey > _last_pkey; pkey ∉ pkrange",
partition_error_prefix,
s.schema(),
semaphore,
prange,
slice,
copy_fragments(*s.schema(), first_buffer),
last_fragment_position,
make_second_buffer(pkeys[3]),
max_buffer_size);
}
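
The partition-level expectations exercised by the test above can be condensed into a small predicate. The following is only a sketch inferred from the expected outcomes of the test cases, not the evictable reader's actual validation code; the names `pk_range` and `partition_start_is_valid` are hypothetical, and partition keys are modelled as plain integers.

```cpp
// Hypothetical stand-in for a partition range: an inclusive [start, end]
// interval over integer "partition keys".
struct pk_range {
    int start;
    int end;
    bool contains(int pkey) const { return start <= pkey && pkey <= end; }
};

// Returns true when the first partition seen after the reader is recreated is
// acceptable. `same_partition` is true when the previous buffer stopped in the
// middle of partition `last_pkey`, and false when it stopped on a partition end.
inline bool partition_start_is_valid(int last_pkey, bool same_partition,
                                     const pk_range& range, int pkey) {
    if (same_partition) {
        // The reader is expected to resume exactly the partition it left.
        return pkey == last_pkey;
    }
    // Otherwise it must move strictly forward and stay inside the read range.
    return pkey > last_pkey && range.contains(pkey);
}
```

With `range = {1, 2}` and `last_pkey = 1`, this mirrors the test matrix: when continuing the same partition only key 1 is accepted, and when continuing from the next partition only key 2 is accepted.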


@@ -4733,8 +4733,8 @@ SEASTAR_TEST_CASE(sstable_scrub_test) {
table->add_sstable_and_update_cache(sst).get();
-BOOST_REQUIRE(table->candidates_for_compaction().size() == 1);
-BOOST_REQUIRE(table->candidates_for_compaction().front() == sst);
+BOOST_REQUIRE(table->non_staging_sstables().size() == 1);
+BOOST_REQUIRE(table->non_staging_sstables().front() == sst);
auto verify_fragments = [&] (sstables::shared_sstable sst, const std::vector<mutation_fragment>& mfs) {
auto r = assert_that(sst->as_mutation_source().make_reader(schema));
@@ -4755,7 +4755,7 @@ SEASTAR_TEST_CASE(sstable_scrub_test) {
// We expect the scrub with skip_corrupted=false to stop on the first invalid fragment.
compaction_manager.perform_sstable_scrub(table.get(), false).get();
-BOOST_REQUIRE(table->candidates_for_compaction().size() == 1);
+BOOST_REQUIRE(table->non_staging_sstables().size() == 1);
verify_fragments(sst, corrupt_fragments);
testlog.info("Scrub with --skip-corrupted=true");
@@ -4763,9 +4763,9 @@ SEASTAR_TEST_CASE(sstable_scrub_test) {
// We expect the scrub with skip_corrupted=true to get rid of all invalid data.
compaction_manager.perform_sstable_scrub(table.get(), true).get();
-BOOST_REQUIRE(table->candidates_for_compaction().size() == 1);
-BOOST_REQUIRE(table->candidates_for_compaction().front() != sst);
-verify_fragments(table->candidates_for_compaction().front(), scrubbed_fragments);
+BOOST_REQUIRE(table->non_staging_sstables().size() == 1);
+BOOST_REQUIRE(table->non_staging_sstables().front() != sst);
+verify_fragments(table->non_staging_sstables().front(), scrubbed_fragments);
});
}, test_cfg);
}
@@ -5631,3 +5631,48 @@ SEASTAR_TEST_CASE(incremental_compaction_data_resurrection_test) {
BOOST_REQUIRE(is_partition_dead(alpha));
});
}
SEASTAR_TEST_CASE(sstable_needs_cleanup_test) {
test_env env;
auto s = make_lw_shared(schema({}, some_keyspace, some_column_family,
{{"p1", utf8_type}}, {}, {}, {}, utf8_type));
auto tokens = token_generation_for_current_shard(10);
auto sst_gen = [&env, s, gen = make_lw_shared<unsigned>(1)] (sstring first, sstring last) mutable {
return sstable_for_overlapping_test(env, s, (*gen)++, first, last);
};
auto token = [&] (size_t index) -> dht::token {
return tokens[index].second;
};
auto key_from_token = [&] (size_t index) -> sstring {
return tokens[index].first;
};
auto token_range = [&] (size_t first, size_t last) -> dht::token_range {
return dht::token_range::make(token(first), token(last));
};
{
auto local_ranges = { token_range(0, 9) };
auto sst = sst_gen(key_from_token(0), key_from_token(9));
BOOST_REQUIRE(!needs_cleanup(sst, local_ranges, s));
}
{
auto local_ranges = { token_range(0, 1), token_range(3, 4), token_range(5, 6) };
auto sst = sst_gen(key_from_token(0), key_from_token(1));
BOOST_REQUIRE(!needs_cleanup(sst, local_ranges, s));
auto sst2 = sst_gen(key_from_token(2), key_from_token(2));
BOOST_REQUIRE(needs_cleanup(sst2, local_ranges, s));
auto sst3 = sst_gen(key_from_token(0), key_from_token(6));
BOOST_REQUIRE(needs_cleanup(sst3, local_ranges, s));
auto sst5 = sst_gen(key_from_token(7), key_from_token(7));
BOOST_REQUIRE(needs_cleanup(sst5, local_ranges, s));
}
return make_ready_future<>();
}
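
The cases in `sstable_needs_cleanup_test` are consistent with a simple containment rule: an sstable can skip cleanup only if its whole token span fits inside a single locally owned range. The sketch below illustrates that rule under stated assumptions and is not the `needs_cleanup()` implementation from this change; `token_span` and `needs_cleanup_sketch` are hypothetical names, tokens are modelled as integers, and both range bounds are assumed inclusive.

```cpp
#include <vector>

// Hypothetical stand-in for a token range with inclusive bounds.
struct token_span {
    long first;
    long last;
};

// Returns true if the sstable covering [sst_first, sst_last] is not fully
// contained in any single owned range, i.e. it may hold data the node no
// longer owns.
inline bool needs_cleanup_sketch(long sst_first, long sst_last,
                                 const std::vector<token_span>& owned) {
    for (const auto& r : owned) {
        if (r.first <= sst_first && sst_last <= r.last) {
            return false;  // entirely inside one owned range
        }
    }
    return true;
}
```

Note that an sstable spanning several owned ranges, such as tokens 0 to 6 against the ranges (0, 1), (3, 4) and (5, 6), still needs cleanup because it also covers the gaps between them.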


@@ -0,0 +1,55 @@
/*
* Copyright (C) 2020 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include <seastar/testing/thread_test_case.hh>
#include "utils/stall_free.hh"
SEASTAR_THREAD_TEST_CASE(test_merge1) {
std::list<int> l1{1, 2, 5, 8};
std::list<int> l2{3};
std::list<int> expected{1,2,3,5,8};
utils::merge_to_gently(l1, l2, std::less<int>());
BOOST_CHECK(l1 == expected);
}
SEASTAR_THREAD_TEST_CASE(test_merge2) {
std::list<int> l1{1};
std::list<int> l2{3, 5, 6};
std::list<int> expected{1,3,5,6};
utils::merge_to_gently(l1, l2, std::less<int>());
BOOST_CHECK(l1 == expected);
}
SEASTAR_THREAD_TEST_CASE(test_merge3) {
std::list<int> l1{};
std::list<int> l2{3, 5, 6};
std::list<int> expected{3,5,6};
utils::merge_to_gently(l1, l2, std::less<int>());
BOOST_CHECK(l1 == expected);
}
SEASTAR_THREAD_TEST_CASE(test_merge4) {
std::list<int> l1{1};
std::list<int> l2{};
std::list<int> expected{1};
utils::merge_to_gently(l1, l2, std::less<int>());
BOOST_CHECK(l1 == expected);
}


@@ -118,6 +118,25 @@ SEASTAR_TEST_CASE(test_access_and_schema) {
});
}
SEASTAR_TEST_CASE(test_column_dropped_from_base) {
return do_with_cql_env_thread([] (auto& e) {
e.execute_cql("create table cf (p int, c ascii, a int, v int, primary key (p, c));").get();
e.execute_cql("create materialized view vcf as select p, c, v from cf "
"where v is not null and p is not null and c is not null "
"primary key (v, p, c)").get();
e.execute_cql("alter table cf drop a;").get();
e.execute_cql("insert into cf (p, c, v) values (0, 'foo', 1);").get();
eventually([&] {
auto msg = e.execute_cql("select v from vcf").get0();
assert_that(msg).is_rows()
.with_size(1)
.with_row({
{int32_type->decompose(1)}
});
});
});
}
SEASTAR_TEST_CASE(test_updates) {
return do_with_cql_env_thread([] (auto& e) {
e.execute_cql("create table base (k int, v int, primary key (k));").get();


@@ -0,0 +1,29 @@
CREATE TABLE ks.tbl_cnt (pk int PRIMARY KEY, c1 counter, c2 counter);
-- insert some values in one column
UPDATE ks.tbl_cnt SET c1 = c1+1 WHERE pk = 1;
UPDATE ks.tbl_cnt SET c1 = c1+2 WHERE pk = 2;
UPDATE ks.tbl_cnt SET c1 = c1+3 WHERE pk = 3;
UPDATE ks.tbl_cnt SET c1 = c1+4 WHERE pk = 4;
-- test various filtering options on counter column
SELECT pk, c1 FROM ks.tbl_cnt WHERE c1 < 3 ALLOW FILTERING;
SELECT pk, c1 FROM ks.tbl_cnt WHERE c1 < 1 ALLOW FILTERING;
SELECT pk, c1 FROM ks.tbl_cnt WHERE c1 <= 3 ALLOW FILTERING;
SELECT pk, c1 FROM ks.tbl_cnt WHERE c1 > 2 AND pk = 4 ALLOW FILTERING;
SELECT pk, c1 FROM ks.tbl_cnt WHERE c1 >= 3 and pk = 3 ALLOW FILTERING;
SELECT pk, c1 FROM ks.tbl_cnt WHERE c1 > 4 ALLOW FILTERING;
SELECT pk, c1 FROM ks.tbl_cnt WHERE c1 in (-1, 2, 3) ALLOW FILTERING;
SELECT pk, c1 FROM ks.tbl_cnt WHERE c1 = 0 ALLOW FILTERING;
SELECT pk, c1 FROM ks.tbl_cnt WHERE c1 = 1 ALLOW FILTERING;
-- now filter through untouched counters `c2` - they should appear as NULLs and evaluate as zeros
SELECT pk, c1, c2 FROM ks.tbl_cnt WHERE c2 = 0 ALLOW FILTERING;
SELECT pk, c2 FROM ks.tbl_cnt WHERE c2 < 0 ALLOW FILTERING;
SELECT pk, c2 FROM ks.tbl_cnt WHERE c2 > 0 ALLOW FILTERING;
-- delete `c1` and make sure it doesn't appear in filtering results
DELETE c1 from ks.tbl_cnt WHERE pk = 1;
SELECT pk, c1 FROM ks.tbl_cnt WHERE c1 = 1 ALLOW FILTERING;
SELECT pk, c1 FROM ks.tbl_cnt WHERE c1 <= 1000 ALLOW FILTERING;
SELECT pk, c1 FROM ks.tbl_cnt WHERE c1 > -1000 ALLOW FILTERING;


@@ -0,0 +1,190 @@
CREATE TABLE ks.tbl_cnt (pk int PRIMARY KEY, c1 counter, c2 counter);
{
"status" : "ok"
}
-- insert some values in one column
UPDATE ks.tbl_cnt SET c1 = c1+1 WHERE pk = 1;
{
"status" : "ok"
}
UPDATE ks.tbl_cnt SET c1 = c1+2 WHERE pk = 2;
{
"status" : "ok"
}
UPDATE ks.tbl_cnt SET c1 = c1+3 WHERE pk = 3;
{
"status" : "ok"
}
UPDATE ks.tbl_cnt SET c1 = c1+4 WHERE pk = 4;
{
"status" : "ok"
}
-- test various filtering options on counter column
SELECT pk, c1 FROM ks.tbl_cnt WHERE c1 < 3 ALLOW FILTERING;
{
"rows" :
[
{
"c1" : "1",
"pk" : "1"
},
{
"c1" : "2",
"pk" : "2"
}
]
}
SELECT pk, c1 FROM ks.tbl_cnt WHERE c1 < 1 ALLOW FILTERING;
{
"rows" : null
}
SELECT pk, c1 FROM ks.tbl_cnt WHERE c1 <= 3 ALLOW FILTERING;
{
"rows" :
[
{
"c1" : "1",
"pk" : "1"
},
{
"c1" : "2",
"pk" : "2"
},
{
"c1" : "3",
"pk" : "3"
}
]
}
SELECT pk, c1 FROM ks.tbl_cnt WHERE c1 > 2 AND pk = 4 ALLOW FILTERING;
{
"rows" :
[
{
"c1" : "4",
"pk" : "4"
}
]
}
SELECT pk, c1 FROM ks.tbl_cnt WHERE c1 >= 3 and pk = 3 ALLOW FILTERING;
{
"rows" :
[
{
"c1" : "3",
"pk" : "3"
}
]
}
SELECT pk, c1 FROM ks.tbl_cnt WHERE c1 > 4 ALLOW FILTERING;
{
"rows" : null
}
SELECT pk, c1 FROM ks.tbl_cnt WHERE c1 in (-1, 2, 3) ALLOW FILTERING;
{
"rows" :
[
{
"c1" : "2",
"pk" : "2"
},
{
"c1" : "3",
"pk" : "3"
}
]
}
SELECT pk, c1 FROM ks.tbl_cnt WHERE c1 = 0 ALLOW FILTERING;
{
"rows" : null
}
SELECT pk, c1 FROM ks.tbl_cnt WHERE c1 = 1 ALLOW FILTERING;
{
"rows" :
[
{
"c1" : "1",
"pk" : "1"
}
]
}
-- now filter through untouched counters `c2` - they should appear as NULLs and evaluate as zeros
SELECT pk, c1, c2 FROM ks.tbl_cnt WHERE c2 = 0 ALLOW FILTERING;
{
"rows" :
[
{
"c1" : "1",
"pk" : "1"
},
{
"c1" : "2",
"pk" : "2"
},
{
"c1" : "4",
"pk" : "4"
},
{
"c1" : "3",
"pk" : "3"
}
]
}
SELECT pk, c2 FROM ks.tbl_cnt WHERE c2 < 0 ALLOW FILTERING;
{
"rows" : null
}
SELECT pk, c2 FROM ks.tbl_cnt WHERE c2 > 0 ALLOW FILTERING;
{
"rows" : null
}
-- delete `c1` and make sure it doesn't appear in filtering results
DELETE c1 from ks.tbl_cnt WHERE pk = 1;
{
"status" : "ok"
}
SELECT pk, c1 FROM ks.tbl_cnt WHERE c1 = 1 ALLOW FILTERING;
{
"rows" : null
}
SELECT pk, c1 FROM ks.tbl_cnt WHERE c1 <= 1000 ALLOW FILTERING;
{
"rows" :
[
{
"c1" : "2",
"pk" : "2"
},
{
"c1" : "4",
"pk" : "4"
},
{
"c1" : "3",
"pk" : "3"
}
]
}
SELECT pk, c1 FROM ks.tbl_cnt WHERE c1 > -1000 ALLOW FILTERING;
{
"rows" :
[
{
"c1" : "2",
"pk" : "2"
},
{
"c1" : "4",
"pk" : "4"
},
{
"c1" : "3",
"pk" : "3"
}
]
}


@@ -289,9 +289,9 @@ cql3::query_options trace_keyspace_helper::make_slow_query_mutation_data(const o
auto millis_since_epoch = std::chrono::duration_cast<std::chrono::milliseconds>(record.started_at.time_since_epoch()).count();
// query command is stored on a parameters map with a 'query' key
-auto query_str_it = record.parameters.find("query");
+const auto query_str_it = record.parameters.find("query");
if (query_str_it == record.parameters.end()) {
-throw std::logic_error("No \"query\" parameter set for a session requesting a slow_query_log record");
+tlogger.trace("No \"query\" parameter set for a session requesting a slow_query_log record");
}
// parameters map
@@ -312,7 +312,9 @@ cql3::query_options trace_keyspace_helper::make_slow_query_mutation_data(const o
cql3::raw_value::make_value(uuid_type->decompose(session_records.session_id)),
cql3::raw_value::make_value(timestamp_type->decompose(millis_since_epoch)),
cql3::raw_value::make_value(timeuuid_type->decompose(start_time_id)),
-cql3::raw_value::make_value(utf8_type->decompose(query_str_it->second)),
+query_str_it != record.parameters.end()
+? cql3::raw_value::make_value(utf8_type->decompose(query_str_it->second))
+: cql3::raw_value::make_null(),
cql3::raw_value::make_value(int32_type->decompose(elapsed_to_micros(record.elapsed))),
cql3::raw_value::make_value(make_map_value(my_map_type, map_type_impl::native_type(std::move(parameters_values_vector))).serialize()),
cql3::raw_value::make_value(inet_addr_type->decompose(record.client.addr())),


@@ -1985,8 +1985,10 @@ struct compare_visitor {
int32_t operator()(const empty_type_impl&) { return 0; }
int32_t operator()(const tuple_type_impl& t) { return compare_aux(t, v1, v2); }
int32_t operator()(const counter_type_impl&) {
-fail(unimplemented::cause::COUNTERS);
-return 0;
+// untouched (empty) counter evaluates as 0
+const auto a = v1.empty() ? 0 : simple_type_traits<int64_t>::read_nonempty(v1);
+const auto b = v2.empty() ? 0 : simple_type_traits<int64_t>::read_nonempty(v2);
+return a == b ? 0 : a < b ? -1 : 1;
}
int32_t operator()(const decimal_type_impl& d) {
if (v1.empty()) {
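
The new counter comparison treats an untouched (empty) counter as zero and then does a plain three-way compare. A minimal standalone sketch of that rule follows; it models a possibly empty serialized value as `std::optional<std::int64_t>`, which is an assumption made for illustration, and `compare_counters` is a hypothetical name rather than part of the change.

```cpp
#include <cstdint>
#include <optional>

// Three-way comparison of two counter values where an absent value stands in
// for an empty (never updated) counter and is evaluated as 0.
// Returns -1, 0 or 1, like the visitor in the hunk above.
inline int compare_counters(std::optional<std::int64_t> v1,
                            std::optional<std::int64_t> v2) {
    const std::int64_t a = v1.value_or(0);
    const std::int64_t b = v2.value_or(0);
    return a == b ? 0 : a < b ? -1 : 1;
}
```

This is why the filtering test in this comparison expects `WHERE c2 = 0` to match every row whose `c2` was never updated, while `c2 < 0` and `c2 > 0` match nothing.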

utils/stall_free.hh (new file, 69 lines)

@@ -0,0 +1,69 @@
/*
* Copyright (C) 2020 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <list>
#include <algorithm>
#include <seastar/core/thread.hh>
#include <seastar/core/future.hh>
#include <seastar/core/future-util.hh>
namespace utils {
// Similar to std::merge but it does not stall. Must run inside a seastar
// thread. It merges items from list2 into list1. Items from list2 can only be copied.
template<class T, class Compare>
void merge_to_gently(std::list<T>& list1, const std::list<T>& list2, Compare comp) {
auto first1 = list1.begin();
auto first2 = list2.begin();
auto last1 = list1.end();
auto last2 = list2.end();
while (first2 != last2) {
seastar::thread::maybe_yield();
if (first1 == last1) {
// Copy remaining items of list2 into list1
std::copy_if(first2, last2, std::back_inserter(list1), [] (const auto&) { return true; });
return;
}
if (comp(*first2, *first1)) {
first1 = list1.insert(first1, *first2);
++first2;
} else {
++first1;
}
}
}
template<class T>
seastar::future<> clear_gently(std::list<T>& list) {
return repeat([&list] () mutable {
if (list.empty()) {
return seastar::stop_iteration::yes;
}
list.pop_back();
return seastar::stop_iteration::no;
});
}
}
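
To make the merge loop in `merge_to_gently` easier to follow outside of Seastar, here is a standard-library-only sketch of the same insertion strategy. It deliberately omits the `seastar::thread::maybe_yield()` call, so it shows the ordering logic only and does not provide the stall-free property; `merge_into` is a hypothetical name. Both lists are assumed to be sorted according to `comp` already.

```cpp
#include <functional>
#include <list>

// Merges copies of the elements of `src` into `dst`, keeping `dst` sorted.
template <class T, class Compare>
void merge_into(std::list<T>& dst, const std::list<T>& src, Compare comp) {
    auto d = dst.begin();
    auto s = src.begin();
    while (s != src.end()) {
        if (d == dst.end()) {
            // dst is exhausted: append whatever is left of src.
            dst.insert(dst.end(), s, src.end());
            return;
        }
        if (comp(*s, *d)) {
            // *s sorts before *d: insert a copy in front of d.
            // d now refers to the inserted copy; the next round steps past it.
            d = dst.insert(d, *s);
            ++s;
        } else {
            ++d;
        }
    }
}
```

Because `std::list::insert` does not invalidate existing iterators, the loop can keep walking `dst` while inserting into it, which is what allows the original to yield between steps.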


@@ -24,6 +24,7 @@
#include "dht/i_partitioner.hh"
#include "query-request.hh"
#include "schema_fwd.hh"
#include "db/view/view.hh"
namespace cql3::statements {
class select_statement;
@@ -35,9 +36,7 @@ class view_info final {
// The following fields are used to select base table rows.
mutable shared_ptr<cql3::statements::select_statement> _select_statement;
mutable std::optional<query::partition_slice> _partition_slice;
-// Id of a regular base table column included in the view's PK, if any.
-// Scylla views only allow one such column, alternator can have up to two.
-mutable std::vector<column_id> _base_non_pk_columns_in_view_pk;
+db::view::base_info_ptr _base_info;
public:
view_info(const schema& schema, const raw_view_info& raw_view_info);
@@ -65,8 +64,22 @@ public:
const query::partition_slice& partition_slice() const;
const column_definition* view_column(const schema& base, column_id base_id) const;
const column_definition* view_column(const column_definition& base_def) const;
-const std::vector<column_id>& base_non_pk_columns_in_view_pk() const;
-void initialize_base_dependent_fields(const schema& base);
+bool has_base_non_pk_columns_in_view_pk() const;
+/// Returns a pointer to the base_dependent_view_info which matches the current
+/// schema of the base table.
+///
+/// base_dependent_view_info lives separately from the view schema.
+/// It can change without the view schema changing its value.
+/// This pointer is updated on base table schema changes as long as this view_info
+/// corresponds to the current schema of the view. After that the pointer stops tracking
+/// the base table schema.
+///
+/// The snapshot of both the view schema and base_dependent_view_info is represented
+/// by view_and_base. See with_base_info_snapshot().
+const db::view::base_info_ptr& base_info() const { return _base_info; }
+void set_base_info(db::view::base_info_ptr);
+db::view::base_info_ptr make_base_dependent_view_info(const schema& base_schema) const;
friend bool operator==(const view_info& x, const view_info& y) {
return x._raw == y._raw;