Compare commits

1740 Commits

Author SHA1 Message Date
Yaron Kaikov
4ae9a56466 release: prepare for 4.0.11 2020-10-26 18:12:47 +02:00
Avi Kivity
0374c1d040 Update seastar submodule
* seastar 065a40b34a...748428930a (1):
  > append_challenged_posix_file_impl: allow destructing file with no queued work

Fixes #7285.
2020-10-19 15:06:24 +03:00
Botond Dénes
9cb0fe3b33 reader_permit: reader_resources: make true RAII class
Currently in all cases we first deduct the to-be-consumed resources,
then construct the `reader_resources` class to protect it (release it on
destruction). This is error prone as it relies on no exception being
thrown while constructing the `reader_resources`. Although the
`reader_resources` constructor is `noexcept` right now, this might change
in the future, and as the call sites relying on this are disconnected
from the declaration, whoever modifies it might not notice.
To make this safe going forward, make the `reader_resources` a true RAII
class, consuming the units in its constructor and releasing them in its
destructor.
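
A minimal sketch of the pattern described above. The names `resource_pool`, `resource_units` and `available_while_held` are hypothetical and are not Scylla's actual `reader_permit` types; the point is only that the guard itself deducts the units, so no caller has to deduct first and protect afterwards.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <utility>

// Hypothetical stand-in for whatever tracks the available resources.
struct resource_pool {
    std::size_t available;
};

// Hypothetical RAII guard: consumes units in its constructor and
// releases them in its destructor, so there is no window in which
// units are deducted but not yet protected.
class resource_units {
    resource_pool* _pool;
    std::size_t _amount;
public:
    resource_units(resource_pool& pool, std::size_t amount)
        : _pool(&pool), _amount(amount) {
        if (amount > pool.available) {
            // Nothing has been deducted yet, so throwing leaks nothing.
            throw std::runtime_error("not enough resources");
        }
        pool.available -= amount;
    }
    resource_units(resource_units&& o) noexcept
        : _pool(std::exchange(o._pool, nullptr))
        , _amount(std::exchange(o._amount, 0)) {}
    resource_units(const resource_units&) = delete;
    resource_units& operator=(const resource_units&) = delete;
    resource_units& operator=(resource_units&&) = delete;
    ~resource_units() {
        if (_pool) {
            _pool->available += _amount;
        }
    }
};

// Returns how many units remain while a guard for `amount` is alive.
std::size_t available_while_held(resource_pool& pool, std::size_t amount) {
    resource_units units(pool, amount);
    return pool.available;
}
```

With a pool of 10 units, holding 4 leaves 6 available, and the pool is back at 10 once the guard goes out of scope.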

Refs: #7256

Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200922150625.1253798-1-bdenes@scylladb.com>
(cherry picked from commit a0107ba1c6)
Message-Id: <20200924081408.236353-1-bdenes@scylladb.com>
2020-10-19 15:05:13 +03:00
Takuya ASADA
a813ff4da2 install.sh: set LC_ALL=en_US.UTF-8 on python3 thunk
scylla-python3 causes a segfault when a non-default locale is specified.
As a workaround, we need to set LC_ALL=en_US.UTF-8 on the python3 thunk.

Fixes #7408

Closes #7414

(cherry picked from commit ff129ee030)
2020-10-18 15:03:04 +03:00
Avi Kivity
d5936147f4 Merge "materialized views: Fix undefined behavior on base table schema changes" from Tomasz
"
The view_info object, which is attached to the schema object of the
view, contains a data structure called
"base_non_pk_columns_in_view_pk". This data structure contains column
ids of the base table so is valid only for a particular version of the
base table schema. This data structure is used by materialized view
code to interpret mutations of the base table, those coming from base
table writes, or reads of the base table done as part of view updates
or view building.

The base table schema version of that data structure must match the
schema version of the mutation fragments, otherwise we hit undefined
behavior. This may include aborts, exceptions, segfaults, or data
corruption (e.g. writes landing in the wrong column in the view).

Before this patch, we could get schema version mismatch here after the
base table was altered. That's because the view schema did not change
when the base table was altered.

Another problem was that view building was using the current table's schema
to interpret the fragments and invoke view building. That's incorrect for two
reasons. First, fragments generated by a reader must be accessed only using
the reader's schema. Second, base_non_pk_columns_in_view_pk of the recorded
view ptrs may no longer match the current base table schema, which is used
to generate the view updates.

Part of the fix is to extract base_non_pk_columns_in_view_pk into a
third entity called base_dependent_view_info, which changes both on
base table schema changes and view schema changes.

It is managed by a shared pointer so that we can take immutable
snapshots of it, just like with schema_ptr. When starting the view
update, the base table schema_ptr and the corresponding
base_dependent_view_info have to match. So we must obtain them
atomically, and base_dependent_view_info cannot change during update.

Also, whenever the base table schema changes, we must update
base_dependent_view_infos of all attached views (atomically) so that
it matches the base table schema.

Fixes #7061.

Tests:

  - unit (dev)
  - [v1] manual (reproduced using scylla binary and cqlsh)
"

* tag 'mv-schema-mismatch-fix-v2' of github.com:tgrabiec/scylla:
  db: view: Refactor view_info::initialize_base_dependent_fields()
  tests: mv: Test dropping columns from base table
  db: view: Fix incorrect schema access during view building after base table schema changes
  schema: Call on_internal_error() when out of range id is passed to column_at()
  db: views: Fix undefined behavior on base table schema changes
  db: views: Introduce has_base_non_pk_columns_in_view_pk()

(cherry picked from commit 3daa49f098)
2020-10-06 17:12:28 +03:00
Juliusz Stasiewicz
a3d3b4e185 tracing: Fix error on slow batches
`trace_keyspace_helper::make_slow_query_mutation_data` expected a
"query" key in its parameters, which does not appear in case of
e.g. batches of prepared statements. This is an example of a failing
`record.parameters`:
```
...{"query[0]" : "INSERT INTO ks.tbl (pk, i) values (?, ?);"},
{"query[1]" : "INSERT INTO ks.tbl (pk, i) values (?, ?);"}...
```

In such case Scylla recorded no trace and said:
```
ERROR 2020-09-28 10:09:36,696 [shard 3] trace_keyspace_helper - No
"query" parameter set for a session requesting a slow_query_log record
```

The fix is to leave the query empty if the key is not found. Users can still
retrieve the query contents from existing info.
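
A minimal sketch of the "leave it empty if not found" lookup, assuming the parameters can be modeled as a string-to-string map. Both the map type and the helper name `query_or_empty` are assumptions for illustration, not the actual `trace_keyspace_helper` code.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical helper: returns the value stored under "query", or an
// empty string when the key is absent (as with batches, whose keys
// look like "query[0]", "query[1]", ...).
std::string query_or_empty(const std::map<std::string, std::string>& params) {
    auto it = params.find("query");
    return it != params.end() ? it->second : std::string();
}
```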

Fixes #5843

Closes #7293

(cherry picked from commit 0afa738a8f)
2020-10-04 18:05:00 +03:00
Tomasz Grabiec
4ca2576c98 Merge "evictable_reader: validate buffer on reader recreation" from Botond
This series backports the evictable reader validation patchset (merged
as 97c99ea9f to master) to 4.1.

I only had to do changes to the tests.

Tests: unit(dev), some exception safety tests are failing with or
without my patchset

Fixes: #7208

* https://github.com/denesb/scylla.git denesb/evictable-reader-validate-buffer/backport-4.1:
  mutation_reader_test: add unit test for evictable reader self-validation
  evictable_reader: validate buffer after recreation the underlying
  evictable_reader: update_next_position(): only use peek'd position on partition boundary
  mutation_reader_test: add unit test for evictable reader range tombstone trimming
  evictable_reader: trim range tombstones to the read clustering range
  position_in_partition_view: add position_in_partition_view before_key() overload
  flat_mutation_reader: add buffer() accessor

(cherry picked from commit 7f3ffbc1c8)
2020-10-02 11:52:57 +02:00
Tomasz Grabiec
e99a0c7b89 schema: Fix race in schema version recalculation leading to stale schema version in gossip
Migration manager installs several feature change listeners:

    if (this_shard_id() == 0) {
        _feature_listeners.push_back(_feat.cluster_supports_view_virtual_columns().when_enabled(update_schema));
        _feature_listeners.push_back(_feat.cluster_supports_digest_insensitive_to_expiry().when_enabled(update_schema));
        _feature_listeners.push_back(_feat.cluster_supports_cdc().when_enabled(update_schema));
        _feature_listeners.push_back(_feat.cluster_supports_per_table_partitioners().when_enabled(update_schema));
    }

They will call update_schema_version_and_announce() when features are enabled, which does this:

    return update_schema_version(proxy, features).then([] (utils::UUID uuid) {
        return announce_schema_version(uuid);
    });

So it first updates the schema version and then publishes it via
gossip in announce_schema_version(). It is possible that the
announce_schema_version() part of the first schema change will be
deferred and will execute after the other four calls to
update_schema_version_and_announce(). It will install the old schema
version in gossip instead of the more recent one.

The fix is to serialize schema digest calculation and publishing.

Fixes #7200

(cherry picked from commit 1a57d641d1)
2020-10-01 18:18:53 +02:00
Yaron Kaikov
f8c7605657 release: prepare for 4.0.10 2020-09-28 20:33:24 +03:00
Avi Kivity
7b9e33dcd4 Update seastar submodule
* seastar e87ce4941c...065a40b34a (1):
  > lz4_fragmented_compressor: Fix buffer requirements

Fixes #6925.
2020-09-23 12:07:11 +03:00
Yaron Kaikov
d86a31097a release: prepare for 4.0.9 2020-09-17 14:24:32 +03:00
Nadav Har'El
bd9d6f8e45 alternator: fix corruption of PutItem operation in case of contention
This patch fixes a bug noted in issue #7218 - where PutItem operations
sometimes lose part of the item's data - some attributes were lost,
and the name of other attributes replaced by empty strings. The problem
happened when the write-isolation policy was LWT and there was contention
of writes to the same partition (not necessarily the same item).

To use CAS (a.k.a. LWT), Alternator builds an alternator::rmw_operation
object with an apply() function which takes the old contents of the item
(if needed) and a timestamp, and builds a mutation that the CAS should
apply. In the case of the PutItem operation, we wrongly assumed that apply()
will be called only once - so as an optimization the strings saved in the
put_item_operation were moved into the returned mutation. But this
optimization is wrong - when there is contention, apply() may be called
again when the change proposed by the previous one was not accepted by
the Paxos protocol.

The fix is to change the one place where put_item_operation *moved* strings
out of the saved operations into the mutations, to be a copy. But to prevent
this sort of bug from recurring in future code, this patch enlists the
compiler to help us verify that it can't happen: The apply() function is
marked "const" - it can use the information in the operation to build the
mutation, but it can never modify this information or move things out of it,
so it will be fine to call this function twice.

The single output field that apply() does write (_return_attributes) is
marked "mutable" to allow the const apply() to write to it anyway. Because
apply() might be called twice, it is important that if some apply()
implementation sometimes sets _return_attributes, then it must always
set it (even if to the default, empty, value) on every call to apply().
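
A minimal sketch of the idea, using a hypothetical `put_item_sketch` class rather than Alternator's real `alternator::rmw_operation` interface. Inside a `const` member function the saved members are `const`, so applying `std::move` to them can only produce a copy and the saved state survives a retry; the single `mutable` member is the one output the function may write.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for a read-modify-write operation.
class put_item_sketch {
    std::vector<std::string> _attributes;    // saved input, must survive retries
    mutable std::string _return_attributes;  // the one output apply() may write
public:
    explicit put_item_sketch(std::vector<std::string> attrs)
        : _attributes(std::move(attrs)) {}

    // const: _attributes cannot be modified or emptied here, so calling
    // apply() a second time sees the same saved strings.
    std::vector<std::string> apply(long timestamp) const {
        // Written on every call, so a retry never leaves a stale value.
        _return_attributes = "ts=" + std::to_string(timestamp);
        return _attributes; // a copy; the saved strings stay intact
    }

    const std::string& return_attributes() const { return _return_attributes; }
};
```

Calling `apply()` twice on the same object yields the same attribute list both times, which is the property the contention case depends on.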

The const apply() means that the compiler verifies for us that I didn't
forget to fix additional wrong std::move()s. Additionally, a test I wrote
to easily reproduce issue #7218 (which I will submit as a dtest later)
passes after this fix.

Fixes #7218.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200916064906.333420-1-nyh@scylladb.com>
(cherry picked from commit 5e8bdf6877)
2020-09-16 23:05:23 +03:00
Benny Halevy
11ef23e97a test: cql_query_test: test_cache_bypass: use table stats
test is currently flaky since system reads can happen
in the background and disturb the global row cache stats.

Use the table's row_cache stats instead.

Fixes #6773

Test: cql_query_test.test_cache_bypass(dev, debug)

Credit-to: Botond Dénes <bdenes@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200811140521.421813-1-bhalevy@scylladb.com>
(cherry picked from commit 6deba1d0b4)
2020-09-16 18:20:30 +03:00
Asias He
2c0eac09ae migration_manager: Make sync_schema return error when node is down
sync_schema is supposed to make sure that this node knows about all
schema changes known by "nodes" that were made prior to this call.

Currently, when a node is down, the sync is silently skipped.

To fix, add a flag to migration_task::run_may_throw to indicate that it
should fail if a node is down.

Fixes #4791

(cherry picked from commit 7ba821cbc0)
2020-09-16 16:01:44 +03:00
Dejan Mircevski
713a7269d0 cql3: Fix NULL reference in get_column_defs_for_filtering
There was a typo in get_column_defs_for_filtering(): it checked the
wrong pointer before dereferencing.  Add a test exposing the NULL
dereference and fix the typo.

Tests: unit (dev)

Fixes #7198.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
(cherry picked from commit 9d02f10c71)
2020-09-16 15:47:09 +03:00
Avi Kivity
1724301d4d reconcilable_result_builder: don't aggrevate out-of-memory condition during recovery
Consider an unpaged query that consumes all of available memory, despite
fea5067dfa which limits them (perhaps the
user raised the limit, or this is a system query). Eventually we will see a
bad_alloc which will abort the query and destroy this reconcilable_result_builder.

During destruction, we first destroy _memory_accounter, and then _result.
Destroying _memory_accounter resumes some continuations which can then
allocate memory synchronously when increasing the task queue to accommodate
them. We will then crash. Had we not crashed, we would immediately afterwards
release _result, freeing all the memory that we would ever need.

Fix by making _result the last member, so it is freed first.
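
The fix relies on the C++ rule that non-static data members are destroyed in reverse order of declaration. A small sketch that records the order, using hypothetical stand-in types (`tracked_member`, `builder_sketch`) instead of the real builder:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Records the order in which members are destroyed.
struct destruction_log {
    std::vector<std::string> events;
};

// Appends its name to the log when destroyed.
struct tracked_member {
    destruction_log* log;
    std::string name;
    ~tracked_member() { log->events.push_back(name); }
};

// Hypothetical stand-in for the builder: the memory-holding member is
// declared last, so it is destroyed first.
struct builder_sketch {
    tracked_member memory_accounter;
    tracked_member result;
    explicit builder_sketch(destruction_log& log)
        : memory_accounter{&log, "memory_accounter"}
        , result{&log, "result"} {}
};

// Builds and destroys a builder_sketch, returning the destruction order.
std::vector<std::string> destruction_order() {
    destruction_log log;
    {
        builder_sketch b(log);
    }
    return log.events;
}
```

Here `result` is recorded before `memory_accounter`, so the large buffer is released before anything that might need to allocate.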

Fixes #7240.

(cherry picked from commit 9421cfded4)
2020-09-16 15:41:10 +03:00
Avi Kivity
9971f2f5db Merge "Fix repair stalls in get_sync_boundary and apply_rows_on_master_in_thread" from Asias
"
This patch set fixes stalls in repair that are caused by std::list merge and clear operations during the test_latency_read_with_nemesis test.

Fixes #6940
Fixes #6975
Fixes #6976
"

* 'fix_repair_list_stall_merge_clear_v2' of github.com:asias/scylla:
  repair: Fix stall in apply_rows_on_master_in_thread and apply_rows_on_follower
  repair: Use clear_gently in get_sync_boundary to avoid stall
  utils: Add clear_gently
  repair: Use merge_to_gently to merge two lists
  utils: Add merge_to_gently

(cherry picked from commit 4547949420)
2020-09-10 13:15:01 +03:00
Avi Kivity
ee328c22ca repair: apply_rows_on_follower(): remove copy of repair_rows list
We copy a list, which was reported to generate a 15ms stall.

This is easily fixed by moving it instead, which is safe since this is
the last use of the variable.
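
A small sketch of why the move is cheap, assuming a plain `std::list` of strings in place of the real `repair_rows` list. Moving a `std::list` hands over the existing nodes, while copying allocates a new node for every element.

```cpp
#include <cassert>
#include <list>
#include <string>
#include <utility>

// Returns true if moving the list reused the existing nodes
// (no per-element allocation or copy).
bool move_reuses_nodes() {
    std::list<std::string> rows{"row1", "row2", "row3"};
    const std::string* first_before = &rows.front();
    std::list<std::string> moved = std::move(rows); // takes over the nodes
    return &moved.front() == first_before;
}

// Returns true if copying the list created separate nodes and left
// the source untouched.
bool copy_allocates_new_nodes() {
    std::list<std::string> rows{"row1", "row2", "row3"};
    std::list<std::string> copied = rows; // one new node per element
    return &copied.front() != &rows.front() && rows.size() == 3;
}
```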

Fixes #7115.

(cherry picked from commit 6ff12b7f79)
2020-09-10 11:53:55 +03:00
Avi Kivity
3a9c9a8a12 Update seastar submodule
* seastar 861b7edd61...e87ce4941c (1):
  > core/reactor: complete_timers(): restore previous scheduling group

Fixes #7184.
2020-09-07 11:28:55 +03:00
Raphael S. Carvalho
c03445871a compaction: Prevent non-regular compaction from picking compacting SSTables
After 8014c7124, cleanup can potentially pick a compacting SSTable.
Upgrade and scrub can also pick a compacting SSTable.
The problem is that table::candidates_for_compaction() was badly named.
It misleads the user into thinking that the SSTables returned are perfect
candidates for compaction, but the manager still needs to filter out the
compacting SSTables from the returned set. So it's being renamed.

When the same SSTable is compacted in parallel, the strategy invariant
can be broken (for example, overlapping can be introduced in LCS), and
deletions can fail because more than one compaction process would try
to delete the same files.

Let's fix scrub, cleanup and upgrade by calling the manager function
which gets the correct candidates for compaction.

Fixes #6938.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200811200135.25421-1-raphaelsc@scylladb.com>
(cherry picked from commit 11df96718a)
2020-09-06 18:41:12 +03:00
Takuya ASADA
565ac1b092 aws: update enhanced networking supported instance list
Sync enhanced networking supported instance list to latest one.

Reference: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/enhanced-networking.html

Fixes #6991

(cherry picked from commit 7cccb018b8)
2020-09-06 18:21:46 +03:00
Yaron Kaikov
7d1180b98f release: prepare for 4.0.8 2020-08-30 09:42:34 +03:00
Piotr Sarna
f258e6f6ee Merge 'counters: Fix filtering of counters' from Juliusz
Queries with `ALLOW FILTERING` and constraints on counter
values used to be rejected as "unimplemented". The reason
was a missing tri-comparator, which is added in this patch.

Fixes #5635

* jul-stas-5635-filtering-on-counters:
  cql/tests: Added test for filtering on counter columns
  counters: add comparator and remove `unimplemented` from restrictions

(cherry picked from commit c32faee657)
2020-08-27 18:42:30 +03:00
Avi Kivity
2708b0d664 Merge "repair: row_level: prevent deadlocks when repairing homogenous nodes" from Botond
"
This series backports the series "repair: row_level: prevent deadlocks
when repairing homogenous nodes" (merged as a9c7a1a86) to branch-4.1.
"

Fixes #6272

* 'repair-row-level-evictable-local-reader/branch-4.1' of https://github.com/denesb/scylla:
  repair: row_level: destroy reader on EOS or error
  repair: row_level: use evictable_reader for local reads
  mutation_reader: expose evictable_reader
  mutation_reader: evictable_reader: add auto_pause flag
  mutation_reader: make evictable_reader a flat_mutation_reader
  mutation_reader: s/inactive_shard_read/inactive_evictable_reader/
  mutation_reader: move inactive_shard_reader code up
  mutation_reader: fix indentation
  mutation_reader: shard_reader: extract remote_reader as evictable_reader
  mutation_reader: reader_lifecycle_policy: make semaphore() available early

(cherry picked from commit 59aa1834a7)
2020-08-27 17:44:27 +03:00
Asias He
e31ffbf2e6 compaction_manager: Avoid stall in perform_cleanup
The following stall was seen during a cleanup operation:

scylla: Reactor stalled for 16262 ms on shard 4.

| std::_MakeUniq<locator::tokens_iterator_impl>::__single_object std::make_unique<locator::tokens_iterator_impl, locator::tokens_iterator_impl&>(locator::tokens_iterator_impl&) at /usr/include/fmt/format.h:1158
|  (inlined by) locator::token_metadata::tokens_iterator::tokens_iterator(locator::token_metadata::tokens_iterator const&) at ./locator/token_metadata.cc:1602
| locator::simple_strategy::calculate_natural_endpoints(dht::token const&, locator::token_metadata&) const at simple_strategy.cc:?
|  (inlined by) locator::simple_strategy::calculate_natural_endpoints(dht::token const&, locator::token_metadata&) const at ./locator/simple_strategy.cc:56
| locator::abstract_replication_strategy::get_ranges(gms::inet_address, locator::token_metadata&) const at /usr/include/fmt/format.h:1158
| locator::abstract_replication_strategy::get_ranges(gms::inet_address) const at /usr/include/fmt/format.h:1158
| service::storage_service::get_ranges_for_endpoint(seastar::basic_sstring<char, unsigned int, 15u, true> const&, gms::inet_address const&) const at /usr/include/fmt/format.h:1158
| service::storage_service::get_local_ranges(seastar::basic_sstring<char, unsigned int, 15u, true> const&) const at /usr/include/fmt/format.h:1158
|  (inlined by) operator() at ./sstables/compaction_manager.cc:691
|  (inlined by) _M_invoke at /usr/include/c++/9/bits/std_function.h:286
| std::function<std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable> > > (table const&)>::operator()(table const&) const at /usr/include/fmt/format.h:1158
|  (inlined by) compaction_manager::rewrite_sstables(table*, sstables::compaction_options, std::function<std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable> > > (table const&)>) at ./sstables/compaction_manager.cc:604
| compaction_manager::perform_cleanup(table*) at /usr/include/fmt/format.h:1158

To fix, we futurize the function to get local ranges and sstables.

In addition, this patch removes the dependency on the global storage_service object.

Fixes #6662

(cherry picked from commit 07e253542d)
2020-08-27 13:11:39 +03:00
Raphael S. Carvalho
801994e299 sstables: optimize procedure that checks if a sstable needs cleanup
needs_cleanup() returns true if a sstable needs cleanup.

Turns out it's very slow because it iterates through all the local
ranges for all sstables in the set, making its complexity:
	O(num_sstables * local_ranges)

We can optimize it by taking into account that abstract_replication_strategy
documents that get_ranges() will return a list of ranges that is sorted
and non-overlapping. Compaction for cleanup already takes advantage of that
when checking if a given partition can be actually purged.

So needs_cleanup() can be optimized into O(num_sstables * log(local_ranges)).

With num_sstables=1000, RF=3, then local_ranges=256(num_tokens)*3, it means
the max # of checks performed will go from 768000 to ~9584.
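
The optimization can be sketched as a binary search over the sorted, non-overlapping ranges. This is an illustration under assumptions, not the actual `needs_cleanup()` code: `token_range_sketch` is a hypothetical inclusive integer range, and the sketch assumes an sstable needs no cleanup only when its whole token span lies inside a single owned range.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical simplified token range, inclusive on both ends.
struct token_range_sketch {
    std::int64_t start;
    std::int64_t end;
};

// Assumes `sorted_ranges` is sorted by start and non-overlapping, which
// also makes the range ends sorted. `first` and `last` are the sstable's
// first and last tokens.
bool needs_cleanup_sketch(const std::vector<token_range_sketch>& sorted_ranges,
                          std::int64_t first, std::int64_t last) {
    // Binary search for the first range whose end is >= `first`:
    // O(log(local_ranges)) instead of scanning every range.
    auto it = std::lower_bound(
        sorted_ranges.begin(), sorted_ranges.end(), first,
        [] (const token_range_sketch& r, std::int64_t token) {
            return r.end < token;
        });
    if (it == sorted_ranges.end()) {
        return true; // no owned range reaches the first token
    }
    // Only this range can contain `first`; check it covers the whole span.
    return !(it->start <= first && last <= it->end);
}
```

For the figures quoted above, 1000 sstables against 768 ranges is 768000 linear checks, while log2(768) is about 9.58, which gives roughly 9.6 thousand checks with binary search.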

Fixes #6730.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200629171355.45118-2-raphaelsc@scylladb.com>
(cherry picked from commit cf352e7c14)
2020-08-27 13:11:37 +03:00
Raphael S. Carvalho
3b932078bf sstables: export needs_cleanup()
May be needed elsewhere, like in a unit test.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200629171355.45118-1-raphaelsc@scylladb.com>
(cherry picked from commit a9eebdc778)
2020-08-27 13:11:24 +03:00
Asias He
608f62a0e9 abstract_replication_strategy: Add get_ranges_in_thread
Add a version that runs inside a seastar thread. The benefit is that
get_ranges can yield to avoid stalls.

Refs #6662

(cherry picked from commit 94995acedb)
2020-08-27 13:10:32 +03:00
Asias He
d8619d3320 abstract_replication_strategy: Add get_ranges which takes token_metadata
It is useful when the caller wants to calculate ranges using a
custom token_metadata.

It will be used soon in do_rebuild_replace_with_repair for replace
operation.

Refs: #5482
(cherry picked from commit b640614aa6)
2020-08-27 13:10:26 +03:00
Asias He
4f0c99a187 gossip: Fix race between shutdown message handler and apply_state_locally
1. The node1 is shutdown
2. The node1 sends shutdown message to node2
3. The node2 receives gossip shutdown message but the handler yields
4. The node1 is restarted
5. The node1 sends new gossip endpoint_state to node2, node2 applies the state
   in apply_state_locally and calls gossiper::handle_major_state_change
   and then calls gossiper::mark_alive
6. The shutdown message handler in step 3 resumes and sets status of node1 to SHUTDOWN
7. The gossiper::mark_alive fiber in step 5 resumes and calls gossiper::real_mark_alive,
   node2 will skip marking node1 as alive because the status of node1 is
   SHUTDOWN. As a result, node1 is alive but it is not marked as UP by node2.

To fix, we serialize the two operations.

Fixes #7032

(cherry picked from commit e6ceec1685)
2020-08-27 11:16:10 +03:00
Nadav Har'El
ada79df082 alternator test: configurable temporary directory
The test/alternator/run script creates a temporary directory for the Scylla
database in /tmp. The assumption was that this is the fastest disk (usually
even a ramdisk) on the test machine, and we didn't need anything else from
it.

But it turns out that on some systems, /tmp is actually a slow disk, so
this patch adds a way to configure the temporary directory - if the TMPDIR
environment variable exists, it is used instead of /tmp. As before this
patch, a temporary subdirectory is created in $TMPDIR, and this subdirectory
is automatically deleted when the test ends.

The test.py script already passes an appropriate TMPDIR (testlog/$mode),
which after this patch the Alternator test will use instead of /tmp.

Fixes #6750

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200713193023.788634-1-nyh@scylladb.com>
(cherry picked from commit 8e3be5e7d6)
2020-08-26 19:48:45 +03:00
Nadav Har'El
1935f2b480 alternator: fix order conditions on binary attributes
We implemented the order operators (LT, GT, LE, GE, BETWEEN) incorrectly
for binary attributes: DynamoDB requires that the bytes be treated as
unsigned for the purpose of order (so byte 128 is higher than 127), but
our implementation uses Scylla's "bytes" type which has signed bytes.

The solution is simple - we can continue to use the "bytes" type, but
we need to use its compare_unsigned() function, not its "<" operator.

This bug affected conditional operations ("Expected" and
"ConditionExpression") and also filters ("QueryFilter", "ScanFilter",
"FilterExpression"). The bug did *not* affect Query's key conditions
("KeyConditions", "KeyConditionExpression") because those already
used Scylla's key comparison functions - which correctly compare binary
blobs as unsigned bytes (in fact, this is why we have the
compare_unsigned() function).
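
The difference between the two orderings can be sketched with standard containers. The type alias and both function names are hypothetical, and the use of `std::memcmp` is an assumption made for the illustration; it is not a claim about how Scylla's compare_unsigned() is implemented.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Bytes stored as signed values, which is the source of the bug class
// described above.
using signed_bytes = std::vector<signed char>;

// Lexicographic comparison treating each byte as signed: byte 128,
// stored as -128, sorts below byte 127.
bool less_signed(const signed_bytes& a, const signed_bytes& b) {
    return a < b;
}

// Lexicographic comparison treating each byte as unsigned; std::memcmp
// interprets the bytes as unsigned char.
bool less_unsigned(const signed_bytes& a, const signed_bytes& b) {
    const std::size_t n = std::min(a.size(), b.size());
    const int c = n ? std::memcmp(a.data(), b.data(), n) : 0;
    return c != 0 ? c < 0 : a.size() < b.size();
}
```

Comparing the single bytes 127 and 128 gives opposite answers under the two functions, and only the unsigned one matches the ordering DynamoDB requires.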

The patch also adds tests that reproduce the bugs in conditional
operations, and show that the bug did not exist in key conditions.

Fixes #6573

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200603084257.394136-1-nyh@scylladb.com>
(cherry picked from commit f6b1f45d69)
Manually removed tests in test_key_conditions.py which didn't exist in this branch.
2020-08-26 19:28:47 +03:00
Avi Kivity
44a76ed231 Merge "Unregister RPC verbs on stop" from Pavel E
"
There are 5 services that register their RPC handlers in the messaging
service, but only a few of them unregister them on stop.

Unregistering is somewhat critical, not just because it makes the
code look clean, but also because unregistration does wait for the
message processing to complete, thus avoiding use-after-free's in
the handlers.

In particular, several handlers call service::get_schema_for_write()
which, in turn, may end up in service::maybe_sync() calling for
the local migration manager instance. All those handlers' processing
must be waited for before stopping the migration manager.

The set brings the RPC handlers unregistration in sync with the
registration part.

tests: unit (dev)
       dtest (dev: simple_boot_shutdown, repair)
       start-stop by hand (dev)
fixes: #6904
"

* 'br-rpc-unregister-verbs' of https://github.com/xemul/scylla:
  main: Add missing calls to unregister RPC hanlers
  messaging: Add missing per-service unregistering methods
  messaging: Add missing handlers unregistration helpers
  streaming: Do not use db->invoke_on_all in vain
  storage_proxy: Detach rpc unregistration from stop
  main: Shorten call to storage_proxy::init_messaging_service

(cherry picked from commit 01b838e291)
2020-08-26 14:42:40 +03:00
Raphael S. Carvalho
aeb49f4915 cql3/statements: verify that counter column cannot be added into non-counter table
A check validating that a counter column cannot be added to a non-counter table
is missing from the alter table statement. Validation is performed when building
the new schema, but it's limited to checking that a schema will not contain both
counter and non-counter columns.

Due to lack of validation, the added counter column could be incorrectly
persisted to the schema, but this results in a crash when setting the new
schema to its table. On restart, it can be confirmed that the schema change
was indeed persisted when describing the table.
This problem is fixed by doing proper validation for the alter table statement,
which consists of making sure a new counter column cannot be added to a
non-counter table.
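
A minimal sketch of such a check. The function names and the boolean parameters are hypothetical and do not reflect the cql3 alter table code; only the rule stated above (no counter column in a non-counter table) is modeled.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical check: throws if a counter column would be added to a
// non-counter table. Other combinations are outside this sketch.
void validate_added_column(bool table_is_counter, bool column_is_counter) {
    if (column_is_counter && !table_is_counter) {
        throw std::invalid_argument(
            "Cannot add a counter column to a non-counter table");
    }
}

// Convenience wrapper for callers that prefer a boolean result.
bool can_add_column(bool table_is_counter, bool column_is_counter) {
    try {
        validate_added_column(table_is_counter, column_is_counter);
        return true;
    } catch (const std::invalid_argument&) {
        return false;
    }
}
```

Rejecting the statement up front means the invalid column never reaches the persisted schema.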

The test cdc_disallow_cdc_for_counters_test is adjusted because one of its tests
was built on the assumption that counter column can be added into a non-counter
table.

Fixes #7065.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200824155709.34743-1-raphaelsc@scylladb.com>
(cherry picked from commit 1c29f0a43d)
2020-08-25 18:46:01 +03:00
Gleb Natapov
8d6b35ad20 lwt: fix possible leak of "prune" counter
If get_schema_for_read() fails, the "prune" counter will not be decremented.
The patch fixes it by creating the RAII object earlier. It also restores the
release of a mutation in release_mutation(), which was dropped by mistake.
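
A minimal sketch of why the guard has to exist before the call that can fail. `counter_guard` and `prune_sketch` are hypothetical names, and the failing schema lookup is simulated with a flag.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical RAII guard for an in-flight counter.
class counter_guard {
    int& _counter;
public:
    explicit counter_guard(int& counter) : _counter(counter) { ++_counter; }
    counter_guard(const counter_guard&) = delete;
    counter_guard& operator=(const counter_guard&) = delete;
    ~counter_guard() { --_counter; }
};

// Hypothetical operation: the guard is created before the step that may
// throw, so the counter is decremented on every exit path.
void prune_sketch(int& in_flight, bool schema_lookup_fails) {
    counter_guard guard(in_flight);
    if (schema_lookup_fails) {
        throw std::runtime_error("schema lookup failed");
    }
}

// Returns the counter value after a failing call.
int in_flight_after_failure() {
    int in_flight = 0;
    try {
        prune_sketch(in_flight, true);
    } catch (const std::runtime_error&) {
        // The guard's destructor has already run during unwinding.
    }
    return in_flight;
}
```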

Fixes #6124

Message-Id: <20200405080233.GA22509@scylladb.com>
(cherry picked from commit e5f7ccc4c8)
2020-08-23 19:29:06 +03:00
Takuya ASADA
b123700ebe dist/debian: disable debuginfo compression on .deb
Since older binutils on some distributions are not able to handle
compressed debuginfo generated on Fedora, we need to disable it.
However, the debian packager forces debuginfo compression since debian/compat = 9,
so we have to uncompress the files after they are compressed automatically.

Fixes #6982

(cherry picked from commit 75c2362c95)
2020-08-23 19:03:13 +03:00
Botond Dénes
6786b521f9 scylla-gdb.py: find_db(): don't return current shard's database for shard=0
The `shard` parameter of `find_db()` is optional and is defaulted to
`None`. When missing, the current shard's database instance is returned.
The problem is that the if condition checking this uses `not shard`,
which also evaluates to `True` if `shard == 0`, resulting in returning
the current shard's database instance for shard 0. Change the condition
to `shard is None` to avoid this.

Fixes: #7016
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200812091546.1704016-1-bdenes@scylladb.com>
(cherry picked from commit 4cfab59eb1)
2020-08-23 18:56:39 +03:00
Botond Dénes
fda0d1ae8e table: get_sstables_by_partition_key(): don't make a copy of selected sstables
Currently we assign the reference to the vector of selected sstables to
`auto sst`. This makes a copy and we pass this local variable to
`do_for_each()`, which will result in a use-after-free if the latter
defers.
Fix by not making a copy and instead just keep the reference.
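
The copy comes from how `auto` deduction works: plain `auto` drops the reference and produces a new object, while `auto&` keeps referring to the original. A small sketch with a hypothetical `table_sketch` type:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical owner of a vector that it hands out by reference.
struct table_sketch {
    std::vector<std::string> sstables{"a", "b"};
    const std::vector<std::string>& selected() const { return sstables; }
};

// Plain `auto` deduces std::vector<std::string>, so `sst` is a local
// copy whose lifetime ends with this function.
bool plain_auto_copies(const table_sketch& t) {
    auto sst = t.selected();
    return &sst != &t.selected();
}

// `auto&` deduces a reference, so `sst` aliases the vector owned by
// the table and stays valid as long as the table does.
bool auto_ref_aliases(const table_sketch& t) {
    auto& sst = t.selected();
    return &sst == &t.selected();
}
```

Passing the local copy to an asynchronous loop is what makes the original code dangerous: the copy is destroyed when the enclosing function returns, even if the loop has only deferred.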

Fixes: #7060

Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200818091241.2341332-1-bdenes@scylladb.com>
(cherry picked from commit 78f94ba36a)
2020-08-19 00:02:22 +03:00
Yaron Kaikov
e7cffb978a release: prepare for 4.0.7 2020-08-17 00:38:43 +03:00
Benny Halevy
79a1c74921 db::commitlog: close file if wrapping failed
When an I/O error (e.g. EMFILE / ENOSPC) happens we hit
an assert in ~append_challenged_posix_file_impl(): Assertion `_closing_state == state::closed' failed.

Commit 6160b9017d added close on failure
of the lambda defined in allocate_segment_ex, but it doesn't handle an error
after the file is opened/created while it is wrapped with commitlog_file_extensions.

Refs #5657

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Reviewed-by: Calle Wilund <calle@scylladb.com>
Message-Id: <20200414115231.298632-1-bhalevy@scylladb.com>
(cherry picked from commit 35892e4557)
2020-08-16 19:58:23 +03:00
Calle Wilund
3ee854f9fc cdc::log: Missing "preimage" check in row deletion pre-image
Fixes #6561

Pre-image generation in row deletion case only checked if we had a pre-image
result set row. But that can be from post-image. Also check actual existence
of the pre-image CK.
Message-Id: <20200608132804.23541-1-calle@scylladb.com>

(cherry picked from commit 5105e9f5e1)
2020-08-12 13:55:10 +03:00
Avi Kivity
2b65984d14 Merge "Fix GCC-10 related bugs and fix deletion of temporary garbage-collected sstables" from Raphael
"
Temporary garbage-collected SSTables, involved in the incremental
compaction process which can be enabled for LCS, were incorrectly
invalidating the cache when added to the set of SSTables. Also, those
same temporary SSTables could be incorrectly removed, causing deletion
warnings. The patchset "Don't invalidate row cache when adding GC
SSTable" fixes those two issues by using the SSTable replacement
mechanism, which is the correct method for replacing SSTables in the
set.
"

* 'backport_fix_issue_6275_for_branch_4_0' of github.com:raphaelsc/scylla:
  row_cache_alloc_stress_test: Make sure GCC can't delete a new
  tests: Wait for a few futures
  sstables/compaction: Don't invalidate row cache when adding GC SSTable to SSTable set
  sstables/compaction: Change meaning of compaction_completion_desc input and output fields
  sstables/compaction: Clean up code around garbage_collected_sstable_writer
  compaction: enhance compaction_descriptor with creator and replace function
2020-08-11 18:16:41 +03:00
Nadav Har'El
52d1099d09 Update Seastar submodule
> http: add "Expect: 100-continue" handling

Fixes #6844
2020-08-11 13:33:45 +03:00
Asias He
3a03906377 repair: Switch to btree_set for repair_hash.
In one of the longevity tests, we observed a 1.3s reactor stall which came from
repair_meta::get_full_row_hashes_source_op. It traced back to a call to
std::unordered_set::insert() which triggered big memory allocation and
reclaim.

I measured std::unordered_set, absl::flat_hash_set, absl::node_hash_set
and absl::btree_set. The absl::btree_set was the only one for which the seastar
oversized allocation checker did not warn in my tests, where around 300K
repair hashes were inserted into the container.

- unordered_set:
hash_sets=295634, time=333029199 ns

- flat_hash_set:
hash_sets=295634, time=312484711 ns

- node_hash_set:
hash_sets=295634, time=346195835 ns

- btree_set:
hash_sets=295634, time=341379801 ns

The btree_set is a bit slower than unordered_set but it does not make
huge memory allocations. I did not measure a real difference in the total
time to finish repair of the same dataset with unordered_set and btree_set.

To fix, switch to absl btree_set container.
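
The allocation pattern behind the stall can be illustrated with standard containers alone. This is only a sketch and not the code from the patch: std::set stands in for a tree-based container (absl::btree_set is not used here), and both helper names are hypothetical. It shows that a hash set keeps a bucket array that grows with the element count (one large allocation in common implementations), while a tree-based set allocates node by node.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <unordered_set>

// Hypothetical helper: insert n distinct hashes into a hash set and report
// the size of its bucket array. With the default max_load_factor of 1.0 the
// bucket count is at least the number of elements.
std::size_t buckets_after_inserting(std::size_t n) {
    std::unordered_set<std::uint64_t> hashes;
    for (std::uint64_t i = 0; i < n; ++i) {
        // Multiplying by an odd constant keeps the values distinct.
        hashes.insert(i * 0x9E3779B97F4A7C15ULL);
    }
    return hashes.bucket_count();
}

// Hypothetical helper: the same inserts into an ordered, node-based set.
// No single allocation here scales with n.
std::size_t ordered_size_after_inserting(std::size_t n) {
    std::set<std::uint64_t> hashes;
    for (std::uint64_t i = 0; i < n; ++i) {
        hashes.insert(i * 0x9E3779B97F4A7C15ULL);
    }
    return hashes.size();
}
```

Calling buckets_after_inserting(300000) returns a bucket count of at least 300000, which is the kind of single large array an oversized allocation checker would flag.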

Fixes #6190

(cherry picked from commit 67f6da6466)
(cherry picked from commit a27188886a)
2020-08-11 12:35:34 +03:00
Rafael Ávila de Espíndola
2395a240b4 build: Link with abseil
It is a pity we have to list so many libraries, but abseil doesn't
provide a .pc file.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
(cherry picked from commit 7d1f6725dd)

Ref #6190.
2020-08-11 12:35:32 +03:00
Rafael Ávila de Espíndola
d182c595a1 Add abseil as a submodule
This adds the https://abseil.io library as a submodule. The patch
series that follows needs a hash table that supports heterogeneous
lookup, and abseil has a really good hash table that supports that
(https://abseil.io/blog/20180927-swisstables).

The library is still not available in Fedora, but it is fairly easy to
use it directly from a submodule.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
(cherry picked from commit 383a9c6da9)

Ref #6190
2020-08-11 12:35:31 +03:00
Rafael Ávila de Espíndola
fe9c4611b3 configure: Don't overwrite seastar_cflags
The variable seastar_cflags was being used for flags passed to seastar
and for flags extracted from the seastar.pc file.

This introduces a new variable for the flags extracted from the
seastar.pc file.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
(cherry picked from commit 2ad09aefb6)

Ref #6190.
2020-08-11 12:35:28 +03:00
Calle Wilund
29df416720 database: Do not assert on replay positions if truncate does not flush
Fixes #6995

In c2c6c71 the assert on replay positions in flushed sstables discarded by
truncate was broken, by the fact that we no longer flush all sstables
unless auto snapshot is enabled.

This means the low_mark assertion does not hold, because we maybe/probably
never got around to creating the sstables that would hold said mark.

Note that the (old) change to not create sstables and then just delete
them is in itself good. But in that case we should not try to verify
the rp mark.

(cherry picked from commit 9620755c7f)
2020-08-10 23:28:00 +03:00
Nadav Har'El
1d3c00572c Update Seastar submodule with some backported fixes
Fixes #7008
  > futures_test: Don't use * on an optional without a value
  > net: Use offsetof instead of accessing a null pointer
  > allocator_test: Avoid undefined conversion
  > http: Don't use moved value
  > circular_buffer_fixed_capacity_test: Fix indentation
  > circular_buffer_fixed_capacity: Always mask indexes
  > rpc: Fix a use-after-return
2020-08-10 20:39:35 +03:00
Avi Kivity
9d6e2c5a71 Update seastar submodule
* seastar 4ee384e15f...2dbd81d5db (1):
  > memory: fix small aligned free memory corruption

Fixes #6831
2020-08-09 18:39:01 +03:00
Pavel Emelyanov
386741e3b7 storage_proxy_stats: Make get_ep_stat() noexcept
The .get_ep_stat(ep) call can throw when registering metrics (we have an
issue for it, #5697). This is not expected by its callers; in particular,
abstract_write_response_handler::timeout_cb breaks in the middle and
doesn't call on_timeout() and _proxy->remove_response_handler(),
which leaves the response handler neither removed nor released. In turn,
the unreleased response handler doesn't set the _ready future on which
response_wait() waits, so the write gets stuck.

Although the issue with .get_ep_stat() should be fixed, an exception in
it mustn't lead to deadlocks, so the fix is to make the get_ep_stat()
noexcept by catching the exception and returning a dummy stat object
instead to let caller(s) finish.
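
The pattern described above (catch the exception and hand back a dummy object) might look roughly like this. It is a simplified sketch, not the storage_proxy code: stats_registry, ep_stat, register_stat() and set_fail() are hypothetical names used only for illustration, and only get_ep_stat() is a name taken from the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

struct ep_stat {
    long writes = 0;
};

class stats_registry {
    std::map<std::string, ep_stat> _stats;
    ep_stat _dummy;                   // returned when registration fails
    bool _fail_registration = false;  // test hook simulating a metrics failure

    ep_stat& register_stat(const std::string& ep) {
        if (_fail_registration) {
            throw std::runtime_error("metrics registration failed");
        }
        return _stats[ep];
    }

public:
    void set_fail(bool v) { _fail_registration = v; }

    // Never throws: on any failure the caller gets a dummy stat object,
    // so it can finish its own cleanup instead of unwinding halfway.
    ep_stat& get_ep_stat(const std::string& ep) noexcept {
        try {
            auto it = _stats.find(ep);
            if (it != _stats.end()) {
                return it->second;
            }
            return register_stat(ep);
        } catch (...) {
            return _dummy;
        }
    }

    std::size_t registered() const { return _stats.size(); }
};
```

Updates written to the dummy object are simply lost, which is an acceptable trade-off when the alternative is a caller that never completes.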

Fixes #5985
Tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200430163639.5242-1-xemul@scylladb.com>
(cherry picked from commit 513ce1e6a5)
2020-08-09 18:18:50 +03:00
Avi Kivity
d0fdc3960a Merge 'hinted handoff: fix commitlog memory leak' from Piotr D
"
When commitlog is recreated in hints manager, only the shutdown() method is
called, but not release(). Because of that, some internal commitlog
objects (`segment_manager` and `segment`s) may be left pointing to each
other through shared_ptr reference cycles, which may result in a memory
leak when the parent commitlog object is destroyed.

This PR prevents memory leaks that may happen this way by calling
release() after shutdown() from the hints manager.
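
A reference cycle of this kind can be reproduced with std::shared_ptr. The sketch below is not the commitlog implementation: the two structs only borrow the names segment_manager and segment, and their members, release() body and make_manager_with_segment() are assumptions made for illustration.

```cpp
#include <cassert>
#include <memory>
#include <vector>

struct segment_manager;

struct segment {
    // Back-pointer to the owner: together with segment_manager::segments
    // this forms a shared_ptr cycle.
    std::shared_ptr<segment_manager> owner;
};

struct segment_manager {
    std::vector<std::shared_ptr<segment>> segments;

    // Hypothetical release(): drop the references that form the cycle so
    // that both objects can be destroyed.
    void release() { segments.clear(); }
};

// Builds a manager holding one segment that points back at it.
std::shared_ptr<segment_manager> make_manager_with_segment() {
    auto mgr = std::make_shared<segment_manager>();
    auto seg = std::make_shared<segment>();
    seg->owner = mgr;
    mgr->segments.push_back(seg);
    return mgr;
}
```

If the last external pointer to the manager is dropped without calling release(), the manager and its segment keep each other alive; calling release() first lets both be freed.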

Fixes: #6409, Fixes #6776
"

* piodul-fix-commitlog-memory-leak-in-hinted-handoff:
  hinted handoff: disable warnings about segments left on disk
  hinted handoff: release memory on commitlog termination

(cherry picked from commit 4c221855a1)
2020-08-09 17:26:17 +03:00
Tomasz Grabiec
4035cf4f9f thrift: Fix crash on unsorted column names in SlicePredicate
The column names in SlicePredicate can be passed in arbitrary order.
We converted them to clustering ranges in read_command preserving the
original order. As a result, the clustering ranges in read command may
appear out of order. This violates the storage engine's assumptions and
leads to undefined behavior.
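
One way to restore the ordering assumption is to normalize the requested names before building ranges from them. The sketch below is an assumption about the general approach, not the actual fix: normalize_column_names() is a hypothetical helper, and plain string ordering stands in for the schema's clustering comparator.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: sort the column names and drop duplicates so that
// the ranges derived from them are in order and do not overlap.
std::vector<std::string> normalize_column_names(std::vector<std::string> names) {
    std::sort(names.begin(), names.end());
    names.erase(std::unique(names.begin(), names.end()), names.end());
    return names;
}
```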

It was seen manifesting as a SIGSEGV or an abort in sstable reader
when executing a get_slice() thrift verb:

scylla: sstables/consumer.hh:476: seastar::future<> data_consumer::continuous_data_consumer<StateProcessor>::fast_forward_to(size_t, size_t) [with StateProcessor = sstables::data_consume_rows_context_m; size_t = long unsigned int]: Assertion `end >= _stream_position.position' failed.

Fixes #6486.

Tests:

   - added a new dtest to thrift_tests.py which reproduces the problem

Message-Id: <1596725657-15802-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit bfd129cffe)
2020-08-08 19:48:46 +03:00
Rafael Ávila de Espíndola
09367742b1 row_cache_alloc_stress_test: Make sure GCC can't delete a new
We want to test that a std::bad_alloc is thrown, but GCC 10 has a new
optimization (-fallocation-dce) that removes dead allocations.

This patch assigns the value returned by new to a global so that GCC
cannot delete it.
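
The idea of making an allocation observable through a global can be sketched as follows. This is not the test code from the patch: g_allocation_sink, try_allocate() and release_allocation() are hypothetical names, and the sketch calls operator new directly rather than using a new expression.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Global sink: once the pointer is stored here it has escaped, so the
// compiler can no longer treat the allocation as dead.
void* g_allocation_sink = nullptr;

// Returns true if the allocation succeeded, false if std::bad_alloc was
// thrown. The result of operator new is kept in the global sink.
bool try_allocate(std::size_t size) {
    try {
        g_allocation_sink = ::operator new(size);
        return true;
    } catch (const std::bad_alloc&) {
        return false;
    }
}

// Frees whatever try_allocate() stored and resets the sink.
void release_allocation() {
    ::operator delete(g_allocation_sink);
    g_allocation_sink = nullptr;
}
```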

With this all tests in a dev build pass with GCC 10.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200424201531.225807-1-espindola@scylladb.com>
(cherry picked from commit 0d89bbd57f)
2020-08-07 16:49:33 -03:00
Rafael Ávila de Espíndola
a18ff57b29 tests: Wait for a few futures
GCC 10 now warns on these. This fixes the dev build with gcc 10.

backport note: remove unneeded change which is not compatible
with the branch in error_injection_test.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200424161006.17857-1-espindola@scylladb.com>
(cherry picked from commit 543a9ebd9b)
2020-08-07 16:32:12 -03:00
Raphael S. Carvalho
4734ba21a7 sstables/compaction: Don't invalidate row cache when adding GC SSTable to SSTable set
Garbage collected SSTable is incorrectly added to SSTable set with a function
that invalidates row cache. This problem is fixed by adding GC SStable
to set using mechanism which replaces old sstables with new sstables.

Also, adding GC SSTable to set in a separate call is not correct.
We should make sure that GC SSTable reaches the SSTable set at the same time
its respective old (input) SSTable is removed from the set, and that's done
using a single request call to table.

Fixes #5956.
Fixes #6275.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit a214ccdf89)
2020-08-06 19:08:46 -03:00
Raphael S. Carvalho
425af4c543 sstables/compaction: Change meaning of compaction_completion_desc input and output fields
input_sstables is renamed to old_sstables and is about old SSTables that should be
deleted and removed from the SSTable set.
output_sstables is renamed to new_sstables and is about new SSTable that should be
added to the SSTable set, replacing the old ones.

This will allow us, for example, to add auxiliary SSTables to SSTable set using
the same call which replaces output SSTables by input SSTables in compaction.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 8f4458f1d5)
2020-08-06 18:51:21 -03:00
Raphael S. Carvalho
55f096d01b sstables/compaction: Clean up code around garbage_collected_sstable_writer
This cleanup allows us to get rid of the ugly compaction::create_new_sstable(),
and reduce complexity by getting rid of observable.

garbage_collected_sstable_writer::data is introduced to allow compaction to
directly communicate with the GC writer, which is stored in mutation_compaction,
making it unreachable after the compaction has started. By making compaction
store GC writer's data and using that same data to create g__c__s__w,
compaction is able to communicate with GC writer without the complexity of
observable utility. This move is important for the subsequent work which
will fix a couple of issues regarding management of GC SSTables.

[Backport note: there were a few conflicts as this patch was written after
interposer consumer, but the conflicts weren't hard to solve]

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit cc5e0d8da8)
2020-08-06 18:01:12 -03:00
Glauber Costa
fc79da5912 compaction: enhance compaction_descriptor with creator and replace function
There are many differences between resharding and compaction that are
artificial, arising more from the way we ended up implementing it than
necessity. This patch attempts to pass the creator and replacer functions
through the compaction_descriptor.

There is a difference between the creator function for resharding and
regular compaction: resharding has to pass the shard number on behalf
of which the SSTable is created. However regular compactions can just
ignore this. No need to have a special path just for this.

After this is done, the constructor for the compaction object can be
greatly simplified. In further patches I intend to simplify it a bit
further, but some more cleanup has to happen first.

To make that happen we have to construct a compaction_descriptor object
inside the resharding function. This is temporary: resharding currently
works with a descriptor, but at some point that descriptor is lost and
broken into pieces to be passed to this function. The overarching goal
of this work is exactly to be able to keep that descriptor for as long
as possible, which should simplify things a lot.

Callers are patched, but there are plenty for sstable_datafile_test.cc.
For their benefit, a helper function is provided to keep the previous
signature (test only).

Signed-off-by: Glauber Costa <glauber@scylladb.com>
(cherry picked from commit e8801cd77b)
2020-08-06 17:45:40 -03:00
Yaron Kaikov
da9e7080ca release: prepare for 4.0.6 2020-08-06 14:18:11 +03:00
Takuya ASADA
01b0195c22 scylla_util.py: always use relocatable CLI tools
For some CLI tools, command options may differ between the latest version
and older versions.
To maximize compatibility of the setup scripts, we should always use the
relocatable CLI tools instead of the distribution's version of the tool.

Related #6954

(cherry picked from commit a19a62e6f6)
2020-08-03 10:42:14 +03:00
Takuya ASADA
d05b567a40 create-relocatable-package.py: add lsblk for relocatable CLI tools
We need the latest version of lsblk, which supports partition type UUID.

Fixes #6954

(cherry picked from commit 6ba2a6c42e)
2020-08-03 10:42:12 +03:00
Juliusz Stasiewicz
2c11efbbae aggregate_fcts: Use per-type comparators for dynamic types
For collections and UDTs the `MIN()` and `MAX()` functions are
generated on the fly. Until now they worked by comparing just the
byte representations of arguments.

This patch uses specific per-type comparators to provide semantically
sensible, dynamically created aggregates.
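
Why byte-wise comparison is not semantically sensible can be seen even with a plain integer. The sketch below uses a big-endian 32-bit integer instead of a collection or UDT to keep it short; all names in it are hypothetical and it is not the aggregate code from the patch.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>

using bytes = std::array<std::uint8_t, 4>;

// Serialize a signed 32-bit value as big-endian bytes.
bytes serialize_be(std::int32_t v) {
    const auto u = static_cast<std::uint32_t>(v);
    return bytes{{static_cast<std::uint8_t>(u >> 24),
                  static_cast<std::uint8_t>(u >> 16),
                  static_cast<std::uint8_t>(u >> 8),
                  static_cast<std::uint8_t>(u)}};
}

std::int32_t deserialize_be(const bytes& b) {
    const std::uint32_t u = (std::uint32_t(b[0]) << 24) | (std::uint32_t(b[1]) << 16) |
                            (std::uint32_t(b[2]) << 8) | std::uint32_t(b[3]);
    return static_cast<std::int32_t>(u);
}

// Compares the raw byte representations.
bool less_bytewise(const bytes& a, const bytes& b) {
    return std::lexicographical_compare(a.begin(), a.end(), b.begin(), b.end());
}

// Compares the values the bytes stand for.
bool less_typed(const bytes& a, const bytes& b) {
    return deserialize_be(a) < deserialize_be(b);
}
```

With these helpers, -1 serializes to ff ff ff ff and therefore sorts after 1 byte-wise, while the typed comparison orders the two values correctly.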

Fixes #6768

(cherry picked from commit 5b438e79be)
2020-08-03 10:26:28 +03:00
Calle Wilund
c60d71dc69 cql3::lists: Fix setter_by_uuid not handling null value
Fixes #6828

When using the scylla list index from UUID extension,
null values were not handled properly, causing the
underlying layer to throw.

(cherry picked from commit 3b74b9585f)
2020-08-03 10:20:28 +03:00
Takuya ASADA
79930048db scylla_post_install.sh: generate memory.conf for CentOS7
On CentOS7, systemd does not support percentage-based parameters.
To apply the memory parameter on CentOS7, we need to override the parameter
in bytes, instead of as a percentage.

Fixes #6783

(cherry picked from commit 3a25e7285b)
2020-07-30 16:41:40 +03:00
Tomasz Grabiec
82b4f4a6c2 commitlog: Fix use-after-free on mutation object during replay
The mutation object may be freed prematurely during commitlog replay
in the schema upgrading path. We will hit the problem if the memtable
is full and apply_in_memory() needs to defer.

This will typically manifest as a segfault.

Fixes #6953

Introduced in 79935df

Tests:
  - manual using scylla binary. Reproduced the problem then verified the fix makes it go away

Message-Id: <1596044010-27296-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 3486eba1ce)
2020-07-30 16:37:08 +03:00
Avi Kivity
5b99195d21 dist: debian: do not require root during package build
Debian package builds provide a root environment for the installation
scripts, since that's what typical installation scripts expect. To
avoid providing actual root, a "fakeroot" system is used where syscalls
are intercepted and any effect that requires root (like chown) is emulated.

However, fakeroot sporadically fails for us, aborting the package build.
Since our install scripts don't really require root (when operating in
the --packaging mode), we can just tell dpkg-buildpackage that we don't
need fakeroot. This ought to fix the sporadic failures.

As a side effect, package builds are faster.

Fixes #6655.

(cherry picked from commit b608af870b)
2020-07-29 16:03:53 +03:00
Takuya ASADA
edde256228 scylla_setup: skip boot partition
On GCE, /dev/sda14 is reported as an unused disk, but it is the BIOS boot
partition. It should not be used as a scylla data partition, and it is
too small for that anyway.

It is better to exclude such partitions from the unused disk list.

Fixes #6636

(cherry picked from commit d7de9518fe)
2020-07-29 09:51:05 +03:00
Asias He
3cf28ac18e repair: Fix race between create_writer and wait_for_writer_done
We saw scylla hit use-after-free in repair with the following procedure during tests:

- n1 and n2 in the cluster

- n2 ran decommission

- n2 sent data to n1 using repair

- n2 was forcibly killed

- n1 tried to remove repair_meta for n1

- n1 hit use after free on repair_meta object

This was what happened on n1:

1) data was received -> do_apply_rows was called -> yield before create_writer() was called

2) repair_meta::stop() was called -> wait_for_writer_done() / do_wait_for_writer_done was called
   with _writer_done[node_idx] not engaged

3) step 1 resumed, create_writer() was called and _repair_writer object was referenced

4) repair_meta::stop() finished, repair_meta object and its member _repair_writer was destroyed

5) The fiber created by create_writer() at step 3 hit use after free on _repair_writer object

To fix, we should call wait_for_writer_done() after any pending
operations protected by repair_meta::_gate are done. This
prevents wait_for_writer_done() from finishing while the writer is
still in the process of being created.

Fixes: #6853
Fixes: #6868
Backports: 4.0, 4.1, 4.2
(cherry picked from commit e6f640441a)
2020-07-29 09:51:02 +03:00
Raphael S. Carvalho
58b65f61c0 sstable: index_reader: Make sure streams are all properly closed on failure
Turns out the fix f591c9c710 wasn't enough to make sure all input streams
are properly closed on failure.
It only closes the main input stream that belongs to context, but it misses
all the input streams that can be opened in the consumer for promoted index
reading. Consumer stores a list of indexes, where each of them has its own
input stream. On failure, we need to make sure that every single one of
them is properly closed before destroying the indexes as that could cause
memory corruption due to read ahead.

Fixes #6924.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200727182214.377140-1-raphaelsc@scylladb.com>
(cherry picked from commit 0d70efa58e)
2020-07-29 09:50:48 +03:00
Yaron Kaikov
466cfb0ca6 release: prepare for 4.0.5 2020-07-28 09:13:02 +03:00
Raphael S. Carvalho
1cd6f50806 table: Fix Staging SSTables being incorrectly added or removed from the backlog tracker
Staging SSTables can be incorrectly added or removed from the backlog tracker,
after an ALTER TABLE or TRUNCATE, because the add and removal don't take
into account if the SSTable requires view building, so a Staging SSTable can
be added to the tracker after an ALTER TABLE, or removed after a TRUNCATE,
even though not added previously, potentially causing the backlog to
become negative.

Fixes #6798.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200716180737.944269-1-raphaelsc@scylladb.com>
(cherry picked from commit b67066cae2)
2020-07-21 12:57:41 +03:00
Asias He
3f6fe7328a repair: Relax size check of get_row_diff and set_diff
In case of a row hash conflict, a hash in set_diff will get more than one
row from get_row_diff.

For example,

Node1 (Repair master):
row1  -> hash1
row2  -> hash2
row3  -> hash3
row3' -> hash3

Node2 (Repair follower):
row1  -> hash1
row2  -> hash2

We will have set_diff = {hash3} between node1 and node2, while
get_row_diff({hash3}) will return two rows: row3 and row3'. And the
error below was observed:

   repair - Got error in row level repair: std::runtime_error
   (row_diff.size() != set_diff.size())

In this case, node1 should send both row3 and row3' to the peer node
instead of failing the whole repair. Node2 has neither row3 nor
row3'; otherwise node1 would not send rows with hash3 to node2 in the
first place.
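
The mismatch between the two sizes can be reproduced with a small model. This is a sketch under assumptions, not the repair code: rows are plain strings, hashes are integers, and this get_row_diff() is a hypothetical stand-in that only shares its name with the real function.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <vector>

using row_hash = std::uint64_t;

// Returns every row whose hash is in set_diff. With a hash conflict, one
// hash maps to several rows, so the result can be larger than set_diff.
std::vector<std::string> get_row_diff(const std::multimap<row_hash, std::string>& rows,
                                      const std::set<row_hash>& set_diff) {
    std::vector<std::string> out;
    for (row_hash h : set_diff) {
        auto range = rows.equal_range(h);
        for (auto it = range.first; it != range.second; ++it) {
            out.push_back(it->second);
        }
    }
    return out;
}
```

For the example above, set_diff holds one hash but two rows come back, so a strict equality check on the sizes rejects a legitimate result; requiring at least as many rows as hashes does not.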

Refs: #6252
(cherry picked from commit a00ab8688f)
2020-07-15 14:49:29 +03:00
Hagit Segev
f9dd8608eb release: prepare for 4.0.4 2020-07-14 14:10:39 +03:00
Avi Kivity
24a80cbf47 Update seastar submodule
* seastar a73b92ff2e...4ee384e15f (2):
  > futures: Add a test for a broken promise in a parallel_for_each
  > future: Call set_to_broken_promise earlier

Fixes #6749 (probably)
2020-07-13 20:32:27 +03:00
Dmitry Kropachev
6e4edc97ad dist/common/scripts/scylla-housekeeping: wrap urllib.request with try ... except
We could hit "cannot serialize '_io.BufferedReader' object" when a request gets a 404 error from the server.
Now you will get a legitimate error message in that case.

Fixes #6690

(cherry picked from commit de82b3efae)
2020-07-09 18:25:35 +03:00
Dejan Mircevski
81df28b6f3 cql/restrictions: Handle WHERE a>0 AND a<0
WHERE clauses with start point above the end point were handled
incorrectly.  When the slice bounds are transformed to interval
bounds, the resulting interval is interpreted as wrap-around (because
start > end), so it contains all values above 0 and all values below
0.  This is clearly incorrect, as the user's intent was to filter out
all possible values of a.

Fix it by explicitly short-circuiting to false when start > end.  Add
a test case.
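
The difference between the wrap-around interpretation and the short-circuit can be sketched with integer bounds. The struct and both functions are hypothetical and far simpler than the restrictions code; the sketch uses a > 5 AND a < 3 as an unambiguous case of start > end.

```cpp
#include <cassert>

// Exclusive bounds: the restriction is start < a < end.
struct bounds {
    int start;
    int end;
};

// Old behavior (sketch): start > end is read as a wrap-around interval,
// so everything above start or below end matches.
bool contains_wrapping(bounds b, int a) {
    if (b.start > b.end) {
        return a > b.start || a < b.end;
    }
    return a > b.start && a < b.end;
}

// Fixed behavior (sketch): start > end can match nothing.
bool contains_fixed(bounds b, int a) {
    if (b.start > b.end) {
        return false;
    }
    return a > b.start && a < b.end;
}
```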

Fixes #5799.

Tests: unit (dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
(cherry picked from commit 921dbd0978)
2020-07-08 13:25:06 +03:00
Juliusz Stasiewicz
ea6620e9eb counters: Read the state under timeout
Counter update is a RMW operation. Until now the "Read" part was
not guarded by a timeout, which is changed in this patch.

Fixes #5069

(cherry picked from commit e04fd9f774)
2020-07-07 20:45:26 +03:00
Takuya ASADA
19be84dafd scylla_setup: don't add same disk device twice
We shouldn't accept adding the same disk twice at the RAID prompt.

Fixes #6711

(cherry picked from commit 835e76fdfc)
2020-07-07 13:08:36 +03:00
Pavel Emelyanov
2ff897d351 main: Keep feature_service for storage_proxy
Fixes #6250

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200423165608.32419-1-xemul@scylladb.com>
(cherry picked from commit 98635b74a6)
2020-07-07 12:42:52 +03:00
Botond Dénes
8fc3300739 sstables: sstable_reader: fix read range upper bound calculation for reverse slices
The single-key sstable reader uses the clustering ranges from the slice
to determine the upper bound of the disk read-range using the index.
For this it simply uses the end bound of the last clustering range. For
reverse reads however the clustering ranges in the slice are in reverse
order, so this will in fact be the upper bound of the smallest range.
Depending on whether the distance between the clustering ranges is big
enough for the sstable reader to use the index to skip between them,
this will lead to either reading too little data or an assert failure.

This patch fixes the problematic function `get_slice_upper_bound()` to
consider reverse reads as well.
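
The effect of ignoring the read direction can be shown with a toy version of the function. Only the name get_slice_upper_bound() comes from the commit message; the signature, the clustering_range struct and the integer bounds are assumptions made for this sketch.

```cpp
#include <cassert>
#include <optional>
#include <vector>

struct clustering_range {
    int start;
    int end;
};

// For forward reads the ranges are ascending, so the last range holds the
// largest end bound. For reverse reads the ranges are stored in reverse
// order, so the largest end bound is in the first range.
std::optional<int> get_slice_upper_bound(const std::vector<clustering_range>& ranges,
                                         bool reversed) {
    if (ranges.empty()) {
        return std::nullopt;
    }
    return reversed ? ranges.front().end : ranges.back().end;
}
```

Taking the last range of a reversed slice (which is what happens when the reversed flag is ignored) yields the end of the smallest range and therefore a disk read range that is too short.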

Initially I thought there would be more mishandling of reverse slices,
but actually `mutation_fragment_filter`, the component doing the actual
slicing of rows, is already reverse-slice aware.

A unit test which reproduces the assert failure is also added.

Fixes: #6171

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200507114956.271799-1-bdenes@scylladb.com>
(cherry picked from commit 791acc7f38)
2020-07-05 16:02:15 +03:00
Raphael S. Carvalho
d2ac7d4b18 compaction: Fix partition estimation with TWCS interposer
Max and min windows are microsecond timestamps, which should be divided
by window size in microseconds to properly estimate window count
based on provided mutation_source_metadata.

Found this problem after properly setting mutation_source_metadata with
min and max metadata on behalf of regular compaction.
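
The unit handling can be sketched as follows. The exact formula used by the interposer is not shown in this log, so the window-index arithmetic below is an assumption, and estimate_window_count() is a hypothetical name; the point is only that microsecond timestamps are divided by a window size that is also expressed in microseconds.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Estimate how many time windows the range [min_ts_us, max_ts_us] spans.
// Both timestamps are in microseconds, so the window size is converted to
// microseconds before dividing.
std::uint64_t estimate_window_count(std::int64_t min_ts_us, std::int64_t max_ts_us,
                                    std::chrono::seconds window_size) {
    const std::int64_t window_us =
        std::chrono::duration_cast<std::chrono::microseconds>(window_size).count();
    if (window_us <= 0 || max_ts_us < min_ts_us) {
        return 1;
    }
    // Index of the window each timestamp falls into; the count is the
    // distance between the two indexes, inclusive.
    return static_cast<std::uint64_t>(max_ts_us / window_us - min_ts_us / window_us) + 1;
}
```

Dividing the same microsecond timestamps by a window size in seconds instead would inflate the estimate by a factor of one million.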

Fixes #6214.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200409194235.6004-2-raphaelsc@scylladb.com>
(cherry picked from commit 3edff36cd2)
2020-07-05 15:27:40 +03:00
Avi Kivity
61706a6789 Update seastar submodule
* seastar 0dc0fec831...a73b92ff2e (1):
  > rpc::compressor: Fix static init fiasco with names

Fixes #5963
2020-07-02 18:08:52 +03:00
Piotr Sarna
65aa531010 db: set gc grace period to 0 for local system tables
Local system tables from `system` namespace use LocalStrategy
replication, so they do not need to be concerned about gc grace
period. Some system tables already set gc grace period to 0,
but other ones, including system.large_partitions, did not.
That may result in millions of tombstones being needlessly
kept for these tables, which can cause read timeouts.

Fixes #6325
Tests: unit(dev), local(running cqlsh and playing with system tables)

(cherry picked from commit bf5f247bc5)
2020-07-01 13:13:57 +03:00
Benny Halevy
4bffd0f522 api: storage_service: serialize true_snapshot_size
Following up on 91b71a0b1a
We also need to serialize storage_service::true_snapshots_size
with snapshot-modifying operations.

It seems like it was assumed that get_snapshot_details
is done under run_snapshot_list_operation, but the one called
here is the table method, not the api::storage_service::get_snapshot_details.

Fixes #5603

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200506115732.483966-1-bhalevy@scylladb.com>
(cherry picked from commit 682fb3acfd)
2020-07-01 13:09:43 +03:00
Rafael Ávila de Espíndola
9409fc7290 gms: Don't keep references to reallocated vector entries
These callbacks can block a seastar thread and the underlying vector
can be reallocated concurrently.

This is no different than if it was a plain std::vector and the
solution is similar: use values instead of references.
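
The hazard and the fix can be sketched with a plain std::vector. This is not the gossiper code: notify_all() and the subscriber strings are hypothetical, and the callback mutating the vector stands in for another fiber modifying it while the callback blocks.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Invokes cb for every subscriber. The callback is allowed to grow the
// vector, which may reallocate its storage.
std::vector<std::string> notify_all(std::vector<std::string>& subscribers,
                                    const std::function<void(const std::string&)>& cb) {
    std::vector<std::string> visited;
    for (std::size_t i = 0; i < subscribers.size(); ++i) {
        // Copy the entry. A reference to subscribers[i] could dangle as
        // soon as cb() causes a reallocation.
        std::string current = subscribers[i];
        cb(current);
        visited.push_back(current);
    }
    return visited;
}
```

Iterating by index rather than by iterator serves the same purpose: an index stays meaningful after the vector reallocates.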

Fixes #6230

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200422182304.120906-1-espindola@scylladb.com>
(cherry picked from commit d8555513a9)
2020-07-01 12:58:56 +03:00
Pavel Solodovnikov
86faf1b3ca cql3: avoid using shared_ptr's in unrecognized_entity_exception
Using shared_ptr's in `unrecognized_entity_exception` can lead
to cross-cpu deletion of a pointer which will trigger an assert
`_cpu == std::this_thread::get_id()' when shared_ptr is disposed.

Copy `column_identifier` to the exception object and avoid using
an instance of `cql3::relation`: just get a string representation
from it since nothing more is used in associated exception
handling code.

Fixes: #6287
Tests: unit(dev, debug), dtest(lwt_destructive_ddl_test.py:LwtDestructiveDDLTest.test_rename_column)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200506155714.150497-1-pa.solodovnikov@scylladb.com>
(cherry picked from commit 1d3f9174c5)
2020-07-01 12:54:09 +03:00
Raphael S. Carvalho
426295bda9 compaction: Fix the 2x disk space requirement in SSTable upgrade
SSTable upgrade is requiring 2x the space of input SSTables because
we aren't releasing references of the SSTables that were already
upgraded. So if we're upgrading 1TB, it means that up to 2TB may be
required for the upgrade operation to succeed.

That can be fixed by moving all input SSTables when rewrite_sstables()
asks for the set of SSTables to be compacted, so allowing their space
to be released as soon as there is no longer any ref to them.

Spotted while auditing code.

Fixes #6682.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200619205701.92891-1-raphaelsc@scylladb.com>
(cherry picked from commit 52180f91d4)
2020-07-01 12:37:38 +03:00
Raphael S. Carvalho
c6fde0e562 cql3: don't reset default TTL when not explicitly specified in alter table statement
Any alter table statement that doesn't explicitly set the default time
to live will reset it to 0.

That can be very dangerous for time series use cases, which rely on
all data being eventually expired, and a default TTL of 0 means
data never being expired.
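
The intended behavior (touch the default TTL only when the statement names it) can be modeled with std::optional. The types and apply_alter() below are hypothetical and unrelated to the cql3 classes; they only illustrate the distinction between "not specified" and "explicitly set to 0".

```cpp
#include <cassert>
#include <optional>

struct table_props {
    int default_ttl = 0;  // seconds; 0 means data never expires
};

struct alter_request {
    // Empty when the ALTER TABLE statement does not mention the default TTL.
    std::optional<int> default_ttl;
};

// Applies an ALTER TABLE request, leaving the TTL untouched unless the
// request sets it explicitly.
table_props apply_alter(table_props current, const alter_request& req) {
    if (req.default_ttl) {
        current.default_ttl = *req.default_ttl;
    }
    return current;
}
```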

Fixes #5048.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200402211653.25603-1-raphaelsc@scylladb.com>
(cherry picked from commit 044f80b1b5)
2020-06-30 19:28:50 +03:00
Avi Kivity
d9f9e7455b Merge "Fix handling of decimals with negative scales" from Rafael
"
Before this series scylla would effectively infinite loop when, for
example, casting a decimal with a negative scale to float.

Fixes #6720
"

* 'espindola/fix-decimal-issue' of https://github.com/espindola/scylla:
  big_decimal: Add a test for a corner case
  big_decimal: Correctly handle negative scales
  big_decimal: Add a as_rational member function
  big_decimal: Move constructors out of line

(cherry picked from commit 3e2eeec83a)
2020-06-29 12:26:06 +03:00
Piotr Sarna
e95bcd0f8f alternator: fix propagating tags
Updating tags was erroneously done locally, which means that
the schema change was not propagated to other nodes.
The new code announces new schema globally.

Fixes #6513
Branches: 4.0,4.1
Tests: unit(dev)
       dtest(alternator_tests.AlternatorTest.test_update_condition_expression_and_write_isolation)
Message-Id: <3a816c4ecc33c03af4f36e51b11f195c231e7ce1.1592935039.git.sarna@scylladb.com>

(cherry picked from commit f4e8cfe03b)
2020-06-24 14:10:36 +03:00
Asias He
2ff6e2e122 streaming: Do not send end of stream in case of error
Currently, the sender sends stream_mutation_fragments_cmd::end_of_stream to
the receiver when an error is received from a peer node. To be safe, send
stream_mutation_fragments_cmd::error instead of
stream_mutation_fragments_cmd::end_of_stream to prevent end_of_stream from
being written into the sstable when a partition is not closed yet.

In addition, use mutation_fragment_stream_validator to validate the
mutation fragments emitted from the reader, e.g., check if
partition_start and partition_end are paired when the reader is done. If
not, fail the stream session and send
stream_mutation_fragments_cmd::error instead of
stream_mutation_fragments_cmd::end_of_stream to isolate the problematic
sstables on the sender node.
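
The pairing check can be sketched as a small state machine. This is a simplified model and not mutation_fragment_stream_validator: the enums and both functions are hypothetical, and real fragment streams carry more kinds of fragments than the three used here.

```cpp
#include <cassert>
#include <vector>

enum class fragment_kind { partition_start, row, partition_end };
enum class stream_cmd { end_of_stream, error };

// Returns true only if every partition_start is closed by a matching
// partition_end and no fragment appears outside a partition.
bool is_valid_stream(const std::vector<fragment_kind>& fragments) {
    bool in_partition = false;
    for (fragment_kind f : fragments) {
        switch (f) {
        case fragment_kind::partition_start:
            if (in_partition) { return false; }
            in_partition = true;
            break;
        case fragment_kind::row:
            if (!in_partition) { return false; }
            break;
        case fragment_kind::partition_end:
            if (!in_partition) { return false; }
            in_partition = false;
            break;
        }
    }
    return !in_partition;
}

// Chooses the final command: end_of_stream only for a well-formed stream.
stream_cmd final_command(const std::vector<fragment_kind>& fragments) {
    return is_valid_stream(fragments) ? stream_cmd::end_of_stream : stream_cmd::error;
}
```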

Refs: #6478
(cherry picked from commit a521c429e1)
2020-06-23 12:48:01 +03:00
Hagit Segev
1fcf38abd9 release: prepare for 4.0.3 2020-06-21 21:46:49 +03:00
Alejo Sanchez
3375b8b86c lwt: validate before constructing metadata
LWT batch conditions can't span multiple tables.
This was detected in batch_statement::validate() called in ::prepare().
But ::cas_result_set_metadata() was built in the constructor,
causing a bitset assert/crash in a reported scenario.
This patch moves validate() to the constructor before building metadata.

Closes #6332

Tested with https://github.com/scylladb/scylla-dtest/pull/1465

[avi: adjust spelling of exception message to 4.0 spelling]

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
(cherry picked from commit d1521e6721)
2020-06-21 18:22:08 +03:00
Gleb Natapov
586546ab32 cql transport: do not log broken pipe error when a client closes its side of a connection abruptly
Fixes #5661

Message-Id: <20200615075958.GL335449@scylladb.com>
(cherry picked from commit 7ca937778d)
2020-06-21 13:09:10 +03:00
Amnon Heiman
e1d558cb01 api/storage_service.cc: stream result of token_range
The get token range API response can become big, which can cause large
allocations and stalls.

This patch replaces the implementation so that it streams the results
using the http stream capabilities instead of serializing and sending
one big buffer.

Fixes #6297

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
(cherry picked from commit 7c4562d532)
2020-06-21 12:57:34 +03:00
Avi Kivity
b0a8f396b4 Update seastar submodule
* seastar 447aad8d78...0dc0fec831 (1):
  > membarrier: fix madvise(MADV_DONTNEED) failure and crash with --lock-memory

Fixes #6346.
2020-06-21 12:35:39 +03:00
Rafael Ávila de Espíndola
48e7ee374a configure: Reduce the dynamic linker path size
gdb has a SO_NAME_MAX_PATH_SIZE of 512, so we use that as the path
size.

Fixes: #6494

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200528202741.398695-2-espindola@scylladb.com>
(cherry picked from commit aa778ec152)
2020-06-21 12:27:19 +03:00
Piotr Sarna
3e85ecd1bd alternator: fix the return type of PutItem
Even if there are no attributes to return from PutItem requests,
we should return a valid JSON object, not an empty string.

Fixes #6568
Tests: unit(dev)

(cherry picked from commit 8fc3ca855e)
2020-06-21 12:21:30 +03:00
Piotr Sarna
930a4af8b3 alternator: fix returning UnprocessedKeys unconditionally
Client libraries (e.g. PynamoDB) expect the UnprocessedKeys
and UnprocessedItems attributes to appear in the response
unconditionally - it's hereby added, along with a simple test case.

Fixes #6569
Tests: unit(dev)

(cherry picked from commit 3aff52f56e)
2020-06-21 12:19:34 +03:00
Tomasz Grabiec
6a6d36058a row_cache: Fix undefined behavior on key linearization
This is relevant only when using partition or clustering keys which
have a representation in memory which is larger than 12.8 KB (10% of
LSA segment size).

There are several places in code (cache, background garbage
collection) which may need to linearize keys because of performing key
comparison, but it's not done safely:

 1) the code does not run with the LSA region locked, so pointers may
get invalidated on linearization if it needs to reclaim memory. This
is fixed by running the code inside an allocating section.

 2) LSA region is locked, but the scope of
with_linearized_managed_bytes() encloses the allocating section. If
allocating section needs to reclaim, linearization context will
contain invalidated pointers. The fix is to reorder the scopes so
that linearization context lives within an allocating section.

Example of 1 can be found in
range_populating_reader::handle_end_of_stream() where it performs a
lookup:

  auto prev = std::prev(it);
  if (prev->key().equal(*_cache._schema, *_last_key->_key)) {
     it->set_continuous(true);

but handle_end_of_stream() is not invoked under allocating section.

Example of 2 can be found in mutation_cleaner_impl::merge_some() where
it does:

  return with_linearized_managed_bytes([&] {
  ...
    return _worker_state->alloc_section(region, [&] {

Fixes #6637.
Refs #6108.

Tests:

  - unit (all)

Message-Id: <1592218544-9435-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit e81fc1f095)
2020-06-21 11:58:39 +03:00
Yaron Kaikov
ce57d0174d release: prepare for 4.0.2 2020-06-15 20:52:58 +03:00
Avi Kivity
cd11f210ad tools: toolchain: regenerate for gnutls 3.6.14
CVE-2020-13777.

Fixes #6627.

Toolchain source image registry disambiguated due to tighter podman defaults.
2020-06-15 07:58:31 +03:00
Calle Wilund
1e2e203cf0 gms::inet_address: Fix sign extension error in custom address formatting
Fixes #5808

Seems some gcc:s will generate the code as sign extending. Mine does not,
but this should be more correct anyhow.

Added small stringify test to serialization_test for inet_address

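The expression that was changed is not quoted in this message. As a general illustration of this class of bug (not the actual gms::inet_address code), a byte held in a signed char is sign-extended when widened, so 0xff turns into 0xffffffff unless it is first cast to an unsigned 8-bit type. The helper name hex_group below is hypothetical:

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: formats two address bytes as one hex group.
// Each byte is cast through std::uint8_t before widening, so a byte
// with the high bit set cannot be sign-extended, whatever the
// signedness of plain char on the target.
std::string hex_group(signed char hi, signed char lo) {
    unsigned value = (static_cast<unsigned>(static_cast<std::uint8_t>(hi)) << 8)
                   | static_cast<unsigned>(static_cast<std::uint8_t>(lo));
    char buf[8];
    std::snprintf(buf, sizeof(buf), "%x", value);
    return buf;
}
```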
(cherry picked from commit a14a28cdf4)
2020-06-09 20:16:37 +03:00
Takuya ASADA
1a98c93a25 aws: update enhanced networking supported instance list
Sync enhanced networking supported instance list to latest one.

Reference: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/enhanced-networking.html

Fixes #6540

(cherry picked from commit 969c4258cf)
2020-06-09 16:02:27 +03:00
Calle Wilund
4f4845c94c commitlog_test: Ensure "when_over_disk_limit" reads segment list only once
Fixes #6195

test_commitlog_delete_when_over_disk_limit reads current segment list
in flush handler, to compare with result after allowing deletion of
segment. However, it might be called more than once in rare cases,
because of timing and our use of rather small sizes.

Reading the list the second time however is not a good idea, because
it might just very well be exactly the same as what we read in the
test check code, and we actually overwrite the list we want to
check against. Because callback is on timer. And test is not.

Message-Id: <20200414114322.13268-1-calle@scylladb.com>
[ penberg: backported fix random failures in commitlog_test ]
(cherry picked from commit a62d75fed5)
2020-06-01 18:41:18 +03:00
Nadav Har'El
ef745e1ce7 alternator: fix support for bytes type in Query's KeyConditions
Our parsing of values in a KeyConditions parameter of Query was done naively.
As a result, we got bizarre error messages "condition not met: false" when
these values had incorrect type (this is issue #6490). Worse - the naive
conversion did not decode base64-encoded bytes value as needed, so
KeyConditions on bytes-typed keys did not work at all.

This patch fixes these bugs by using our existing utility function
get_key_from_typed_value(), which takes care of throwing sensible errors
when types don't match, and decoding base64 as needed.

Unfortunately, we didn't have test coverage for many of the KeyConditions
features including bytes keys, which is why this issue escaped detection.
A patch will follow with much more comprehensive tests for KeyConditions,
which also reproduce this issue and verify that it is fixed.

Refs #6490
Fixes #6495

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200524141800.104950-1-nyh@scylladb.com>
(cherry picked from commit 6b38126a8f)
2020-05-31 14:02:18 +03:00
Calle Wilund
ae32aa970a commitlog::read_log_file: Preserve subscription across reading
Fixes #6265

Return type for read_log_file was previously changed from
subscription to future<>, returning the previously returned
subscription's result of done(). But it did not preserve the
subscription itself, which in turn will cause us to (in
work::stream), call back into a deleted object.

Message-Id: <20200422090856.5218-1-calle@scylladb.com>
(cherry picked from commit 525b283326)
2020-05-25 13:07:33 +03:00
Eliran Sinvani
a3eb12c5f1 Auth: return correct error code when role is not found
Scylla returns the wrong error code (0000 - server internal error)
in response to trying to do authentication/authorization operations
that involve a non-existing role.
This commit changes those cases to return error code 2200 (invalid
query) which is the correct one and also the one that Cassandra
returns.
Tests:
    Unit tests (Dev)
    All auth and auth_role dtests

(cherry picked from commit ce8cebe34801f0ef0e327a32f37442b513ffc214)

Fixes #6363.
2020-05-25 12:58:09 +03:00
Amnon Heiman
b5cedfc177 storage_service: get_range_to_address_map prevent use after free
The implementation of get_range_to_address_map has a default behaviour,
when getting an empty keyspace, it uses the first non-system keyspace
(first here is basically, just a keyspace).

The current implementation has two issues. First, it uses a reference to
a string that is held on the stack of another function. In other words,
there's a use after free, and it is not clear why we never hit it.

The second, it calls get_non_system_keyspaces twice. Though this is not
a bug, it's redundant (get_non_system_keyspaces uses a loop, so calling
that function does have a cost).

This patch solves both issues. First, it changes the implementation to hold a
string instead of a reference to a string.

Second, it stores the results from get_non_system_keyspaces and reuses
them, which is more efficient and holds the returned values on the local
stack.

Fixes #6465

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
(cherry picked from commit 69a46d4179)
2020-05-25 12:48:26 +03:00
Hagit Segev
8d9bc57aca release: prepare for 4.0.1 2020-05-24 21:39:44 +03:00
Tomasz Grabiec
1cbda629a2 sstables: index_reader: Fix overflow when calculating promoted index end
When index file is larger than 4GB, offset calculation will overflow
uint32_t and _promoted_index_end will be too small.

As a result, promoted_index_size calculation will underflow and the
rest of the page will be interpreted as a promoted index.

The partitions which are in the remainder of the index page will not
be found by single-partition queries.

Data is not lost.

Introduced in 6c5f8e0eda.

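A minimal sketch of the truncation described above, assuming the end offset is a sum of an entry offset, a header size and the promoted index size. The function and parameter names are hypothetical and are not taken from index_reader:

```cpp
#include <cstdint>

// Buggy variant (illustration only): the sum is narrowed to 32 bits,
// so it wraps around once the index file offset passes 4 GiB.
inline std::uint64_t promoted_index_end_truncated(std::uint64_t entry_offset,
                                                  std::uint32_t header_size,
                                                  std::uint32_t promoted_index_size) {
    std::uint32_t end = static_cast<std::uint32_t>(
        entry_offset + header_size + promoted_index_size);
    return end;
}

// Fixed variant: the whole calculation stays in 64 bits.
inline std::uint64_t promoted_index_end_wide(std::uint64_t entry_offset,
                                             std::uint32_t header_size,
                                             std::uint32_t promoted_index_size) {
    return entry_offset + header_size + promoted_index_size;
}
```

With an entry offset just above 4 GiB, the truncated variant yields an end position far smaller than the entry's own start, which is the "too small" _promoted_index_end described above.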
Fixes #6040
Message-Id: <20200521174822.8350-1-tgrabiec@scylladb.com>

(cherry picked from commit a6c87a7b9e)
2020-05-24 09:45:55 +03:00
Rafael Ávila de Espíndola
baf0201a6e repair: Make sure sinks are always closed
In a recent next failure I got the following backtrace

    function=function@entry=0x270360 "seastar::rpc::sink_impl<Serializer, Out>::~sink_impl() [with Serializer = netw::serializer; Out = {repair_row_on_wire_with_cmd}]") at assert.c:101
    at ./seastar/include/seastar/core/shared_ptr.hh:463
    at repair/row_level.cc:2059

This patch changes a few functions to use finally to make sure the sink
is always closed.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200515202803.60020-1-espindola@scylladb.com>
(cherry picked from commit 311fbe2f0a)

Ref #6414
2020-05-20 09:00:44 +03:00
Asias He
7dcffb963c repair: Fix race between write_end_of_stream and apply_rows
Consider: n1, n2, n1 is the repair master, n2 is the repair follower.

=== Case 1 ===
1) n1 sends missing rows {r1, r2} to n2
2) n2 runs apply_rows_on_follower to apply rows, e.g., {r1, r2}, r1
   is written to sstable, r2 is not written yet, r1 belongs to
   partition 1, r2 belongs to partition 2. It yields after row r1 is
   written.
   data: partition_start, r1
3) n1 sends repair_row_level_stop to n2 because error has happened on n1
4) n2 calls wait_for_writer_done() which in turn calls write_end_of_stream()
   data: partition_start, r1, partition_end
5) Step 2 resumes to apply the rows.
   data: partition_start, r1, partition_end, partition_end, partition_start, r2

=== Case 2 ===
1) n1 sends missing rows {r1, r2} to n2
2) n2 runs apply_rows_on_follower to apply rows, e.g., {r1, r2}, r1
   is written to sstable, r2 is not written yet, r1 belongs to partition
   1, r2 belongs to partition 2. It yields after partition_start for r2
   is written but before _partition_opened is set to true.
   data: partition_start, r1, partition_end, partition_start
3) n1 sends repair_row_level_stop to n2 because error has happened on n1
4) n2 calls wait_for_writer_done() which in turn calls write_end_of_stream().
   Since _partition_opened[node_idx] is false, partition_end is skipped,
   end_of_stream is written.
   data: partition_start, r1, partition_end, partition_start, end_of_stream

This causes unbalanced partition_start and partition_end in the stream
written to sstables.

To fix, serialize the write_end_of_stream and apply_rows with a semaphore.

Fixes: #6394
Fixes: #6296
Fixes: #6414
(cherry picked from commit b2c4d9fdbc)
2020-05-20 08:08:11 +03:00
Piotr Dulikowski
dcfaf4d035 hinted handoff: don't keep positions of old hints in rps_set
When sending hints from one file, rps_set field in send_one_file_ctx
keeps track of commitlog positions of hints that are being currently
sent, or have failed to be sent. At the end of the operation, if sending
of some hints failed, we will choose position of the earliest hint that
failed to be sent, and will retry sending that file later, starting from
that position. This position is stored in _last_not_complete_rp.

Usually, this set has a bounded size, because we impose a limit of at
most 128 hints being sent concurrently. Because we do not attempt to
send any more hints after a failure is detected, rps_set should not have
more than 128 elements at a time.

Due to a bug, commitlog positions of old hints (older than
gc_grace_seconds of the destination table) were inserted into rps_set
but not removed after checking their age. This could cause rps_set to
grow very large when replaying a file with old hints.

Moreover, if the file mixed expired and non-expired hints (which could
happen if it had hints to two tables with different gc_grace_seconds),
and sending of some non-expired hints failed, then positions of expired
hints could influence the calculation of _last_not_complete_rp, and more hints
than necessary would be resent on the next retry.

This simple patch removes commitlog position of a hint from rps_set when
it is detected to be too old.

Fixes #6422

(cherry picked from commit 85d5c3d5ee)
2020-05-20 08:06:04 +03:00
Piotr Dulikowski
f974a54cbd hinted handoff: remove discarded hint positions from rps_set
Related commit: 85d5c3d

When attempting to send a hint, an exception might occur that results in
that hint being discarded (e.g. keyspace or table of the hint was
removed).

When such an exception is thrown, position of the hint will already be
stored in rps_set. We are only allowed to retain positions of hints that
failed to be sent and needed to be retried later. Dropping a hint is not
an error, therefore its position should be removed from rps_set - but
current logic does not do that.

Because of that bug, hint files with many discardable hints might cause
rps_set to grow large when the file is replayed. Furthermore, leaving
positions of such hints in rps_set might cause more hints than necessary
to be re-sent if some non-discarded hints fail to be sent.

This commit fixes the problem by removing positions of discarded hints
from rps_set.

Fixes #6433

(cherry picked from commit 0c5ac0da98)
2020-05-20 08:03:44 +03:00
Piotr Sarna
30a96cc592 db, view: remove duplicate entries from pending endpoints
When generating view updates, an endpoint can appear both
as a primary paired endpoint for the view update, and as a pending
endpoint (due to range movements). In order not to generate
the same update twice for the same endpoint, the paired endpoint
is removed from the list of pending endpoints if present.

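The deduplication step can be sketched roughly as follows. This is an illustration only: endpoints are modelled as plain strings and the function name without_paired is hypothetical, not the code in db/view:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Returns the pending endpoints with the paired endpoint removed, so
// that the same view update is not generated twice for one endpoint.
std::vector<std::string> without_paired(std::vector<std::string> pending,
                                        const std::string& paired) {
    pending.erase(std::remove(pending.begin(), pending.end(), paired),
                  pending.end());
    return pending;
}
```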
Fixes #5459
Tests: unit(dev),
       dtest(TestMaterializedViews.add_dc_during_mv_insert_test)

(cherry picked from commit 86b0dd81e3)
2020-05-17 19:09:58 +02:00
Avi Kivity
faf300382a Update seastar submodule
* seastar 8bc24f486a...447aad8d78 (1):
  > timer: add scheduling_group awareness

Fixes #6170.
2020-05-10 18:12:32 +03:00
Gleb Natapov
55400598ff storage_proxy: limit read repair only to replicas that answered during speculative reads
Speculative reader has more targets than needed for CL. In case there is
a digest mismatch the repair runs between all of them, but that violates
provided CL. The patch makes it so that repair runs only between
replicas that answered (there will be CL of them).

Fixes #6123

Reviewed-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20200402132245.GA21956@scylladb.com>
(cherry picked from commit 36a24bbb70)
2020-05-07 19:48:24 +03:00
Mike Goltsov
c177295bce fix error in fstrim service (scylla_util.py)
On Centos 7 machine:

fstrim.timer is not enabled, only unmasked, due to scylla_fstrim_setup on installation.
When trying to run the scylla-fstrim service manually you get an error:

Traceback (most recent call last):
File "/opt/scylladb/scripts/libexec/scylla_fstrim", line 60, in <module>
main()
File "/opt/scylladb/scripts/libexec/scylla_fstrim", line 44, in main
cfg = parse_scylla_dirs_with_default(conf=args.config)
File "/opt/scylladb/scripts/scylla_util.py", line 484, in parse_scylla_dirs_with_default
if key not in y or not y[k]:
NameError: name 'k' is not defined

It is caused by an error in scylla_util.py: the failing line uses the undefined name k where key was intended.

Fixes #6294.

(cherry picked from commit 068bb3a5bf)
2020-05-07 19:45:35 +03:00
Hagit Segev
d95aa77b62 release: prepare for 4.0.0 2020-05-05 18:58:39 +03:00
Pekka Enberg
fe54009855 scripts/jobs: Keep memory reserve when calculating parallelism
The "jobs" script is used to determine the amount of compilation
parallelism on a machine. It attempts to ensure each GCC process has at
least 4 GB of memory per core. However, in the worst case scenario, we
could end up having the GCC processes take up all the system memory,
forcing swapping or the OOM killer to kick in. For example, on a 4 core
machine with 16 GB of memory, this worst case scenario seems easy to
trigger in practice.

Fix up the problem by keeping a 1 GB of memory reserve for other
processes and calculating parallelism based on that.

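The arithmetic can be sketched as below. The 4 GB per job and 1 GB reserve come from the description above; capping at the core count, the lower bound of one job, and the name parallel_jobs are assumptions, and the real "jobs" script may compute this differently:

```cpp
#include <algorithm>
#include <cstdint>

// Number of parallel compile jobs: limited by the core count and by
// the memory left after setting aside a reserve for other processes.
unsigned parallel_jobs(unsigned cores, std::uint64_t mem_gb,
                       std::uint64_t reserve_gb = 1,
                       std::uint64_t per_job_gb = 4) {
    std::uint64_t usable = mem_gb > reserve_gb ? mem_gb - reserve_gb : 0;
    std::uint64_t by_memory = usable / per_job_gb;
    std::uint64_t jobs = std::min<std::uint64_t>(cores, by_memory);
    // Always allow at least one job, even on very small machines.
    return static_cast<unsigned>(std::max<std::uint64_t>(jobs, 1));
}
```

For the 4 core, 16 GB machine mentioned above, this gives 3 jobs with the reserve instead of 4 without it.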
Message-Id: <20200423082753.31162-1-penberg@scylladb.com>
(cherry picked from commit 7304a795e5)
2020-05-04 19:01:14 +03:00
Piotr Sarna
bbe82236be clocks-impl: switch to thread-safe time conversion
std::gmtime() has a sad property of using a global static buffer
for returning its value. This is not thread-safe, so its usage
is replaced with gmtime_r, which can accept a local buffer.
While no regressions were observed in this particular area of code,
a similar bug caused failures in alternator, so it's better to simply
replace all std::gmtime calls with their thread-safe counterpart.

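For reference, a minimal sketch of the thread-safe pattern, using the POSIX gmtime_r with a caller-owned buffer. The helper name format_utc and the output format are illustrative assumptions, not the code in clocks-impl:

```cpp
#include <time.h>   // ::gmtime_r (POSIX)
#include <ctime>
#include <string>

// Formats a timestamp as UTC without touching the static buffer that
// std::gmtime() shares between all callers.
std::string format_utc(std::time_t t) {
    std::tm tm_buf{};                       // local buffer, one per call
    if (::gmtime_r(&t, &tm_buf) == nullptr) {
        return {};                          // conversion failed
    }
    char out[32];
    std::size_t n = std::strftime(out, sizeof(out), "%Y-%m-%dT%H:%M:%SZ", &tm_buf);
    return std::string(out, n);
}
```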
Message-Id: <39e91c74de95f8313e6bb0b12114bf12c0e79519.1588589151.git.sarna@scylladb.com>
(cherry picked from commit 05ec95134a)
2020-05-04 17:14:28 +03:00
Piotr Sarna
abd73cab78 alternator: fix signature timestamps
Generating timestamps for auth signatures used a non-thread-safe
::gmtime function instead of thread-safe ::gmtime_r.

Tests: unit(dev)
Fixes #6345

(cherry picked from commit fb7fa7f442)
2020-05-04 17:05:39 +03:00
Nadav Har'El
8fd7cf5cd1 alternator test: drastically reduce time to boot Scylla
The alternator test, test/alternator/run, runs Scylla and runs the
various tests against it. Before this patch, just booting Scylla took
about 26 seconds (for a dev build, on my laptop). This patch reduces
this delay to less than one second!

It turns out that almost the entire delay was artificial, two periods
of 12 seconds "waiting for the gossip to settle", which are completely
unnecessary in the one-node cluster used in the Alternator test.
So a simple "--skip-wait-for-gossip-to-settle 0" parameter eliminates
these long delays completely.

Amusingly, the Scylla boot is now so fast, that I had to change a "sleep 2"
in the test script to "sleep 1", because 2 seconds is now much more than
it takes to boot Scylla :-)

Fixes #6310.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200428145035.22894-1-nyh@scylladb.com>
(cherry picked from commit ff5615d59d)
2020-05-04 16:10:27 +03:00
Alejo Sanchez
dd88b2dd18 utils: error injection allocate string for remote invoke
Allocate string before sending to other shards.

Reported by Pavel Solodovnikov.

Refs #3295 (closed)

Tests: unit ({dev})

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Message-Id: <20200328204454.1326514-2-alejo.sanchez@scylladb.com>
(cherry picked from commit e5a2ba32b9)

Ref #6342.
2020-05-03 19:33:34 +03:00
Hagit Segev
eee4c00e29 release: prepare for 4.0.rc3 2020-05-01 00:46:40 +03:00
Avi Kivity
85071ceeb1 Merge 'Fix hang in multishard_writer' from Asias
"
This series fixes a hang in multishard_writer when an error happens. It contains
- multishard_writer: Abort the queue attached to consumers when producer fails
- repair: Fix hang when the writer is dead

Fixes #6241
Refs: #6248
"

* asias-stream_fix_multishard_writer_hang:
  repair: Fix hang when the writer is dead
  mutation_writer_test: Add test_multishard_writer_producer_aborts
  multishard_writer: Abort the queue attached to consumers when producer fails

(cherry picked from commit 8925e00e96)
2020-04-30 19:32:12 +03:00
Asias He
4cf201fc24 config: Do not enable repair based node operations by default
Give it some more time to mature. Use the old stream plan based node
operations by default.

Fixes: #6305
Backports: 4.0
(cherry picked from commit b8ac10c451)
2020-04-30 17:57:55 +03:00
Raphael S. Carvalho
c6ad5cf556 api/service: fix segfault when taking a snapshot without keyspace specified
If no keyspace is specified when taking snapshot, there will be a segfault
because keynames is unconditionally dereferenced. Let's return an error
because a keyspace must be specified when column families are specified.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200427195634.99940-1-raphaelsc@scylladb.com>
(cherry picked from commit 02e046608f)

Fixes #6336.
2020-04-30 12:49:13 +03:00
Piotr Sarna
51e3e6c655 Update seastar submodule
* seastar 251bc8f2...8bc24f48 (1):
  > http: make headers case-insensitive

Fixes #6319
2020-04-30 08:18:01 +02:00
Nadav Har'El
8ac6579b30 test.py: run Alternator test with the correct Scylla binary
The Alternator test's run script, test/alternator/run, runs Scylla.
By default, it chooses the last built Scylla executable build/*/scylla.

However, test.py has a "mode" option, that should be able to choose which
build mode to run. Before this patch, this mode option wasn't honored by
the Alternator test, so a "test.py alternator/run" would run the same
Scylla binary (the one last built) three times, instead of running each
of the three build modes.

We fix this in this patch: test.py now passes the "SCYLLA" environment
variable to the test/alternator/run script, indicating the location of the
Scylla binary with the appropriate build mode. The script already supported
this environment variable to override its default choice of Scylla binary.

In test.py, we add to the run_test() function an optional "env" parameter
which can be used to pass additional environment variables to the test.

Fixes #6286

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200427131958.28248-1-nyh@scylladb.com>
(cherry picked from commit 858a12755b)
2020-04-28 16:19:07 +03:00
Piotr Sarna
3744e66244 alternator: fix integer overflow warning in token generation
When generating tokens for parallel scan, debug mode undefined behavior
sanitizer complained that integer overflow sometimes happens when
multiplying two big values - delta and segment number.
In order to mitigate this warning, the multiplication is now split
into two smaller ones, and the generated machine code remains
identical (verified on gcc and clang via compiler explorer).

Fixes #6280
Tests: unit(dev)

(cherry picked from commit e17c237feb)
2020-04-28 16:15:31 +03:00
Piotr Sarna
d3bf349484 alternator: allow parallel scan
Parallel scans can be performed by providing Segment and TotalSegments
attributes to Scan request, which can be used to split the work among
many workers.
This patch makes the parallel scan test succeed, so the xfail is removed.

Fixes #5059

(cherry picked from commit dbb9574aa2)
2020-04-28 16:07:43 +03:00
Nadav Har'El
3e6a8ba5bd test/alternator: increase timeout on Scylla boot
The Alternator test boots Scylla to test against it. We set an arbitrary
timeout for this boot to succeed: 100 seconds. This 100 seconds is
significantly more than the 25 seconds it takes on my laptop, and I thought
we'd never reach it. But it turns out that in some setups - running the
very slow debug build on slow and overcommitted nodes - 100 seconds is
not enough.

So this patch doubles the timeout to 200 seconds.

Note that this "200 seconds" is just a timeout, and doesn't affect normal
runs: Both a successful boot and a failed boot are recognized as soon as
they happen, and we never unnecessarily wait the entire 200 seconds.

Fixes #6271.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200422193920.17079-1-nyh@scylladb.com>
(cherry picked from commit 92e36c5df5)
2020-04-28 16:04:12 +03:00
Nadav Har'El
5f1785b9cf alternator: use RF=3 even if some nodes are temporarily down
Alternator is supposed to use RF=3 for new tables. Only when the cluster is
smaller than 3 nodes do we use RF=1 (and warn about it) - this is useful for
testing.

However, our implementation incorrectly tested the number of *live* nodes in
the cluster instead of the total number of nodes. As a result, if a 3-node
cluster had one node down, and a new table was created, it was created with
RF=1, and immediately could not be written because when RF=1, any node down
means part of the data is unavailable.

This patch fixes this: The total number of nodes in the cluster - not the
number of live nodes - is consulted. The three-node-cluster-with-a-dead-node
setup above creates the table with RF=3, and it can be written because two
living nodes out of three are enough when RF=3 and we do quorum writes and
reads.

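The reasoning above can be checked with a small sketch. The function names choose_rf and quorum are hypothetical; the three-node threshold is the one described in this message, and the quorum formula is the usual majority of RF:

```cpp
#include <cstddef>

// RF is chosen from the total cluster size, not the live node count.
constexpr std::size_t choose_rf(std::size_t total_nodes) {
    return total_nodes >= 3 ? 3 : 1;
}

// A quorum is a majority of the replicas.
constexpr std::size_t quorum(std::size_t rf) {
    return rf / 2 + 1;
}

// Quorum reads and writes succeed if enough replicas are alive.
constexpr bool quorum_available(std::size_t rf, std::size_t live_replicas) {
    return live_replicas >= quorum(rf);
}
```

A 3-node cluster with one node down gets RF=3, and the two live replicas still satisfy the quorum of 2.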
We have a dtest to reproduce this bug (and its fix), and it's also easy to
reproduce manually by starting a 3-node cluster, killing one of the nodes,
and then running "pytests". Before this patch, the tests can create tables
but then fail to write to them. After this patch, the tests succeed on the
same cluster with the dead node.

Fixes #6267

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200422182035.15106-2-nyh@scylladb.com>
(cherry picked from commit 1f75efb556)
2020-04-28 15:52:06 +03:00
Nadav Har'El
e1fd6cf989 gossiper: add convenience function for getting number of nodes
The gossiper has convenience functions get_up_endpoint_count() and
get_down_endpoint_count(), but strangely no function to get the total
number. Even though it's easy to calculate the total by summing up their
results, it is inefficient and also inconvenient because each of these
functions returns a future.

So let's add another function, get_all_endpoint_count(), to get the
total number of nodes. We will use this function in the next patch.

Signed-off-by: Nadav Har'El <n...@scylladb.com>
Message-Id: <20200422182035.15106-1-nyh@scylladb.com>
(cherry picked from commit 08c39bde1a)
2020-04-28 15:51:37 +03:00
Piotr Sarna
b7328ff1e4 alternator: implement ScanIndexForward
The ScanIndexForward parameter is now fully implemented
and can accept ScanIndexForward=false in order to query
the partitions in reverse clustering order.
Note that reading partition slices in reverse order is less
efficient than forward scans and may put a strain on memory
usage, especially for large partitions, since the whole partition
is currently fetched in order to be reversed.

Fixes #5153

(cherry picked from commit 09e4f3b917)
2020-04-28 15:30:01 +03:00
Avi Kivity
602ed43ac7 Update seastar submodule
* seastar 76260705ef...251bc8f25d (1):
  > http server: fix "Date" header format

Fixes #6253.
2020-04-26 19:30:08 +03:00
Tomasz Grabiec
c42c91c5bb Merge "Drop only learnt value on PRUNE" from Gleb
It is unsafe to remove the entire row, so only drop the learnt value from
the system.paxos table.

Fixes: #6154
(cherry picked from commit e648e314e5)
2020-04-21 18:30:12 +03:00
Avi Kivity
cf017b320a test: alternator: configure scylla for test environment in terms of cpu and disk
Currently, the alternator tests configure scylla to use all the
logical cores in the host system, but only 1GB of RAM. This can lead
to a small amount of memory per core.

It also uses the default disk configuration, which is safe, but can be
very slow on mechanical or non-enterprise disks.

Change to use a fixed --smp 2 configuration, and add --overprovisioned
for maximum flexibility (no spinning). Use --unsafe-bypass-fsync
for faster performance on non-enterprise or mechanical disks, assuming
that the test data is not important.

Fixes #6251.
Message-Id: <20200420154112.123386-1-avi@scylladb.com>

(cherry picked from commit 2482e53de9)
2020-04-21 18:25:28 +03:00
Hagit Segev
89e79023ae release: prepare for 4.0.rc2 2020-04-21 16:26:09 +03:00
Nadav Har'El
bc67da1a21 alternator-test: comment out an error-path test that doesn't work on newer boto3
Unfortunately, the boto3 library doesn't allow us to check some of the
input error cases because it unnecessarily tests its input instead of
just passing it to Alternator and allowing Alternator to report the error.
In this patch we comment out a test case which used to work fine - i.e.,
the error was reported by Alternator - until recent changes to boto3
made it catch the problem without passing it to Alternator :-(

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200330190521.19526-2-nyh@scylladb.com>
(cherry picked from commit fe6cecb26d)
2020-04-21 07:19:54 +02:00
Botond Dénes
0c7643f1fe schema: schema(): use std::stable_sort() to sort key columns
When multiple key columns (clustering or partition) are passed to
the schema constructor, all having the same column id, the expectation
is that these columns will retain the order in which they were passed to
`schema_builder::with_column()`. Currently however this is not
guaranteed as the schema constructor sorts key columns by column id with
`std::sort()`, which doesn't guarantee that equally comparing elements
retain their order. This can be an issue for indexes, the schemas of
which are built independently on each node. If there is any room for
variance in the key column order, this can result in different
nodes having incompatible schemas for the same index.
The fix is to use `std::stable_sort()` which guarantees that the order
of equally comparing elements won't change.

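A small sketch of the guarantee being relied on: columns that share an id keep their insertion order under std::stable_sort. The column struct and sorted_names helper are hypothetical stand-ins, not the schema classes:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Stand-in for a key column: several columns may share the same id.
struct column {
    int id;
    std::string name;
};

// Sorts by id only; columns with equal ids keep their original
// relative order because std::stable_sort is used.
std::vector<std::string> sorted_names(std::vector<column> cols) {
    std::stable_sort(cols.begin(), cols.end(),
                     [](const column& a, const column& b) { return a.id < b.id; });
    std::vector<std::string> names;
    for (const auto& c : cols) {
        names.push_back(c.name);
    }
    return names;
}
```

With std::sort() the same comparator would be allowed to return equal-id columns in any order, which is the source of the variance described above.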
This is a suspected cause of #5856, although we don't have hard proof.

Fixes: #5856
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
[avi: upgraded "Refs" to "Fixes", since we saw that std::sort() becomes
      unstable at 17 elements, and the failing schema had a
      clustering key with 23 elements]
Message-Id: <20200417121848.1456817-1-bdenes@scylladb.com>
(cherry picked from commit a4aa753f0f)
2020-04-19 18:18:45 +03:00
Rafael Ávila de Espíndola
c563234f40 dht: Use get_random_number<uint64_t> instead of int64_t in token::get_random_token
I bisected the opposite change in
9c202b52da as the cause of issue 6193. I
don't know why. Maybe get_random_number<signed_type> is buggy?

In any case, reverting to uint64_t solves the issue.

Fixes #6193

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200418001611.440733-1-espindola@scylladb.com>
(cherry picked from commit f3fd466156)
2020-04-19 16:20:40 +03:00
Nadav Har'El
77b7a48a02 alternator: remove mentions of experimental status of LWT
Since commit 9948f548a5, the LWT no longer
requires an "experimental" flag, so Alternator documents and scripts
which referred to the need for enabling experimental LWT, are fixed here
to no longer do that.

Fixes #6118.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200405143237.12693-1-nyh@scylladb.com>
(cherry picked from commit d9d50362af)
2020-04-19 15:10:32 +03:00
Piotr Sarna
b2b1bfb159 alternator: fix failure on incorrect table name with no indexes
If a table name is not found, it may still exist as a local index,
but the check tried to fetch a local index name regardless of whether it was
present in the request, which was a nullptr dereference bug.

Fixes #6161
Tests: alternator-test(local, remote)
Message-Id: <428c21e94f6c9e450b1766943677613bd46cbc68.1586347130.git.sarna@scylladb.com>

(cherry picked from commit 123edfc10c)
2020-04-19 15:07:25 +03:00
Nadav Har'El
d72cbe37aa docs/alternator/alternator.md: fix typos
Fix a couple of typos in the Alternator documentation.
Fixes scylladb/scylla-doc-issues#280
Fixes scylladb/scylla-doc-issues#281

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200419091900.23030-1-nyh@scylladb.com>
(cherry picked from commit 7e7c688946)
2020-04-19 15:03:22 +03:00
Nadav Har'El
9f7b560771 docs, alternator: alternator.md cleanup
Clean up the alternator.md document, by:

* Updating out-of-date information that outstayed its welcome.
* When Scylla does have a feature but it's just not supported via the
  DynamoDB API (e.g., CDC and on-demand backups) mention that.
* Remove mention of Alternator being experimental and users should not
  store important data on it :-)
* Miscellaneous cleanups.

Fixes #6179.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200412094641.27186-1-nyh@scylladb.com>
(cherry picked from commit 606ae0744c)
2020-04-19 15:00:53 +03:00
Nadav Har'El
06af9c028c alternator-test: make Alternator tests runnable from test.py
To make the tests in alternator-test runnable by test.py, we need to
move the directory alternator-test/ to test/alternator, because test.py
only looks for tests in subdirectories of test/. Then, we need to create
a test/alternator/suite.yaml saying that this test directory is of type
"Run", i.e., it has a single run script "run" which runs all its tests.

The "run" script had to be slightly modified to be aware of its new
location relative to the source directory.

To run the Alternator tests from test.py, do:

	./test.py --mode dev alternator

Note that in this version, the "--mode" has no effect - test/alternator/run
always runs the latest compiled Scylla, regardless of the chosen mode.

The Alternator tests can still be run manually and individually against
a running Scylla or DynamoDB as before - just go to the test/alternator
directory (instead of alternator-test previously) and run "pytest" with
the desired parameters.

Fixes #6046

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 4e2bf28b84)
2020-04-19 11:19:15 +03:00
Nadav Har'El
c74ab3ae80 test.py: add xunit XML output file for "Run" tests
test.py now assumes that "Run" tests can take the --junit-xml=<path> option, and
passes it to ask the test to generate an XML summary of the run to a file
like testlog/dev/xml/run.1.xunit.xml.

This option is honored by the Alternator tests.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 0cccb5a630)
2020-04-19 11:19:06 +03:00
Nadav Har'El
32cd3a070a test.py: add new test type "Run"
This patch adds a new test type, "Run". A test subdirectory of type "Run"
has a script called "run" which is expected to run all the tests in that
directory.

This will be used, in the next patch, by the Alternator functional tests.
These tests indeed have a "run" script, which runs Scylla and then runs
*all* of Alternator's tests, finishing fairly quickly (in less than a
minute). All of that will become one test.py test.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 0ae3136900)
2020-04-19 11:18:01 +03:00
Nadav Har'El
bb1554f09e test.py: flag for aborting tests with SIGTERM, not SIGKILL
Today, if test.py is interrupted with SIGINT or SIGTERM, the ongoing test
is killed with SIGKILL. Some types of tests - such as Alternator's test -
may depend on being killed politely (e.g., with SIGTERM) to clean up
files.

We cannot yet change the signal to SIGTERM for all tests, because Seastar
tests often don't deal well with signals, but we can at least add a flag
that certain test types - that know they can be killed gently - will use.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 36e44972f1)
2020-04-19 11:17:51 +03:00
Nadav Har'El
2037d7550e alternator-test: change "run" script to pick random IP address
Before this patch, the Alternator tests "run" script ran Scylla on a fixed
listening address, 127.0.0.1. There is a problem that there might be other
concurrent runs of Scylla using the same IP address - e.g., CCM (used by
dtest) uses exactly this IP address for its first node.

Luckily, Linux's loopback device actually allows us to pick any of over
a million addresses in 127.0.0.0/8 to listen on - we don't need to use
127.0.0.1 specifically. So the code in this patch picks an address in
127.1.*.*, so it cannot collide with CCM (which uses 127.0.0.* for up to
255 nodes). Moreover, the last two bytes of the listen address are picked
based on the process ID of the run script; this allows multiple copies
of this script to run concurrently - in case anybody wishes to do that.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 24fcc0c0ff)
2020-04-19 11:17:39 +03:00
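One way to derive an address in 127.1.*.* from the process ID, as commit 2037d7550e describes, is sketched below. The exact mapping from PID to the last two bytes is an assumption; the commit only says those bytes are based on the PID.

```python
def pick_listen_address(pid: int) -> str:
    """Map a process ID to a loopback address in 127.1.*.*.

    Assumed mapping: the low 16 bits of the PID become the last two
    bytes of the address. PIDs that differ only above bit 15 collide.
    """
    third = (pid >> 8) & 0xFF
    fourth = pid & 0xFF
    return f"127.1.{third}.{fourth}"
```

A caller would typically pass os.getpid(). Because the second byte is fixed at 1, the result never falls in the 127.0.0.* range that CCM uses.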
Nadav Har'El
c320c3f6da install-dependencies.sh: add dependencies for Alternator tests
To run Alternator tests, only two additional dependencies need to be added to
install-dependencies.sh: pytest, and python3-boto3. We also need
python3-cassandra-driver, but this dependency is already listed.

This patch only updates the dependencies for Fedora, which is what we need
for dbuild and our Jenkins setups.

Tested by building a new dbuild docker image and verifying that the
Alternator tests pass.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
[avi: update toolchain image; note this upgrades gcc to 9.3.1]
Message-Id: <20200330181128.18582-1-nyh@scylladb.com>
(cherry picked from commit 8627ae42a6)
2020-04-19 11:17:07 +03:00
Nadav Har'El
0ed70944aa alternator-test: run: use the Python driver, not cqlsh
The "run" script for the Alternator tests needs to set a system table for
authentication credentials, so we can test this feature.
So far we did this with cqlsh, but cqlsh isn't always installed on build
machines. But install-dependencies.sh already installs the Cassandra driver
for Python, so it makes more sense to use that, so this patch switches to
use it.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200331131522.28056-1-nyh@scylladb.com>
(cherry picked from commit 55f02c00f2)
2020-04-19 11:16:54 +03:00
Nadav Har'El
89f860d409 alternator-test: add "--url" option to choose Alternator's URL
The "--aws" and "--local" test options choose between two useful default
URLs - Amazon's, or http://localhost:8000 for a local installation.
However, sometimes one wants to run Scylla on a different IP address or
port, so in this patch we add a "--url" option to choose a specific URL to
connect to. For example, "--url http://127.1.2.3:1234".

We will later use this option in the alternator-test/run script, to pick
a random IP address on which to run Scylla, and then run the test against
this address.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 1aec4baa51)
2020-04-19 11:13:13 +03:00
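How the three options in commit 89f860d409 might combine into a single endpoint is sketched below. The function name and the precedence order (an explicit URL wins over "--aws", which wins over the local default) are assumptions, not taken from the test suite.

```python
from typing import Optional

# Default for a local installation, as stated in the commit message.
LOCAL_URL = "http://localhost:8000"


def endpoint_url(aws: bool = False, url: Optional[str] = None) -> Optional[str]:
    """Resolve the endpoint to test against (hypothetical helper).

    Returns None for AWS, meaning "let the client library use Amazon's
    default endpoint".
    """
    if url is not None:
        return url
    if aws:
        return None
    return LOCAL_URL
```

With this shape, the run script can pass "--url http://127.1.2.3:1234" without touching the existing "--aws" and "--local" behaviour.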
Piotr Sarna
0819d221f4 test: add cases for empty paging state for index queries
In order to check regressions related to #6136 and similar issues,
test cases for handling paging state with empty partition/clustering
key pair are added.

(cherry picked from commit 88913e9d44)
2020-04-19 10:35:26 +03:00
Piotr Sarna
53f47d4e67 cql3: fix generating base keys from empty index paging state
An empty partition/clustering key pair is a valid state of the
query paging state. Unfortunately, recent attempts at debugging
a flaky test resulted in introducing an assertion which breaks
when trying to generate a key from such a pair.
In order to keep the assertion (since it still makes sense in its
scope), but at the same time translate empty keys properly,
empty keys are now explicitly processed at the beginning of the
function.
This behaviour was 100% reproducible in a secondary index dtest below.

Fixes #6134
Refs #5856
Tests: unit(dev),
       dtest(TestSecondaryIndexes.test_truncate_base)

(cherry picked from commit 45751ee24f)
2020-04-19 10:35:09 +03:00
Kamil Braun
21ad12669a sstables: freeze types nested in collection types in legacy sstables
Some legacy `mc` SSTables (created in Scylla 3.0) may contain incorrect
serialization headers, which don't wrap frozen UDTs nested inside collections
with the FrozenType<...> tag. When reading such an SSTable,
Scylla would detect a mismatch between the schema saved in schema
tables (which correctly wraps UDTs in the FrozenType<...> tag) and the schema
from the serialization header (which doesn't have these tags).

SSTables created in Scylla versions 3.1 and above, in particular in
Scylla versions that contain this commit, create correct serialization
headers (which wrap UDTs in the FrozenType<...> tag).

This commit does two things:
1. for all SSTables created after this commit, include a new feature
   flag, CorrectUDTsInCollections, presence of which implies that frozen
   UDTs inside collections have the FrozenType<...> tag.
2. when reading a Scylla SSTable without the feature flag, we assume that UDTs
   nested inside collections are always frozen, even if they don't have
   the tag. This assumption is safe to be made, because at the time of
   this commit, Scylla does not allow non-frozen (multi-cell) types inside
   collections or UDTs, and because of point 1 above.

There is one edge case not covered: if we don't know whether the SSTable
comes from Scylla or from C*. In that case we won't make the assumption
described in 2. Therefore, if we get a mismatch between schema and
serialization headers of a table which we couldn't confirm to come from
Scylla, we will still reject the table. If any user encounters such an
issue (unlikely), we will have to use another solution, e.g. using a
separate tool to rewrite the SSTable.

Fixes #6130.

(cherry picked from commit 3d811e2f95)
2020-04-17 09:11:53 +03:00
Kamil Braun
c812359383 sstables: move definition of column_translation::state::build to a .cc file
Ref #6130
2020-04-17 09:11:38 +03:00
Piotr Sarna
1bd79705fb alternator: use partition tombstone if there's no clustering key
As @tgrabiec helpfully pointed out, creating a row tombstone
for a table which does not have a clustering key in its schema
creates something that looks like an open-ended range tombstone.
That's problematic for KA/LA sstable formats, which are incapable
of writing such tombstones, so a workaround is provided
in order to allow using KA/LA in alternator.

Fixes #6035

(cherry picked from commit 0a2d7addc0)
2020-04-16 12:01:51 +03:00
Avi Kivity
7e2ef386cc Update seastar submodule
* seastar 92c488706...76260705e (1):
  > rpc: always shutdown socket when stopping a client

Fixes #6060.
2020-04-16 10:56:31 +03:00
Avi Kivity
51bad7e72c Point seastar submodule at scylla-seastar.git branch-4.0
This allows us to backport seastar patches to Scylla 4.0.
2020-04-16 10:10:40 +03:00
Asias He
0379d0c031 repair: Send reason for node operations
Since 956b092012 (Merge "Repair based node
operation" from Asias), repair is used by other node operations like
bootstrap, decommission and so on.

Send the reason for the repair, so that we can handle the materialized
view update correctly according to the reason of the operation. We want
to trigger the view update only if the repair is used by repair
operation. Otherwise, the view table will be handled twice, 1) when the
view table is synced using repair 2) when the base table is synced using
repair and view table update is triggered.

Fixes #5930
Fixes #5998

(cherry picked from commit 066934f7c4)
2020-04-16 10:06:17 +03:00
Gleb Natapov
a8ef820f27 lwt: fix cas_now_pruning counter
Due to a copy-and-paste error, the cas_now_pruning counter is increased
instead of decreased after an operation completes. Fix it.

Fixes #6116

Message-Id: <20200401142859.GA16953@scylladb.com>
(cherry picked from commit 4d9d226596)
2020-04-06 13:06:11 +02:00
Yaron Kaikov
9908f009a4 release: prepare for 4.0.rc1 2020-04-06 10:22:45 +03:00
Pavel Emelyanov
48d8a075b4 main: Do not destroy token_metadata
The storage_proxy instances hold references to token_metadata ones and
leave unwaited futures continuing to its query_partition_key_range_concurrent
method.

The latter is called from do_query so it's not that easy to find
out who is leaking. Keep the tokens not freed for a while.

Fixes: #6093
Test: manual start-stop

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200402183538.9674-1-xemul@scylladb.com>
(cherry picked from commit 86296ba557)
2020-04-05 13:47:57 +03:00
Konstantin Osipov
e3ddd607bc lwt: remove Paxos from experimental list
Always enable lightweight transactions. Remove the check for the command
line switch from the feature service, assuming LWT is always enabled.

Remove the check for LWT from Alternator.

Note that in order for the cluster to work with LWT, all nodes need
to support it.

Rename LWT to UNUSED in db/config.hh, to keep accepting lwt keyword in
--experimental-features command line option, but do nothing with it.

Changes in v2:
* remove enable_lwt feature flag, it's always there

Closes #6102

test: unit (dev, debug)
Message-Id: <20200401071149.41921-1-kostja@scylladb.com>
(cherry picked from commit 9948f548a5)
2020-04-05 08:56:42 +03:00
Piotr Jastrzebski
511773d466 token: relax the condition of the sanity check
When we switched token representation to int64_t
we added some sanity checks that byte representation
is always 8 bytes long.

It turns out that for token_kind::before_all_keys and
token_kind::after_all_keys bytes can sometimes be empty
because for those tokens they are just ignored. The check
introduced with the change is too strict and sometimes
throws the exception for tokens before/after all keys
created with empty bytes.

This patch relaxes the condition of the check and always
uses 0 as value of _data for special before/after all keys
tokens.

Fixes #6131

Tests: unit(dev, sct)

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
(cherry picked from commit a15b32c9d9)
2020-04-04 20:19:10 +03:00
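The relaxed check from commit 511773d466 can be illustrated with the sketch below. It is a Python analogue, not the C++ implementation: the enum, the function name and the big-endian byte order are assumptions.

```python
import struct
from enum import Enum


class TokenKind(Enum):
    BEFORE_ALL_KEYS = 0
    KEY = 1
    AFTER_ALL_KEYS = 2


def token_data(kind: TokenKind, raw: bytes) -> int:
    """Return the int64 value stored for a token.

    The special before/after-all-keys tokens ignore their bytes, so empty
    input is accepted and 0 is always stored. Only ordinary key tokens
    must be exactly 8 bytes long.
    """
    if kind is not TokenKind.KEY:
        return 0
    if len(raw) != 8:
        raise ValueError(f"expected 8 bytes for a key token, got {len(raw)}")
    # Byte order is an assumption made for this sketch.
    return struct.unpack(">q", raw)[0]
```

The stricter check that this commit relaxes would correspond to running the length test before looking at the token kind.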
Gleb Natapov
121cd383fa lwt: remove entries from system.paxos table after successful learn stage
The learning stage of PAXOS protocol leaves behind an entry in
system.paxos table with the last learned value (which can be large). In
case not all participants learned it successfully, the next round on the
same key may complete the learning using this info. But if all nodes
learned the value, the entry no longer serves a useful purpose.

The patch adds another round, "prune", which is executed in background
(limited to 1000 simultaneous instances) and removes the entry in
case all nodes replied successfully to the "learn" round.  It uses the
ballot's timestamp to do the deletion, so as not to interfere with the
next round. Since deletion happens very close to previous writes it will
likely happen in memtable and will never reach sstable, so that reduces
memtable flush and compaction overhead.

Fixes #5779

Message-Id: <20200330154853.GA31074@scylladb.com>
(cherry picked from commit 8a408ac5a8)
2020-04-02 15:36:52 +02:00
Gleb Natapov
90639f48e5 lwt: rename "in_progress_ballot" cell to "promise" in system.paxos table
The value that is stored in "in_progress_ballot" cell is the value of
promised ballot, so call the cell accordingly to avoid confusion
especially as we have a notion of "in progress" proposal in the code
which is not the same as in_progress_ballot here.

We can still do it without care about backwards compatibility since LWT
is still marked as experimental.

Fixes #6087.

Message-Id: <20200326095758.GA10219@scylladb.com>
(cherry picked from commit b3db6f5b04)
2020-04-02 15:36:49 +02:00
Calle Wilund
8d029a04aa db::commitlog: Don't write trailing zero block unless needed
Fixes #5899

When terminating (closing) a segment, we write a trailing block
of zero so reader can have an empty region after last used chunk
as end marker. This is due to using recycled, pre-allocated
segments with potentially non-zero data extending over the point
where we are ending the segment (i.e. we are not fully filling
the segment due to a huge mutation or similar).

However, if we reach end of segment writing the final block
(typically many small mutations), the file will end naturally
after the data written, and any trailing zero block would in fact
just extend the file further. While this will only happen once per
segment recycled (independent of how many times it is recycled),
it is still both slightly breaking the disk usage contract and
also potentially causing some disk stalls due to metadata changes
(though of course very infrequent).

We should only write the trailing zero block if the file is below the
max_size file size when the segment is terminated.

Adds a small size check to commitlog test to verify size bounds.
(Which breaks without the patch)

v2:
- Fix test to take into account that files might be deleted
  behind our backs.
v3:
- Fix test better, by doing verification _before_ segments are
  queued for delete.

Message-Id: <20200226121601.15347-2-calle@scylladb.com>
Message-Id: <20200324100235.23982-1-calle@scylladb.com>
(cherry picked from commit 9fee712d62)
2020-03-31 14:22:20 +03:00
Asias He
67995db899 gossip: Add an option to force gossip generation
Consider 3 nodes in the cluster, n1, n2, n3 with gossip generation
number g1, g2, g3.

n1, n2, n3 running scylla version with commit
0a52ecb6df (gossip: Fix max generation
drift measure)

One year later, the user wants to upgrade n1, n2, n3 to a new version.

When n3 does a rolling restart with the new version, n3 will use a
generation number g3'. Because g3' - g2 > MAX_GENERATION_DIFFERENCE and
g3' - g1 > MAX_GENERATION_DIFFERENCE, n1 and n2 will reject n3's
gossip update and mark n3 as down.

Such unnecessary marking of node down can cause availability issues.
For example:

DC1: n1, n2
DC2: n3, n4

When n3 and n4 restart, n1 and n2 will mark n3 and n4 as down, which
causes the whole DC2 to be unavailable.

To fix, we can start the node with a gossip generation within
MAX_GENERATION_DIFFERENCE difference for the new node.

Once all the nodes run the version with commit
0a52ecb6df, the option is no longer
needed.

Fixes #5164

(cherry picked from commit 743b529c2b)
2020-03-30 12:36:20 +02:00
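The rejection rule that commit 67995db899 works around can be written as a small check. This sketch follows the inequality quoted in the commit message. The one-year constant and the function name are assumptions for illustration.

```python
# Assumed value: one year, in seconds. The commit message names the
# constant but does not state its value.
MAX_GENERATION_DIFFERENCE = 86400 * 365


def accepts_generation(known_generation: int, new_generation: int,
                       max_diff: int = MAX_GENERATION_DIFFERENCE) -> bool:
    """Return True if a peer would accept the new generation.

    Mirrors the condition in the commit message: an update is rejected
    when new_generation - known_generation > max_diff.
    """
    return new_generation - known_generation <= max_diff
```

For example, if peers know a generation g and the restarted node comes up 400 days later with a time-based generation, accepts_generation returns False. Starting the node with a forced generation inside the window makes the same check pass.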
Yaron Kaikov
282cd0df7c dist/docker: Update SCYLLA_REPO_URL and VERSION defaults
Update the SCYLLA_REPO_URL and VERSION defaults to point to the latest
unstable 4.0 version. This will be used if someone runs "docker build"
locally. For the releases, the release pipelines will pass the stable
version repository URL and a specific release version.
2020-03-26 09:54:44 +02:00
Nadav Har'El
ce58994d30 sstable: default to LA format instead of KA format
Over the years, Scylla updated the sstable format from the KA format to
the LA format, and most recently to the MC format. On a mixed cluster -
as occurs during a rolling upgrade - we want all the nodes, even new ones,
to write sstables in the format preferred by the old version. The thinking
is that if the upgrade fails, and we want to downgrade all nodes back to
the older version, we don't want to lose data because we already have
too-new sstables.

So the current code starts by selecting the oldest format we ever had - KA,
and only switching this choice to LA and MC after we verify that all the
nodes in the cluster support these newer formats.

But before an agreement is reached on the new format, sstables may already
be created in the antique KA format. This is usually harmless - we can
read this format just fine. However, the KA format has a problem that it is
unable to represent table names or keyspaces with the "-" character in them,
because this character is used to separate the keyspace and table names in
the file name. For CQL, a "-" is not allowed anyway in keyspace or table
names; But for Alternator, this character is allowed - and if a KA table
happens to be created by accident (before the LA or MC formats are chosen),
it cannot be read again during boot, and Scylla cannot reboot.

The solution that this patch takes is to change Scylla's default sstable
format to LA (and, as before, if the entire cluster agrees, the newer MC
format will be used). From now on, new KA tables will never be written.
But we still fully support *reading* the KA format - this is important in
case some very old sstables never underwent compaction.

The old code had, confusingly, two places where the default KA format
was chosen. This patch fixes it so the new default (LA) is specified in
only one place.

Fixes #6071.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200324232607.4215-2-nyh@scylladb.com>
(cherry picked from commit 91aba40114)
2020-03-25 13:27:51 +01:00
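The KA naming problem from commit ce58994d30 can be reproduced with a naive parser. The file name layout used here (keyspace-table-ka-generation-component) is an assumption based on the commit's statement that "-" separates the keyspace and table names. Both functions are hypothetical and are not Scylla's parser.

```python
from typing import Tuple


def ka_filename(keyspace: str, table: str, generation: int,
                component: str = "Data.db") -> str:
    """Build a KA-style sstable file name (assumed layout)."""
    return f"{keyspace}-{table}-ka-{generation}-{component}"


def parse_ka_filename(name: str) -> Tuple[str, str, int, str]:
    """Split a KA-style name on '-' and reject anything ambiguous."""
    parts = name.split("-")
    if len(parts) != 5 or parts[2] != "ka":
        raise ValueError(f"cannot parse KA sstable name: {name!r}")
    keyspace, table, _, generation, component = parts
    return keyspace, table, int(generation), component
```

A name such as ka_filename("ks", "tbl", 1) parses cleanly, while a table called "my-table" produces extra fields and the parse fails. This is the kind of failure on boot that the commit avoids by never writing KA files.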
Yaron Kaikov
78f5afec30 release: prepare for 4.0.rc0 2020-03-24 23:33:23 +02:00
Nadav Har'El
f1aaa91e21 merge: add metrics
Merged pull request https://github.com/scylladb/scylla/pull/6030 from
Piotr Dulikowski:

Adds CDC-related metrics.

The following counters are added, both for total and failed operations:

    Total number of CDC operations that did/did not perform splitting.
    Total number of CDC operations that touched a particular mutation part.
    Total number of preimage selects.

Fixes #6002.
Tests: unit(dev, debug)

* 'cdc-metrics' of github.com:piodul/scylla:
  storage_proxy: track CDC operations in LWT flow
  storage_proxy: track CDC operations in logged batches
  storage_proxy: track CDC operations in standard flow
  storage_proxy: add cdc tracker hooks to write response handlers
  storage_proxy: move "else if" remainder into "else" block
  cdc: create an operation_result_tracker object
  cdc: add an object for tracking progress of cdc mutations
  cdc: count touched mutation parts in transformer::transform
  cdc: track preimage selects in metrics
  cdc: register metric counters
  cdc: fix non-atomic updates in splitting
2020-03-23 21:55:58 +02:00
Botond Dénes
ec36c7cb2f test: random_schema: remove redundant gc grace period from tombstone expiry
Compaction automatically adds gc grace period to expiry times already,
no need to add it when creating the tombstones. Remove the redundant
additions from the code. The direct impact is really minor as this is
only used in tests, but it might confuse readers who are looking at how
tombstones are created across the codebase.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200323120948.92104-1-bdenes@scylladb.com>
2020-03-23 15:12:25 +02:00
Piotr Dulikowski
736c1c6056 storage_proxy: track CDC operations in LWT flow
Register cdc operation result tracker during LWT flow.
2020-03-23 14:05:25 +01:00
Piotr Dulikowski
f7fd6f4607 storage_proxy: track CDC operations in logged batches
Register cdc operation result tracker in logged batch flow.
2020-03-23 14:05:25 +01:00
Piotr Dulikowski
ef1c62aa04 storage_proxy: track CDC operations in standard flow
Register cdc operation result tracker for write response handlers
coming from the usual write requests.
2020-03-23 14:05:25 +01:00
Piotr Dulikowski
cccc33f0fd storage_proxy: add cdc tracker hooks to write response handlers
Adds a field to abstract_write_response_handler that points to the cdc
operation result tracker, and a function for registering the tracker in
the handlers that currently write to a CDC log table.
2020-03-23 14:05:25 +01:00
Piotr Dulikowski
dc05d30fd3 storage_proxy: move "else if" remainder into "else" block
In the following commit, more code will be added to the newly created
"else" block.
2020-03-23 14:05:25 +01:00
Piotr Dulikowski
5a5cc57878 cdc: create an operation_result_tracker object
An `operation_result_tracker` object is now returned as a second return
value from the `augment_mutation_call` function.
2020-03-23 14:05:25 +01:00
Piotr Dulikowski
1b92cbeabe cdc: add an object for tracking progress of cdc mutations
CDC metrics, apart from tracking "total" metrics for all performed CDC
operations, also track metrics for "failed" operations. Because the
result of the CDC operation depends on whether all CDC mutations were
written successfully by storage_proxy, checking for failure and
incrementing appropriate counters is deferred after all write response
handlers finish.

The `cdc::operation_result_tracker` object was created for that purpose.
It contains all the details needed to accurately update the metrics
based on what actually happened in the `augment_mutation_call` function,
and holds a flag which tells if any of write response handlers failed.
This object is supposed to be referenced by write response handlers for
CDC mutations created after the same `augment_mutation_call`. After all
write response handlers are destroyed, the destructor of
`operation_result_tracker` will update appropriate metrics.

Actual creating and attaching this object to write response handlers
will be done in subsequent commits.
2020-03-23 14:05:25 +01:00
Piotr Dulikowski
98e5fdc7ac cdc: count touched mutation parts in transformer::transform
Modifies the transformer::transform so that it also returns a set of
flags indicating what parts of the mutation (e.g. rows, tombstones,
collections, etc.) were processed during transforming.
2020-03-23 14:05:25 +01:00
Piotr Dulikowski
53570d8657 cdc: track preimage selects in metrics
This commit causes preimage select counter to be increased after
performing this operation.
2020-03-23 14:05:25 +01:00
Piotr Dulikowski
e7062de02b cdc: register metric counters
This patch defines a CDC metrics object and registers all of its
counters.

storage_proxy is chosen as the owner of the metrics object. Because in
subsequent commits it will become possible for CDC metrics to be updated
after a write operation ends, and because the cdc_service has shorter
lifetime than storage_proxy, we could risk a use-after-free if we placed
this object inside cdc_service.
2020-03-23 14:05:25 +01:00
Piotr Dulikowski
338e473946 cdc: fix non-atomic updates in splitting
This patch fixes a bug in mutation splitting logic of CDC. In the part
that handles updates of non-atomic clustering columns, the column
definition was fetched from a static column of the same id instead of
the actual definition of the clustering column. It could cause the value
to be written to a wrong column.

Tests: unit(dev)
2020-03-23 13:47:23 +01:00
Ivan Prisyazhnyy
5ec7e77b2e api: /column_family/major_compaction/{keyspace:table} implementation
This implements support for triggering major compactions through the REST
API. Please note that "split_output" is not supported and Glauber Costa
confirmed that this is fine:

  "We don't support splits, nor do I think we should."

Signed-off-by: Ivan Prisyazhnyy <ivan@scylladb.com>
2020-03-23 13:48:29 +02:00
Avi Kivity
0d885dbb00 Merge "Make all headers standalone" from Botond
"
Make sure all headers compile on their own, without requiring any
additional includes externally.

Even though this requirement is not documented in our coding guides it
is still quasi enforced and we semi-regularly get and merge patches
adding missing includes to headers.

This patch-set fixes all headers and adds a `{mode}-headers` target that
can be used to verify each header. This target should be built by
promotion to ensure no new non-conforming code sneaks in.
Individual headers can be verified using the
`build/dev/path/to/header.hh.o` target, that is generated for every
header.

The majority of the headers was just missing `seastarx.hh`. I think we
should just include this via a compiler flag to remove the noise from
our code (in a followup).
"

* 'compiling-headers/v2' of https://github.com/denesb/scylla:
  configure.py: add {mode}-headers phony target
  treewide: add missing headers and/or forward declarations
  test/boost/sstable_test.hh: move generic stuff to test/lib/sstable_utils.hh
  sstables: size_tiered_backlog_tracker: move methods out-of-line
  sstables: date_tiered_compaction_strategy.hh: move methods out-of-line
2020-03-23 13:09:09 +02:00
Avi Kivity
c6a441f9c2 Update seastar submodule
* seastar 3c498abcab...92c488706c (14):
  > dpdk: restore including reactor.hh
  > tests: distributed_test: add missing #include <mutex>
  > reactor: un-static-ify make_pollfn()
  > merge: Reduce inclusions of reactor.hh
A few #includes added to compensate for this
  > sharded: delete move constructor
  > future: Avoid a move constructor call
  > future: Erase types a bit more in then_wrapped
  > memory: Drop a never nullopt optional
  > semaphore: specify get_units and with_semaphore as noexcept
  > spinlock.hh: Add include for <cassert> header
  > dpdk: Avoid a variable sized array
  > future: Add an explicit promise member to continuation
  > net: remove smart pointer wrappers around pollable_fd
  > Merge "cleanup reactor file functions" from Benny
2020-03-23 11:59:30 +02:00
Piotr Dulikowski
a693e6ff6c cdc: fix non-atomic updates in splitting
This patch fixes a bug in mutation splitting logic of CDC. In the part
that handles updates of non-atomic clustering columns, the schema for
serializing that column was looked up incorrectly in the table schema -
instead of a `regular_column`, a `static_column` was looked up.

Due to how the `column_at` function works, a correct schema was always
returned if the table had no static columns. Therefore, in order for
this bug to manifest, a table with a static column and a regular column
with non-atomic collection was needed.
2020-03-23 10:20:24 +01:00
Piotr Sarna
602a771105 Merge 'utils: error injector API' from Alejo
Closes #3295

The error_injection class allows injecting custom handlers into normal control
flow at the pre-determined injection points.

This is especially useful in various testing scenarios:
 * Throwing an exception at some rare and extreme corner-cases
 * Injecting a delay to test for timeouts to be handled correctly
 * More advanced uses with custom lambda as an injection handler

Injection points are defined by `inject` calls.

Enabling and disabling injections are done by the corresponding
`enable` and `disable` calls.

A REST frontend API is provided for convenience.

Branch URL:  https://github.com/alecco/scylla/tree/as_error_injection

Tests: unit {{dev}}, unit {{debug}}

* 'as_error_injection' of github.com:alecco/scylla:
  api: add error injection to REST API
  utils: add error injection
2020-03-23 08:39:22 +01:00
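The inject/enable/disable model described in merge 602a771105 can be sketched as below. This is a Python analogue for illustration only: the class and method signatures are hypothetical and do not reproduce the C++ error_injection API.

```python
from typing import Callable, Set


class ErrorInjection:
    """Toy registry of named injection points (hypothetical API)."""

    def __init__(self) -> None:
        self._enabled: Set[str] = set()

    def enable(self, name: str) -> None:
        """Turn on the injection with the given name."""
        self._enabled.add(name)

    def disable(self, name: str) -> None:
        """Turn off the injection; a no-op if it was not enabled."""
        self._enabled.discard(name)

    def inject(self, name: str, handler: Callable[[], None]) -> None:
        """Injection point: run handler only if the injection is enabled."""
        if name in self._enabled:
            handler()
```

In normal operation inject() is a cheap set lookup that does nothing. A test enables a name, and the handler at that point then throws, sleeps, or runs any custom callable.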
Botond Dénes
5174acb359 configure.py: add {mode}-headers phony target 2020-03-23 09:29:45 +02:00
Botond Dénes
e0284bb9ee treewide: add missing headers and/or forward declarations 2020-03-23 09:29:45 +02:00
Botond Dénes
575466b2cf test/boost/sstable_test.hh: move generic stuff to test/lib/sstable_utils.hh
sstable_test.hh started as collection of utilities shared between the
various `_sstable_test.cc` files. Predictably other tests started using
it as well, among them some that are non boost unit tests. This poses a
problem as if we add the missing boost/test/unit_test.hpp include to
sstable_test.hh these tests will suddenly have missing symbols from
boost::test. To avoid linking boost::test into all these users, extract
utilities more widely used into sstable_utils.hh
2020-03-23 09:29:45 +02:00
Botond Dénes
84329a16ee sstables: size_tiered_backlog_tracker: move methods out-of-line 2020-03-23 09:29:45 +02:00
Botond Dénes
d58ec632e3 sstables: date_tiered_compaction_strategy.hh: move methods out-of-line 2020-03-23 09:26:19 +02:00
Glauber Costa
dd65f7dcbb tests: move token_generation_for_shard to common code
We now have a utils file for SSTables. This is potentially useful for
other tests.

As a matter of fact, this function is repeated right now for the
resharding test. And to add insult to injury, the version in the
resharding test has the parameters shard and number of tokens flipped,
which although extremely confusing is the predictable outcome of
such repetition

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2020-03-22 19:00:26 +02:00
Asias He
be1a196988 repair: Handle keyspace with zero table
The following error was seen in
materialized_views_test.py:TestMaterializedViews.decommission_node_during_mv_insert_4_nodes_test

INFO [shard 0] repair - repair id 3 to sync data for
keyspace=ks, status=started repair/repair.cc:662:36: runtime error: member call
on null pointer of type 'const struct schema'
Aborting on shard 0.

The problem is that, in the test, a keyspace was created without creating any
table. Since db19a76b1f(selective_token_range_sharder: stop calling
global_partitioner()), in get_partitioner_for_tables, we access nullptr
when no table is present.

	schema_ptr last_s;
	for (auto t: tables) {
	    // set last_s
	}
	last_s->get_partitioner()

To fix:

1) Skip the repair in sync_data_using_repair if there is no table in the keyspace
2) Throw if no schema_ptr is found in get_partitioner_for_tables. Be defensive.

After:

INFO [shard 0] repair - decommission_with_repair: started with keyspace=ks, leaving_node=127.0.0.2, nr_ranges=744
INFO [shard 0] repair - repair id 3 to sync data for keyspace=ks, status=started
WARN [shard 0] repair - repair id 3 to sync data for keyspace=ks, no table in this keyspace
INFO [shard 0] repair - repair id 3 completed successfully
INFO [shard 0] repair - repair id 3 to sync data for keyspace=ks, status=succeeded

Tests: materialized_views_test.py:TestMaterializedViews.decommission_node_during_mv_insert_4_nodes_test
Fixes: #6022
2020-03-22 13:46:36 +02:00
Avi Kivity
d310e7c7ea Merge 'repair: Ignore keyspace that is removed in sync_data_using_repair' from Asias
repair: Ignore keyspace that is removed in sync_data_using_repair

When a keyspace is removed during node operations, we should not fail
the whole operation. Ignore the keyspace that is removed.

Fixes #5942

* asias-repair_fix_5942:
  repair: Stop the nodes that have run repair_row_level_start
  repair: Ignore keyspace that is removed in sync_data_using_repair
2020-03-22 13:19:51 +02:00
Takuya ASADA
005211bad6 redis: add lolwut command
Add lolwut command that shows redis version and ascii art.

see: https://redis.io/commands/lolwut
2020-03-22 13:16:20 +02:00
Takuya ASADA
2ab366e653 install.sh: create user/group correctly on redhat variants
It seems that adduser in redhat variants and debian variants are incompatible,
and there is no addgroup in redhat variants.
Since adduser in install.sh is written for debian variants, it does not work on redhat-compatible systems.

To fix this we need to use 'useradd' / 'groupadd' instead.

Fixes #6018
2020-03-22 13:13:00 +02:00
Avi Kivity
7ed083a6a7 Merge "test.py: Allow to change the tests starting order" from Pavel E
"
In debug mode some tests take veeery looong time to finish,
those tests are better to be started first. This set adds
this by marking such long tests in suite.yaml files.

Tests: unit(dev)
"

* 'br-split-unit-tests-sorting-2' of https://github.com/xemul/scylla:
  test.py: Mark some tests as "run_first"
  test.py: Generate list with short names
  test.py: Rename "long" to "skip_in_debug_mode"
2020-03-21 19:53:23 +02:00
Rafael Ávila de Espíndola
482fbfcfdb build: Use more strict stack frame limits
A recent seastar update has resolved the worst offenders, so we can
lower the limit a bit to warn on the next set of functions.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200317183209.1664860-1-espindola@scylladb.com>
2020-03-21 19:51:57 +02:00
Rafael Ávila de Espíndola
01ac4aef3a everywhere: Use futurize_apply instead of futurize<void>::apply
No functionality change, just simpler.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200318234149.283090-1-espindola@scylladb.com>
2020-03-21 19:51:38 +02:00
Rafael Ávila de Espíndola
0d7281ca06 sstable: Move sstables_manager constructor out of line
There is no reason to have it in a header.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200320005225.178381-1-espindola@scylladb.com>
2020-03-21 19:47:29 +02:00
Piotr Dulikowski
6c5c745e25 cdc: add cdc log schema test 2020-03-21 07:33:35 +01:00
Piotr Dulikowski
3bfb044bf1 cdc: do not create cdc$deleted columns for pks and cks
Primary key and clustering key column should not have a corresponding
"cdc$deleted_<name>" column in cdc log table, because it does not make
sense to delete such a column from a row.

Fixes: #6049
Tests: unit(dev)
2020-03-21 07:33:23 +01:00
Pekka Enberg
6b2cd1bd7d Revert "db::commitlog: Don't write trailing zero block unless needed"
This reverts commit 0b34d88957. According
to Rafael Avila de Espindola:

"I have bisected the recent failures [in commitlog_test] on next to this
 patch."
2020-03-20 22:30:58 +02:00
Pekka Enberg
12b6092ac2 Revert "sstables: Fix incorrect calculation of Compaction Backlog"
This reverts commit 458ef4bb06. According
to Glauber Costa:

"It may give us the illusion that fixes something for a particular case
 but this fix is wrong.

 I am trying to help Raphael figure out why the backlog is wrong but
 this patch is not the answer."
2020-03-20 22:28:57 +02:00
Piotr Sarna
331ddf41e5 api: add error injection to REST API
Simple REST API for error injection is implemented.
The API allows the following operations:
 * injecting an error at given injection name
 * listing injections
 * disabling an injection
 * disabling all injections

 Currently the API enables/disables on all shards.

Closes #3295

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-03-20 20:49:03 +01:00
Pavel Solodovnikov
057adc8b4d utils: add error injection
Error injection class is implemented in order to allow injecting
various errors (exceptions, stalls, etc.) in code for testing
purposes.

Error injection is enabled via compile flag
 SCYLLA_ENABLE_ERROR_INJECTION

TODO: manage shard instances

Enable error injection in debug/dev/sanitize modes.

Unit tests for error injection class.

Closes #3295

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-03-20 19:37:48 +01:00
Rafael Ávila de Espíndola
9445608df6 gms: Add a default constructor to feature_config
Also move it out of line while at it.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200316180321.45914-1-espindola@scylladb.com>
2020-03-20 13:34:26 +01:00
Nadav Har'El
df8b3cd5dc alternator-test: a "run" script
Running the Alternator tests is easy after you manually run Scylla, but
sometimes it's convenient to have a script which just does everything
automatically: start Scylla in a temporary directory, set it up properly
for the tests (especially the authentication), run all the tests, and remove
the temporary directory. This is what this alternator-tests/run script does.

This script can be run by Jenkins, for example, to check all the Alternator
tests. The script assumes some things (including cqlsh, pytest and the boto3
library) are already installed, and that Scylla has been compiled - by
default it takes the latest built build/*/scylla, but this can be overridden
by a command like

    SCYLLA=build/release/scylla alternator-test/run

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200311091918.16170-1-nyh@scylladb.com>
2020-03-19 15:49:46 +01:00
Nadav Har'El
2deba4035a merge: Hook alternator to admission control
Merged patch series from Piotr Sarna:

This series hooks alternator to admission control, similarly to how
CQL server uses it. The estimated memory consumption is set to 2x
raw JSON request, since that seems to be the upper limit of
how much more memory rapidjson allocates during parsing.
Note, that since Seastar HTTP currently reads the whole contents
upfront, there's no easy way to apply admission control before reading
the request - that would involve some changes to our HTTP API.

Note 2: currently, admission control in CQL does not properly pass
memory consumption information for requests that are bounced
to another shard - that would require either transferring semaphore units
between shards or keeping a foreign pointer to the original units.
As a result, alternator also does not pass correct admission control
info between shards, and all places in code which do that are marked
with clear FIXMEs.

Fixes #5029

Piotr Sarna (5):
  storage_service: add memory limiter semaphore getter
  alternator: add service permit to callbacks
  alternator: add memory limiter to alternator server
  alternator: add addmission control stats entry
  alternator: hook admission control to alternator server

 alternator/executor.cc      | 113 ++++++++++++++++++++++--------------
 alternator/executor.hh      |  32 +++++-----
 alternator/rmw_operation.hh |   1 +
 alternator/server.cc        |  83 +++++++++++++++-----------
 alternator/server.hh        |   8 ++-
 alternator/stats.cc         |   2 +
 alternator/stats.hh         |   1 +
 main.cc                     |   3 +-
 service/storage_service.hh  |   4 ++
 9 files changed, 149 insertions(+), 98 deletions(-)
2020-03-19 15:51:17 +02:00
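A minimal sketch of the admission-control idea from the series above, not the actual alternator code. It assumes a simple counting limiter in place of the memory limiter semaphore; the class name `MemoryLimiter` and its methods are hypothetical. Only the 2x estimate comes from the commit message.

```python
class MemoryLimiter:
    """Hypothetical counting limiter standing in for the memory limiter semaphore."""

    def __init__(self, total_bytes):
        self.available = total_bytes

    def try_admit(self, request_size):
        # The series estimates memory consumption as 2x the raw JSON request.
        estimate = 2 * request_size
        if estimate > self.available:
            return None  # not admitted; a real server would wait instead
        self.available -= estimate
        return estimate  # the "permit": units to give back when done

    def release(self, permit):
        self.available += permit


limiter = MemoryLimiter(total_bytes=1000)
first = limiter.try_admit(300)    # reserves 600 units
second = limiter.try_admit(300)   # needs 600 units, only 400 are left
limiter.release(first)
third = limiter.try_admit(300)    # fits again after the release
```

The permit is returned to the caller so that the units are released only once the request has been fully served, which is roughly the role the 'permit' parameter plays in the callbacks.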
Nadav Har'El
7922b9eb8f materialized views: reduce recompilation when db/view/view.hh changes.
Before this patch, when db/view/view.hh was modified, 89 source files had to
be recompiled. After this patch, this number is down to 5.

Most of the irrelevant source files got view.hh by including database.hh,
which included view.hh just for the definition of statistics. So in this
patch we split the view statistics to a separate header file, view_stats.hh,
and database.hh only includes that. A few source files which included
only database.hh and also needed view.hh (for materialized-view related
functions) now need to include view.hh explicitly.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200319121031.540-1-nyh@scylladb.com>
2020-03-19 15:46:14 +02:00
Piotr Dulikowski
59727fb34b cdc: remove result_callback
The `result_callback` was a callback returned by `augment_mutation_call`
that was supposed to be used in the CDC postimage implementation.
Because CDC postimage was implemented without using this callback, and
currently a no-op function is always returned, this callback can safely
be removed.
2020-03-19 14:55:07 +02:00
Pavel Emelyanov
7af3bbd57b test.py: Mark some tests as "run_first"
Those tests take long time to finish, so it makes sense to start
them earlier than others.

The provided list of long tests consists of those running more
than 10 minutes in debug mode.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-03-19 12:52:18 +03:00
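A rough sketch of the ordering described in this commit, not the actual test.py code: tests marked to run first are moved to the front of the start queue and the rest keep their order. The helper name `order_tests` and the sample lists are hypothetical.

```python
def order_tests(tests, run_first):
    """Return tests with the run_first ones at the front.

    Hypothetical helper: `tests` is a list of short test names and
    `run_first` is a collection of names marked as long-running.
    """
    marked = set(run_first)
    # sorted() is stable, so tests keep their relative order within each group.
    return sorted(tests, key=lambda name: name not in marked)


tests = ["cql_query_test", "row_cache_test", "secondary_index_test", "view_schema_test"]
ordered = order_tests(tests, run_first=["secondary_index_test", "view_schema_test"])
print(ordered)
```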
Rafael Ávila de Espíndola
e28b17de88 auth: Make create_metadata_table_if_missing noexcept
It returns a future, so converting an exception to an exceptional
future simplifies error handling in the caller.

Without this code like the one in
standard_role_manager::create_metadata_tables_if_missing has a
surprising behavior:

    return when_all_succeed(
            create_metadata_table_if_missing(...),
            create_metadata_table_if_missing(...));

Since it might not wait for both futures. We could use the lambda
version of when_all_succeed, but changing
create_metadata_table_if_missing seems a nice API improvement.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200317002051.117832-4-espindola@scylladb.com>
2020-03-19 10:22:50 +01:00
Piotr Sarna
0c11e07faf view,table: fix waiting for view updates during building
View updates sent as part of the view building process should never
be ignored, but fd49fd7 introduced a bug which may cause exactly that:
the updates are mistakenly sent to background, so the view builder
will not receive negative feedback if an update failed, which will
in turn not cause a retry. Consequently, view building may report
that it "finished" building a view, while some of the updates were
lost. A simple fix is to restore previous behaviour - all updates
triggered by view building are now waited for.

Fixes #6038
Tests: unit(dev),
dtest: interrupt_build_process_with_resharding_low_to_half_test
2020-03-19 10:50:54 +02:00
Pavel Emelyanov
59bc116695 test.py: Generate list with short names
The list will be sorted a bit differently, for this I will need
the shortname at once

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-03-19 11:46:02 +03:00
Pavel Emelyanov
30c540aae1 test.py: Rename "long" to "skip_in_debug_mode"
The "long" test will mean that it is to be started first, not
skipped, so rename "long" to avoid additional confusion

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-03-19 11:45:55 +03:00
Piotr Sarna
62c34a9085 cql: fix qualifying indexed columns for filtering
When qualifying columns to be fetched for filtering, we also check
if the target column is not used as an index - in which case there's
no need of fetching it. However, the check was incorrectly assuming
that any restriction is eligible for indexing, while it's currently
only true for EQ. The fix makes a more specific check and contains
many dynamic casts, but these will hopefully be gone once our
long planned "restrictions rewrite" is done.
This commit comes with a test.

Fixes #5708
Tests: unit(dev)
2020-03-19 10:34:16 +02:00
Tomasz Grabiec
5fe626a887 sstables: Release reserved space for sharding metadata
The intention of the code was to clear sharding metadata
chunked_vector so that it doesn't bloat memory.

The type of c is `chunked_vector*`. Assigning `{}`
clears the pointer while the intended behavior was to reset the
`chunked_vector` instance. The original instance is left unmodified
with all its reserved space.

Because of this, the previous fix had no effect because token ranges
are stored entirely inline and popping them doesn't release memory.

Fixes #4951

Tests:
  - sstable_mutation_test (dev)
  - manual using scylla binary on customer data on top of 2019.1.5

Reviewed-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <1584559892-27653-1-git-send-email-tgrabiec@scylladb.com>
2020-03-19 09:46:27 +02:00
Pekka Enberg
0d2b70798f reloc/build_reloc.sh: Remove unused functions
The is_redhat_variant() and is_debian_variant() functions are not used so
let's remove them.

Message-Id: <20200317155740.12916-1-penberg@scylladb.com>
2020-03-19 08:39:57 +01:00
Rafael Ávila de Espíndola
7401a63e92 auth: Handle permission cache not being initialized
auth::service::start can fail before _permissions_cache is
initialized, so we should not assume that it is always set.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200317002051.117832-3-espindola@scylladb.com>
2020-03-18 20:21:24 +01:00
Rafael Ávila de Espíndola
3c2851aafc test: Make sure auth_service is always stopped
An exception thrown after the start of auth_service and before
init_server_without_the_messaging_service_part returns would cause the
sharded<auth_service> destructor to assert.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200317002051.117832-2-espindola@scylladb.com>
2020-03-18 20:17:55 +01:00
Botond Dénes
e6e894d871 scylla-gdb.py: introduce scylla small-objects
When investigating OOM related cores, a common thing to do is trying to
identify the objects in a particularly heavily populated size-class.
This command is meant to help with that, providing a way to list the
objects in any size-class, in a paginated way.

Traversing the objects of a pool is done through a
`small_object_iterator` object which is also exposed to python code, to
be used in custom scripts wanting to scan all objects belonging to a
pool.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200318085437.452906-1-bdenes@scylladb.com>
2020-03-18 13:33:59 +02:00
Raphael S. Carvalho
0df8faeaa2 sstables: make delete_atomically() work with empty set
If delete_atomically() is called with an empty set for any reason,
it will fail to work because it relies on any of the sstables in
the set for getting the sstable directory.
This will be needed, in the future, when using sstable replacement
function only with new sstables.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200305144657.9440-1-raphaelsc@scylladb.com>
2020-03-18 13:29:42 +02:00
Pavel Emelyanov
da3bf20e71 main: Respect config start_native_transport option
There's such an option, and it's not taken into account
on scylla start. There's a symmetrical start_rpc one, which
is, so make both act similarly.

The default value for the option is true, so default set-ups
will not get broken.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200310140518.29410-1-xemul@scylladb.com>
2020-03-18 11:17:56 +02:00
Avi Kivity
164881696b Merge "scylla-gdb.py: scylla_memory: handle per-sg coordinator stats" from Botond
"
Since b783d40aa storage-proxy maintains separate coordinator stats per
scheduling group. This broke scylla_memory, which was still trying to
access the old global stats. This mini-series updates it to be able to
handle per-sg coordinator stats, while preserving backward compatibility
with older versions still using global stats.
"

* 'scylla-memory-per-sg-coordinator-stats/v1' of https://github.com/denesb/scylla:
  scylla-gdb.py: scylla_memory: update w.r.t. per-sg coordinator stats
  scylla-gdb.py: scylla_memory: move coordinator code to print_coordinator_stats()
2020-03-18 12:38:44 +02:00
Avi Kivity
c766f50491 Merge "Split some unit tests into smaller pieces" from Pavel E
"
The debug mode unit tests take ~half-an-hour to complete. Here's
the tests run-times top list

Test:					Time (seconds):
            ... steady tail goes here ...
test/boost/user_function_test		496
test/boost/row_cache_test		502
test/boost/view_schema_test		932
test/boost/cql_query_test		997
test/boost/mutation_reader_test		1048
test/boost/sstable_mutation_test	1417
test/boost/secondary_index_test		1468

Splitting the spike (top-5) is the primary goal. However, the
distribution of test-cases in 3 of those tests is also _very_
non-uniform, so just cutting it into equal parts doesn't work.
For example, the test_index_with_paging from the slowest one
takes ~14 minutes on its own and is the slowest test-case out
there.

So the set does this:

- moves the champion test_index_with_paging into separate file
- detaches the most heavy parts from sstable_mutation_test and
  mutation_reader_test into own tests too. The resulting split
  is still non-uniform, but it's 4 tests that run notably less
  than the 14 minutes record each
- splits the cql_query_test and view_schema_test into several
  parts in a wildcard manner to run out of the 14 min threshold
- moves some shared code into lib/

As the result, the debug mode test run takes 14.5 minutes =)
which is almost 2 times faster than it was. The dev mode run
time is not affected noticeably.

Test: well, unit(debug) and unit(dev)
"

* 'br-split-unit-tests-3-next' of https://github.com/xemul/scylla:
  test: Split view_schema_test
  test: Split cql_query_test
  test: Split mutation_reader_test
  test: Split sstable_mutation_test
  test: Split secondary_index test
2020-03-18 12:19:32 +02:00
Pavel Emelyanov
96e3d0fa36 mutation_partition: Debloat header form others
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200317191051.12623-1-xemul@scylladb.com>
2020-03-18 11:53:36 +02:00
Asias He
cdcedf5eb9 gossip: Make is_safe_for_bootstrap more strict
Consider

1. Start n1, n2 in the cluster
2. Stop n2 and delete all data for n2
3. Start n2 to replace itself with replace_address_first_boot: n2
4. Kill n2 before n2 finishes the replace operation
5. Remove replace_address_first_boot: n2 from scylla.yaml of n2
6. Delete all data for n2
7. Start n2

At step 7, n2 will be allowed to bootstrap as a new node, because the
application state of n2 in the cluster is HIBERNATE which is not
rejected in the check of is_safe_for_bootstrap. As a result, n2 will
replace n2 with different tokens and a different host_id, as if the
old n2 node was removed from the cluster silently.

Fixes #5172
2020-03-17 17:37:16 +01:00
Tomasz Grabiec
488482c55a Merge "lwt: ensure unqualified SELECT works with SERIAL cl" from Kostja
Ensure unqualified SELECT throws an appropriate exception with
SERIAL consistency level.
Since such query touches multiple partitions, we don't support it
in SERIAL mode.

Branch URL:
https://github.com/kostja/scylla/tree/gh-6016-crash-lwt-select
2020-03-17 17:24:06 +01:00
Konstantin Osipov
4978bb513d test: add a test case for SERIAL read consistency
Pass custom query options to execute_prepared and
add a test case for custom SERIAL consistency.
2020-03-17 18:58:12 +03:00
Konstantin Osipov
f5180725df lwt: check SELECT restricts partition key before accessing it
Check that SELECT statement checks there is a partition key before
accessing it when determining the shard to execute the query on.

Essentially move the check for properly restricted partition key
from storage_proxy.cc to select_statement.cc, now that we access
it earlier in the call stack.

Keep the check in storage_proxy.cc since storage_proxy::query()
has other call sites (views), which today should never use
serial consistency for its queries, but this can change in the future.

Please note that Cassandra only partially enforces SERIAL consistency
and can silently downgrade SERIAL consistency to the default
non-serial one when doing unbounded SELECTS (
https://issues.apache.org/jira/browse/CASSANDRA-15641)

Fixes #6016
2020-03-17 16:55:11 +03:00
Pavel Emelyanov
86c712a340 test: Split view_schema_test
Detach *partition_key* and *clustering_key* ones into own files.
The resulting 2 tests run ~4 minutes each, the leftover ones
complete within 11 minutes. The same -- the goal to run out of
14 minutes is reached, further splitting needs more thinking
than just wildcarding.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-03-16 20:27:45 +03:00
Pavel Emelyanov
e848d63510 test: Split cql_query_test
This detaches *like_operator*, *group_by*, *functions*
and *large* cases into own files. The split is not
uniform -- the resulting 4 tests run less than 3 minutes
each,  what's left in the origin runs ~11 minutes. But
since the goal was to get out of 14 minutes threshold
and this file contains 126 cases (the champion) so I
just did "wildcard" selection that worked.

It also required moving require_rows() helpers into a
local header.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-03-16 20:27:45 +03:00
Pavel Emelyanov
3fbd88b226 test: Split mutation_reader_test
Detach test_multishard_combining_reader_as_mutation_source into
individual file.

This particular test runs ~13 minutes. What's left in the origin
completes a bit faster.

The split also requires moving the reader_lifecycle_policy and
the dummy_partitioner into lib/

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-03-16 20:27:44 +03:00
Pavel Emelyanov
3577fa2bb8 test: Split sstable_mutation_test
Detach test_schema_changes and test_sstable_conforms_to_mutation_source
into individual files. These two take ~10 minutes each, what's left in
origin finishes within 4 minutes altogether.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-03-16 20:26:34 +03:00
Pavel Emelyanov
5b86f4be9a test: Split secondary_index test
Detach test_index_with_paging into individual file.

This particular test-case is the longest one in the suite,
it takes ~14 minutes to run, further splitting of this
test is pointless (for now) and all subsequent splits in
this set just make the resulting times less than this.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-03-16 20:26:34 +03:00
Pavel Emelyanov
14de126ff8 migration_manager: Run background schema merge in gate
The call for merge_schema_from in some cases is run in the
background and thus is not aborted/waited on shutdown. This
may result in use-after-free one of which is

merge_schema_from
 -> read_schema_for_keyspace
     -> db::system_keyspace::query
         -> storage_proxy::query
             -> query_partition_key_range_concurrent

in the latter function the proxy._token_metadata is accessed,
while the respective object can be already free (unlike the
storage_proxy itself that's still leaked on shutdown).

Related bug: #5903, #5999 (cannot reproduce though)
Tests: unit(dev), manual start-stop
       dtest(consistency.TestConsistency, dev)
       dtest(schema_management, dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Reviewed-by: Pekka Enberg <penberg@scylladb.com>
Message-Id: <20200316150348.31118-1-xemul@scylladb.com>
2020-03-16 17:41:23 +01:00
Avi Kivity
342c967b6a Merge "Introduce compacting reader" from Botond
"
Allow adding compacting to any reader pipeline. The intended users are
streaming and repair, with the goal to prevent wasting transfer
bandwidth with data that is purgeable.
No current user in the tree.

Tests: unit(dev), mutation_reader_test.compacting_reader_*(debug)
"

* 'compacting-reader/v3' of https://github.com/denesb/scylla:
  test: boost/mutation_reader_test: add unit test for compacting_reader
  test: lib/flat_mutation_reader_assertions: be more lenient about empty mutations
  test: lib/mutation_source_test: make data compaction friendly
  test: random_mutation_generator: add generate_uncompactable mode
  mutation_reader: introduce compacting_reader
2020-03-16 16:41:50 +02:00
Botond Dénes
837b79c265 test: boost/mutation_reader_test: add unit test for compacting_reader 2020-03-16 13:58:13 +02:00
Botond Dénes
3b482af33d test: lib/flat_mutation_reader_assertions: be more lenient about empty mutations
When expecting a mutation that compacts to an empty one, allow it to be
not produced at all. After all, compaction normally doesn't even emit
empty partitions.
2020-03-16 13:58:13 +02:00
Botond Dénes
1ab45e15a0 test: lib/mutation_source_test: make data compaction friendly
Currently the mutation source test suite may generate data that is
compactable. This poses a problem for the next patch, where we want to
use it to test `compacting_reader`, a reader which compacts data as it
reads it. When the input is compactable, this will introduce artificial
differences, failing the tests.
To allow also testing such readers, make sure data is not compactable,
i.e. compacting it will not change it.
The goal of the mutation source test suite is not to exercise compaction
logic, so this will not take anything away from its value.
2020-03-16 13:58:13 +02:00
Botond Dénes
c4fab16723 test: random_mutation_generator: add generate_uncompactable mode
The random mutation generator currently generates data and tombstones
with random timestamps selected from a pre-determined range. This
results in mutations where tombstones often cover each other and data.
There is nothing wrong with this, as this is how real data is too.
However for certain tests this is problematic, as compacting the
mutations will result in different mutations. To cater for these users
too, introduce a `generate_uncompactable` option. When set to `yes`, the
generated mutations will be uncompactable, i.e. no tombstone will cover
lower-level tombstones and no tombstone will cover data. The mutations
will not change after being compacted.
2020-03-16 13:58:13 +02:00
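A hedged sketch of what "uncompactable" means in the commit above, not the generator's real code. It assumes a tombstone covers anything whose timestamp is less than or equal to its own; the helper names are hypothetical. Picking strictly increasing timestamps from the partition tombstone down to the data ensures nothing covers anything.

```python
import random


def uncompactable_timestamps(rng, low=0, high=1000):
    """Hypothetical helper: pick timestamps so that nothing covers anything.

    Assumes a tombstone covers whatever has a timestamp less than or
    equal to its own.
    """
    # Three distinct values, ascending: the higher-level the tombstone,
    # the older its timestamp, and the data is the newest of all.
    partition_ts, row_ts, cell_ts = sorted(rng.sample(range(low, high), 3))
    return partition_ts, row_ts, cell_ts


def covers(tombstone_ts, other_ts):
    # Assumed covering rule for this sketch only.
    return other_ts <= tombstone_ts


rng = random.Random(42)
partition_ts, row_ts, cell_ts = uncompactable_timestamps(rng)
```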
Botond Dénes
8286a0b1bd mutation_reader: introduce compacting_reader
Compacting reader compacts the output of another reader on-the-fly.
Performs compaction-type compaction (`compact_for_sstables::yes`).
It will be used in streaming and repair to eliminate purgeable data from
the stream, thus preventing wasted transfer bandwidth.
2020-03-16 13:58:13 +02:00
Nadav Har'El
35d95d6887 merge: Add postimage implementation
Merged pull request https://github.com/scylladb/scylla/pull/5996 from
Calle Wilund:

Fixes #4992

Implements post-image support by synthesizing it from
pre-image + delta.

Post-image data differs from the delta data in two ways:

1.) It merges non-atomics into an actual result value
2.) It contains all columns of the row, not just
those affected by the update.

For a non-atomic field, the post-image value of a column
is either the pre-image or the delta (maybe null)

Tested by adding post-image checks to pre-image test
and collection/udt tests
2020-03-16 13:42:07 +02:00
Calle Wilund
0a3383c090 cdc: Add postimage implementation
Fixes #4992

Implements post-image support by synthesizing it from
pre-image + delta.

Post-image data differs from the delta data in two ways:

1.) It merges non-atomics into an actual result value
2.) It contains _all_ columns of the row, not just
    those affected by the update.

For a non-atomic field, the post-image value of a column
is either the pre-image or the delta (maybe null)

Tested by adding post-image checks to pre-image test
and collection/udt tests
2020-03-16 09:21:06 +00:00
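A simplified sketch of synthesizing a post-image from pre-image + delta, as the commit above describes. It is not the CDC implementation: rows are modelled as plain dicts, Python sets stand in for non-atomic columns, and the helper name `synthesize_postimage` is hypothetical. It only covers additions to a collection, not element removals.

```python
def synthesize_postimage(preimage, delta):
    """Hypothetical helper: rows are dicts of column name -> value.

    Sets stand in for non-atomic (collection) columns; a delta value of
    None means the column was deleted.
    """
    post = dict(preimage)  # start from all columns of the row
    for column, change in delta.items():
        old = post.get(column)
        if isinstance(change, set) and isinstance(old, set):
            post[column] = old | change   # merge non-atomic values
        else:
            post[column] = change         # otherwise the delta wins (maybe None)
    return post


pre = {"pk": 1, "name": "a", "tags": {"x"}, "note": "keep"}
delta = {"name": None, "tags": {"y"}}
post = synthesize_postimage(pre, delta)
```

Here `post` holds every column of the row, including `note`, which the update did not touch.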
Calle Wilund
40114f8233 cql3::untyped_result_set: Add bytes_view_opt access to fields
For quick access and convenient live-checks
2020-03-16 09:21:06 +00:00
Calle Wilund
ca7046256f schema: Add "columns" accessor for columns by kind
To prevent switch-code everywhere.
2020-03-16 09:21:06 +00:00
Avi Kivity
ee9df91a76 Merge "Allow setting partitioner per table" from Piotr
"
This PR makes it possible to enable the usage of different partitioner for each table. If no table-specific partitioner is set for a given table then a default partitioner is used.

The PR is composed of the following parts:

 - Introduction of schema::get_partitioner that still returns dht::global_partitioner
 - Replacement of all the usage of dht::global_partitioner with schema::get_partitioner
 - Making it possible to set table-specific partitioner in a schema_builder
 - Remove all the places that were setting default partitioner except for main.cc (mostly tests)
 - Move default partitioner from i_partitioner to schema.cc and hide it from the rest of the codebase
 - Remove dht::global_partitioner

After this PR there's no such thing as global partitioner at all. There is only a default partitioner but it still has to be accessed through schema::get_partitioner.

There are some intermediate states in which i_partitioner is stored as shared_ptr in the schema but the final version keeps it by const&.

The PR does not enable per table partitioner end-to-end. Just the internals of the single node are covered. I still have to deal with:

 - Making sure a table has the same partitioner on each node
 - Allowing user to set up a table-specific partitioner on table
 - Signal driver about what partitioner is used by a given table
 - Persist partitioner info for each table that does not use default partitioner.
Fixes #5493

Tests: unit(dev, release, debug), dtest(byo)
"

* 'per_table_partitioner' of https://github.com/haaawk/scylla:
  schema: drop optional from _partitioner field
  make_multishard_combining_reader: stop taking partitioner
  split_range_to_single_shard: stop taking partitioner as argument
  tests: remove unused murmur3 includes
  partitioner: move default_partitioner to schema.cc
  partitioner: hide dht::default_partitioner
  schema: include partitioner name in scylla tables mutation
  schema: make it possible to set custom partitioner
  scylla_tables: add partitioner column
  schema_features: add PER_TABLE_PARTITIONERS feature
  features: add PER_TABLE_PARTITIONERS feature
2020-03-16 11:13:47 +02:00
Avi Kivity
cb523c48cd Update seastar submodule
* seastar 47d929dd1...3c498abca (5):
  > reactor: Use do_with to save stack space
  > reactor: Extract code into a schedule_retry helper
  > reactor: Move an io_event buffer out of the stack
  > temporary_buffer: fix typo in argument type in comparison operators
  > tests: tls_test: add missing include <iostream>
2020-03-16 11:02:50 +02:00
Rafael Ávila de Espíndola
69874f4330 feature_service: Remove default constructor
This makes sure that feature_config_from_db_config is used for both
tests and main.cc.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200312153453.37282-2-espindola@scylladb.com>
2020-03-16 11:01:15 +02:00
Rafael Ávila de Espíndola
7c26eb61a3 feature_service: Initialize local variable
The use of an uninitialized variable was not being noticed because
this is only used by main.cc.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200312153453.37282-1-espindola@scylladb.com>
2020-03-16 11:01:15 +02:00
Rafael Ávila de Espíndola
517a01a3f6 utils: Use sstring as keys in nonstatic_class_registry
Now that seastar::string::compare has been updated, it is possible to
use sstring for this.

This reverts commit 01fe766f1f.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200311005219.280737-1-espindola@scylladb.com>
2020-03-16 11:01:15 +02:00
Rafael Ávila de Espíndola
624573a219 configure: Warn on large stacks
This adds a warning with a different limit in each mode. The limit is
picked as 1KiB lower than the value where no warning would be printed.

This makes it easy to spot the worst offender. With that we can either
fix it or silence the warning once we are sure we can handle large
frames in that context.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200311205300.324383-1-espindola@scylladb.com>
2020-03-16 11:01:15 +02:00
Piotr Sarna
f43e68b383 alternator: hook admission control to alternator server
From now on, alternator requests use the memory limiter semaphore
to control the amount of memory used by alternator requests.
2020-03-16 08:43:49 +01:00
Piotr Sarna
7eb6d5545d alternator: add addmission control stats entry
The entry will be bumped if admission control was forced
to block the request from being served.
2020-03-16 07:44:26 +01:00
Piotr Sarna
a1ea650d83 alternator: add memory limiter to alternator server
With the memory limiter semaphore, the server will be able to apply
admission control to alternator requests.
2020-03-16 07:44:26 +01:00
Piotr Sarna
781fbe8070 alternator: add service permit to callbacks
As a first step towards introducing admission control, the API
of alternator callbacks is extended with an additional 'permit'
parameter.
2020-03-16 07:44:25 +01:00
Piotr Sarna
cb5fded9c2 storage_service: add memory limiter semaphore getter
The memory limiter semaphore is going to be useful for limiting
alternator memory as well, so it's hereby exposed via a getter.
2020-03-16 07:34:23 +01:00
Raphael S. Carvalho
458ef4bb06 sstables: Fix incorrect calculation of Compaction Backlog
The bug is that we failed to implement this part of the formula:
(T - C) * log4(T)

We were incorrectly implementing it as:
(T - C) * log4(T - C)

So it could result in a backlog being calculated as negative when it
should actually be positive, or backlog being lower than expected.
BTW, we do protect against negative backlog after commit 3e08bd17f0.

Given that STCS backlog tracker is inherited by TWCS and LCS trackers,
all compaction strategies are affected.

The formula to calculate the aggregate backlog is:
A = (T - C) * log4(T) - Sum(i = 0...N) { (Si - Ci)* log4(Si) }.

For example, negative backlog is calculated on a tested scenario where T
was 3129, C was 2337 and Sum(i = 0...N) { (Si - Ci)* log4(Si) } resulted
in 4222.53.
(T - C) * log4(T - C) = (3129 - 2337) * log4(3129 - 2337) = 3813.23
So backlog is negative because A = 3813.23 - 4222.53 = -409.302

But it should actually be calculated as follows:
(T - C) * log4(T) = (3129 - 2337) * log4(3129) = 4598.15
And the correct backlog is positive, as A = 4598.15 - 4222.53 = 375.621

Fixes #6021.

tests: unit(dev)

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200315153711.23302-1-raphaelsc@scylladb.com>
2020-03-15 18:16:01 +02:00
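The arithmetic in this commit message can be checked with a short script. This only reproduces the numbers quoted above (T, C and the per-sstable sum are taken as given); it says nothing about whether the formula itself is right, and the log above records that this commit was later reverted.

```python
import math


def log4(x):
    return math.log(x, 4)


T = 3129                    # value from the commit message
C = 2337                    # value from the commit message
per_sstable_sum = 4222.53   # Sum(i = 0...N) { (Si - Ci) * log4(Si) }, taken as given

old_term = (T - C) * log4(T - C)   # what the code computed before this commit
new_term = (T - C) * log4(T)       # what this commit computes instead

old_backlog = old_term - per_sstable_sum
new_backlog = new_term - per_sstable_sum
print(round(old_term, 2), round(old_backlog, 2))
print(round(new_term, 2), round(new_backlog, 2))
```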
Kamil Braun
aa72a1c556 cql3: when altering table, keep old values of unchanged extensions
When the user performed

alter ks.t with compaction = {...}

the values of most other options, which were not specified in the
statement, e.g. compression, were left unchanged. That wasn't true for
extension options however: for example, the "cdc" option was removed.

This commit fixes the behavior to keep the old values of extension
options not specified in the alter statement.
2020-03-15 17:45:30 +02:00
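A minimal sketch of the fixed behavior, assuming extension options can be modelled as a dict from extension name to its options. The helper name `altered_extensions` and the option values are hypothetical; the point is only that unspecified extensions are carried over instead of dropped.

```python
def altered_extensions(old_extensions, specified):
    """Hypothetical helper: both arguments map extension name -> options."""
    merged = dict(old_extensions)   # keep everything the statement did not mention
    merged.update(specified)        # apply only what the statement specified
    return merged


old = {"cdc": {"enabled": "true"}}
# An ALTER that only changes compaction specifies no extension options.
after_alter = altered_extensions(old, specified={})
```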
Piotr Dulikowski
b1e8170bf9 cdc: add tracing
Adds information about the stages of CDC mutation augmentation to
tracing sessions.
2020-03-15 11:54:10 +01:00
Asias He
7ac9e0f2a1 gossip: Print CDC_STREAMS_TIMESTAMP correctly
I saw UNKNOWN application state in the logs:

INFO  2020-03-06 11:09:48,931 [shard 0] storage_service - Update
system.peers table: endpoint=127.0.0.2, app_state=CACHE_HITRATES, versioned_value=Value(,14)
INFO  2020-03-06 11:09:48,931 [shard 0] storage_service - Update
system.peers table: endpoint=127.0.0.2, app_state=SCHEMA_TABLES_VERSION, versioned_value=Value(3,15)
INFO  2020-03-06 11:09:48,931 [shard 0] storage_service - Update
system.peers table: endpoint=127.0.0.2, app_state=RPC_READY, versioned_value=Value(0,16)
INFO  2020-03-06 11:09:48,931 [shard 0] storage_service - Update
system.peers table: endpoint=127.0.0.2, app_state=VIEW_BACKLOG, versioned_value=Value(,17)
INFO  2020-03-06 11:09:48,931 [shard 0] storage_service - Update
system.peers table: endpoint=127.0.0.2, app_state=SHARD_COUNT, versioned_value=Value(1,30)
INFO  2020-03-06 11:09:48,931 [shard 0] storage_service - Update
system.peers table: endpoint=127.0.0.2, app_state=IGNOR_MSB_BITS, versioned_value=Value(12,31)
INFO  2020-03-06 11:09:48,931 [shard 0] storage_service - Update
system.peers table: endpoint=127.0.0.2, app_state=UNKNOWN, versioned_value=Value(1583371936128,20)

It turned out it was CDC_STREAMS_TIMESTAMP.

$ nodetool gossipinfo|grep 1583371936128
  X8:1583371936128
  X8:1583371936128

Fixes #5992
2020-03-15 11:51:35 +01:00
Piotr Jastrzebski
5bbb826c49 schema: drop optional from _partitioner field
Always set the field to the default value if no
table specific partitioner has been set.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-03-15 10:25:21 +01:00
Piotr Jastrzebski
924ed7bb1c make_multishard_combining_reader: stop taking partitioner
The function already takes schema so there's no need
for it to take partitioner. It can be obtained using
schema::get_partitioner

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-03-15 10:25:20 +01:00
Piotr Jastrzebski
4b7fb323c3 split_range_to_single_shard: stop taking partitioner as argument
The function already takes schema so we don't need partitioner.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-03-15 10:25:20 +01:00
Piotr Jastrzebski
f99fd35f53 tests: remove unused murmur3 includes
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-03-15 10:25:20 +01:00
Piotr Jastrzebski
22daa262ee partitioner: move default_partitioner to schema.cc
Make it inaccessible to other compilation units.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-03-15 10:25:20 +01:00
Piotr Jastrzebski
7064f6b831 partitioner: hide dht::default_partitioner
Remove last usage of this global outside i_partitioner.cc
and hide it inside the compilation unit.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-03-15 10:25:20 +01:00
Piotr Jastrzebski
57b69fb804 schema: include partitioner name in scylla tables mutation
There are two results of this patch:
1. New partitioner name column is persisted on node's disk in scylla_tables
2. New partitioner name column is included into schema digest

This is achieved by including this new column in scylla tables mutation.
For that we:
1. Add partitioner name to the result of make_scylla_tables_mutation.
   If table does not have a specific partitioner set and uses default
   partitioner then we don't include the name of such default partitioner.
   Only the name of custom partitioner is added if a table has one.
2. In create_table_from_mutations we check whether scylla tables mutation
   has a partitioner name set. If so then we use it as a parameter for
   schema_builder.

Note that previous patches have ensured that this new column will be included
into schema digest only after the whole cluster supports per table partitioners.
Before that, during rolling upgrade, new partitioner name column is hidden and
not shared with other nodes.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-03-15 10:25:20 +01:00
Piotr Jastrzebski
1d6cec1b0a schema: make it possible to set custom partitioner
schema_builder::with_partitioner can be used now to
set custom partitioner on a table.
If no such partitioner is set, global partitioner is
still used.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-03-15 10:25:20 +01:00
Piotr Jastrzebski
f83ff8fda1 scylla_tables: add partitioner column
Following commits make it possible to set a specific
partitioner for a table. We want to persist that information
and include it into schema digest. For that a new column
in scylla_tables is needed. This commit adds such column.

We add the new column to scylla_tables because it's a Scylla
specific extension.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-03-15 10:25:20 +01:00
Piotr Jastrzebski
782f2caf41 schema_features: add PER_TABLE_PARTITIONERS feature
With per table partitioners, partitioner name will be a part
of table schema. To allow rolling upgrade we need to perform
special logic that hides new partitioner name schema column
during the upgrade. This commit adds new schema feature that
controls this logic.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-03-15 10:25:20 +01:00
Piotr Jastrzebski
90df9a44ce features: add PER_TABLE_PARTITIONERS feature
This new feature is required because we now allow
setting partitioner per table. This will influence
the digest of table schema so we must not include
partitioner name into the digest unless we know that
the whole cluster already supports per table partitioners.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-03-15 10:25:20 +01:00
Botond Dénes
5207f530ba scylla-gdb.py: scylla smp-queues: ignore unresolvable/unmatching symbols
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200313160444.320253-1-bdenes@scylladb.com>
2020-03-15 10:41:16 +02:00
Botond Dénes
a85c3aa839 scylla-gdb.py: introduce sharded_local convenience function
To conveniently retrieve the local instance of a sharded object.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200313160106.319743-1-bdenes@scylladb.com>
2020-03-15 10:41:16 +02:00
Botond Dénes
0e9df01ba3 scylla-gdb.py: downcast_vptr(): make compatible with python < 3.6
Subscript operation `__getitem__()` was only added to re.match objects
in 3.6. To support previous versions, use `groups()` method to obtain
the desired group.
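
A minimal sketch of the portable access style, assuming a made-up pattern and input string (neither is taken from scylla-gdb.py):

```python
import re

# Hypothetical pattern and input, chosen only to illustrate the access styles.
match = re.match(r"vtable for (.*) \+ \d+", "vtable for seastar::reactor + 16")

# match[1] relies on Match.__getitem__, which only exists in Python >= 3.6.
# These two forms also work on older Python 3 releases:
name = match.groups()[0]   # groups() returns a tuple of all captured groups
same = match.group(1)      # group(1) returns the first captured group

print(name)
```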

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200313160025.319464-1-bdenes@scylladb.com>
2020-03-15 10:41:15 +02:00
Nadav Har'El
635e6d887c materialized views: fix corner case of view updates used by Alternator
While CQL does not allow creation of a materialized view with more than one
base regular column in the view's key, in Alternator we do allow this - both
partition and clustering key may be a base regular column. We had a bug in
the logic handling this case:

If the new base row is missing a value for *one* of the view key columns,
we shouldn't create a view row. Similarly, if the existing base row was
missing a value for *one* of the view key columns, a view row does not
exist and doesn't need to be deleted.  This was done incorrectly, and made
decisions based on just one of the key columns, and the logic is now
fixed (and I think, simplified) in this patch.

With this patch, the Alternator test which previously failed because of
this problem now passes. The patch also includes new tests in the existing
C++ unit test test_view_with_two_regular_base_columns_in_key. This test
was already supposed to be testing various cases of two-new-key-columns
updates, but missed the cases explained above. These new tests failed
badly before this patch - some of them had clean write errors, others
caused crashes. With this patch, they pass.
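
The rule described above can be sketched as follows. This is a simplified illustration in Python, not the actual view-update code: rows are modelled as plain dicts, and the function names are hypothetical.

```python
def view_row_exists(base_row, view_key_columns):
    """A view row exists only if *every* view key column has a value."""
    if base_row is None:
        return False
    return all(base_row.get(col) is not None for col in view_key_columns)

def plan_view_update(old_row, new_row, view_key_columns):
    """Return (delete_old_view_row, create_new_view_row) for one base update."""
    return (
        view_row_exists(old_row, view_key_columns),  # something to delete?
        view_row_exists(new_row, view_key_columns),  # something to create?
    )

# The old row lacks "b", so no view row existed; the new row has both columns.
print(plan_view_update({"p": 1, "a": 1, "b": None},
                       {"p": 1, "a": 1, "b": 2},
                       ["a", "b"]))
```

Checking all key columns together, rather than one of them, is the point of the sketch; the case where old and new view keys are equal is deliberately left out.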

Fixes #6008.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200312162503.8944-1-nyh@scylladb.com>
2020-03-15 07:57:33 +01:00
Avi Kivity
07ddbf6e54 Merge "Reduce our dependence on sstring" from Rafael
"
It doesn't look like we will be able to switch to std::string just
yet, but when it is not too inconvenient we can try to reduce our
dependence so that attempting the switch again in the future is
easier.
"

* 'espindola/sstring-api' of https://github.com/espindola/scylla:
  redis: Use scattered_message::append(std::string_view)
  everywhere: Use uninitialized_string instead of sstring::initialized_later
  compressor: Add an explicit cast to const sstring&
  everywhere: Be more explicit that we don't want std::make_shared
  cql3: Don't use sstring::reset
  everywhere: Don't assume sstring::begin() and sstring::end() are pointers
2020-03-14 16:29:42 +02:00
Avi Kivity
6b747f4673 database: avoid creating thread in make_directory_for_column_family()
make_directory_for_column_family() is used in a parallel_for_each() in
parse_system_tables(). Because parallel_for_each does not preempt
in the initial execution of its input function, and because each thread
allocates 128k for the stack, we end up allocating many hundreds of
megabytes if there are many tables.

This happens early during boot and will only cause problems if
there are 5,000 tables per gigabyte of shard memory, an unlikely
combination that will probably fail later, but still it is better to
avoid unnecessary large allocations.

This was developed in order to fix #6003, until it was discovered that
c020b4e5e2 ("logalloc: increase capacity of _regions vector
outside reclaim lock") is the real fix.

Message-Id: <20200313093603.1366502-1-avi@scylladb.com>
2020-03-13 13:46:45 +02:00
Rafael Ávila de Espíndola
a1ca83b067 gms: Fix static initialization order problem
In test_services.cc there is

gms::feature_service test_feature_service;

And the feature_service constructor has

 , _lwt_feature(*this, features::LWT)

But features::LWT is a global sstring constructed in another file.

Solve the problem by making the feature strings constexpr
std::string_view.

I found the issue while trying to benchmark the std::string switch.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Acked-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200309225749.36661-1-espindola@scylladb.com>
2020-03-13 12:37:22 +02:00
Botond Dénes
13e20fe6be scylla-gdb.py: scylla_memory: update w.r.t. per-sg coordinator stats
Since b783d40aa storage-proxy maintains separate coordinator stats per
scheduling group. This broke scylla_memory, which was still trying to
access the old global stats. Update it to print the new per-scheduling
group stats when they are available and the old global ones when not.
Scheduling groups for which all relevant metrics are 0 are omitted from
the printout to reduce noise.
2020-03-13 10:57:51 +02:00
Botond Dénes
ca84c2f566 scylla-gdb.py: scylla_memory: move coordinator code to print_coordinator_stats()
This code will have to be revamped. While at it move it to its own
method to reduce the clutter in `invoke()`.
2020-03-13 10:54:01 +02:00
Avi Kivity
7311d1b177 Update seastar submodule
* seastar 664c911b4c...47d929dd1b (6):
  > sstring: Simplify operator=
  > sstring: Deprecate reset
  > sstring: Pass string_view to compare
  > sstring: Move exception code out of line
  > reactor: remove unused variable
  > reactor: always initialize smp_poller
2020-03-12 21:37:05 +01:00
Piotr Sarna
e8871181eb scripts: add a script for pulling GitHub pull requests
In order to avoid the UI merge button which tends to
mess up commit authors, a simple script for pulling
a PR from GitHub is added.
Example usage:
 $ git fetch; git checkout origin/next
 $ ./scripts/pull_github_pr.sh 6007

Message-Id: <1fa79c8be47b5660fc24a81fc0ab381aa26d98af.1584014944.git.sarna@scylladb.com>
2020-03-12 21:37:05 +01:00
Raphael S. Carvalho
34426d1497 sstables: Fix off-by-one when checking for max_data_segregation_window_count
Make sure max size of known windows will respect max_d_s_w_c.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200305165014.16022-1-raphaelsc@scylladb.com>
2020-03-12 14:11:18 +02:00
Nadav Har'El
8e4520b2b3 alternator-test: add xfailing test for issue 6008
This patch adds a test, test_gsi.py::test_gsi_missing_attribute_3,
reproducing issue #6008. The issue is about a GSI with *two*
regular base columns becoming key columns in a view, and we have
a write failure when writing an item with one of these attributes
missing.

The test passes on DynamoDB, currently xfails on Alternator.

Refs #6008.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200312064131.16046-1-nyh@scylladb.com>
2020-03-12 10:07:58 +01:00
Nadav Har'El
77444a38a1 alternator: allow consistent reads on LSI - but not on GSI
Recently, Materialized Views were modified (see issue #4365) so that local
view updates (when both base and view replicas are the same node) are
synchronous. In particular, when the view's partition key is the same as
the base table's, view writes are synchronous: A write now only returns
after CL copies of the view data have been written.

Alternator's LSI have exactly this case (same partition key as the base).
This makes strongly-consistent (CL=LOCAL_QUORUM) reads in Alternator work
correctly, so we update the documentation accordingly to no longer say
that we don't support this DynamoDB feature.

However unlike LSIs, for GSIs strongly-consistent reads are still not
supported, and should not be supported (they are also not supported by
DynamoDB). Such reads should generate an error. So this patch fixes this
too. A GSI test which tested that strongly consistent reads are forbidden,
which used to xfail, now passes so the patch removes the "xfail".

Finally, we can simplify the LSI tests by using consistent reads instead of
eventually-consistent reads with retries. Beyond simplifying the test, it's
also an opportunity to *use* strongly-consistent reads and make sure that
they work (while, as mentioned above, similar reads for GSIs are refused).

Fixes #5007

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200311170446.28611-1-nyh@scylladb.com>
2020-03-12 09:18:00 +01:00
Takuya ASADA
086f0ffd5a scylla_raid_setup: create missing directories
We need to create hints, view_hints, saved_caches directories
on RAID volume.

Fixes #5811
2020-03-12 09:29:29 +02:00
Takuya ASADA
399ff24efd docker: apply scylla-jmx sysconfig file on scylla-jmx service
Apply scylla-jmx sysconfig file on scylla-jmx service, to allow customizing
jmx parameters.

Fixes #5939
2020-03-12 09:27:23 +02:00
Avi Kivity
86415cf98a Update seastar submodule
* seastar 95f4277c16...664c911b4c (4):
  > tls_test: Use uninitialized_string instead of initialized_later
  > tls: Fix race and stale memory use in delayed shutdown
Fixes #5759 (maybe)
  > tls: Re-enable TLS test and fix build+run
  > tls: Set server name for client connection if available
2020-03-11 19:25:36 +02:00
Avi Kivity
c020b4e5e2 logalloc: increase capacity of _regions vector outside reclaim lock
Reclaim consults the _regions vector, so we don't want it moving around while
allocating more capacity. For that we take the reclaim lock. However, that
can cause a false-positive OOM during startup:

1. all memory is allocated to LSA as part of priming (2baa16b371)
2. the _regions vector is resized from 64k to 128k, requiring a segment
   to be freed (plenty are free)
3. but reclaiming_lock is taken, so we cannot reclaim anything.

To fix, resize the _regions vector outside the lock.

Fixes #6003.
Message-Id: <20200311091217.1112081-1-avi@scylladb.com>
2020-03-11 12:29:31 +02:00
Botond Dénes
931d2fca45 scylla-gdb.py: std_list: __len__(): support C++11 ABI
In theory the C++11 ABI should already have a size field but it does not
in the version of the C++ standard library shipped with scylla 2019.1.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200225162337.112582-1-bdenes@scylladb.com>
2020-03-11 10:51:05 +02:00
Botond Dénes
0909dd3d11 scylla-gdb.py: scylla_sstables: fix copypasta in name passed to argparse
The description is probably from the command this snippet was copied
from originally.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200310141025.90051-1-bdenes@scylladb.com>
2020-03-11 10:49:34 +02:00
Botond Dénes
10944689bc scylla-gdb.py: resolve(): don't attempt to match failed symbols
Currently if `startswith` is passed to `resolve()` it will
unconditionally try to match the resolved symbol name against it. This
will of course fail when the symbol fails to resolve and `name` is
`None`. Return early when this happens to prevent the unnecessary
prefix matching.
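
A minimal sketch of the early return, assuming a hypothetical `lookup` callable that stands in for the gdb symbol lookup (the real `resolve()` signature is not shown in this log):

```python
def resolve(lookup, addr, startswith=None):
    """Resolve addr to a symbol name, optionally requiring a name prefix."""
    name = lookup(addr)  # hypothetical: returns None when resolution fails
    if name is None:
        # Nothing resolved, so there is nothing to prefix-match against.
        return None
    if startswith is not None and not name.startswith(startswith):
        return None
    return name

# Toy symbol table used only for this illustration.
symbols = {0x1000: "vtable for seastar::reactor"}
print(resolve(symbols.get, 0x1000, startswith="vtable for "))
print(resolve(symbols.get, 0x2000, startswith="vtable for "))
```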

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200310140918.88928-1-bdenes@scylladb.com>
2020-03-11 10:48:44 +02:00
Botond Dénes
0da517ca93 scylla-gdb.py: get_text_range(): make compatible with >=3.0
The current method of obtaining the text range based on a known vptr
(`reactor::_backend`) was based on branch-2019.1, where
`reactor::_backend` is a value member. However in >=3.0
`reactor::_backend` is a `std::unique_ptr<>`. Adapt the code to work for
both.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200310135957.86261-1-bdenes@scylladb.com>
2020-03-11 10:46:40 +02:00
Nadav Har'El
8d161cac87 merge: Allow synchronous view updates for local views
Merged patch series by Piotr Sarna:

This series makes view updates synchronous, as long as the update
is going to be applied locally.
With this feature, local secondary indexes and, more generally,
materialized views with partition keys same as in the base table
could enjoy more robust consistency.
This series comes with a cql test, not common for materialized
views, which usually require eventual consistency checks. With
synchronous updates however, the test can simply check view values
right after updating the base table.

Fixes #4365
Refs #5007
Tests: unit(dev), manually via inserting sleeps and debug messages,
       to make sure that local view updates are actually waited for

Piotr Sarna (4):
  db,view: drop default parameter for mutate_MV::allow_hints
  db,view: move putting view updates to background to mutate_MV
  db,view: perform local view updates synchronously
  test: add a simple test for synchronous local view updates
2020-03-11 10:29:16 +02:00
Piotr Sarna
8d2555673f test: add a simple test for synchronous local view updates
With synchronous local view updates enabled, local materialized views
can be queried right after base table insertions, without the risk
of reading stale values.
2020-03-11 09:15:57 +01:00
Piotr Sarna
2061e6a9cc db,view: perform local view updates synchronously
Local view updates (updates applied to a local node,
without remote communication) are from now on performed
synchronously - which adds consistency guarantees, as a local
write failure will be returned to the client instead of being
silently ignored.
2020-03-11 09:05:56 +01:00
Piotr Sarna
fd49fd773c db,view: move putting view updates to background to mutate_MV
Currently, launching view updates as an asynchronous background job
is done via not waiting for mutate_MV() future in
table::generate_and_propagate_view_updates. That has a big downside,
since mutate_MV() handles *all* view updates for *all* views of a table,
so it's not possible to wait for each view independently.
Per-view granularity is required in order to implement synchronous
view updates of local views - because then we'll synchronously
wait for all views that write to a local node (due to having a matching
partition key with the base), while remote view updates will still
be sent asynchronously.
In order to do that, instead of not waiting for mutate_MV,
we do wait for it properly, but instead launch the asynchronous,
unwaited-for futures inside mutate_MV.
Effectively that means no changes for view updates so far - all updates
will be fired in the background. Later, another patch will introduce
a way to wait for selected updates to finish.
2020-03-11 09:05:56 +01:00
Piotr Sarna
3b3659e8cd db,view: drop default parameter for mutate_MV::allow_hints
Default parameters are considered harmful, and as part of a cleanup
before editing view.cc code, a default value for allow_hints parameter
is removed.
2020-03-11 09:05:56 +01:00
Rafael Ávila de Espíndola
d5bcb5a974 redis: Use scattered_message::append(std::string_view)
This just moves the copy to append instead of doing it in the caller.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-03-10 13:18:54 -07:00
Rafael Ávila de Espíndola
80d969ce31 everywhere: Use uninitialized_string instead of sstring::initialized_later
This is just a trivial wrapper over initialized_later when using
sstring, but also works when std::string is used.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-03-10 13:17:49 -07:00
Rafael Ávila de Espíndola
76f4fee65b compressor: Add an explicit cast to const sstring&
Some difference on how exactly the operator== is declared for sstring
versus std::string requires this change if we convert from sstring to
std::string.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-03-10 13:13:48 -07:00
Rafael Ávila de Espíndola
c0072eab30 everywhere: Be more explicit that we don't want std::make_shared
If sstring is made an alias to std::string ADL causes std::make_shared
to be found. Explicitly ask for ::make_shared.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-03-10 13:13:48 -07:00
Rafael Ávila de Espíndola
ad9f17bd92 cql3: Don't use sstring::reset
There is no reset in std::string, so don't depend on a sstring only
feature.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-03-10 13:13:48 -07:00
Rafael Ávila de Espíndola
caef2ef903 everywhere: Don't assume sstring::begin() and sstring::end() are pointers
If we switch to using std::string we have to handle begin and end
returning iterators.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-03-10 13:13:48 -07:00
Avi Kivity
0cb7182768 Update seastar submodule
* seastar 5eaec672a2...95f4277c16 (1):
  > Merge "Add an option for making sstring an alias to std::string" from Rafael
2020-03-10 18:38:37 +02:00
Gleb Natapov
cd73f552b9 storage_service, database: do not move sharded services
It may not be safe to move sharded services, so it will be prohibited in
the future seastar update. Remove all current cases where we do it.

Fixes #5814.

Message-Id: <20200301095423.GY434@scylladb.com>
2020-03-10 12:51:02 +02:00
Tomasz Grabiec
3548e85ff7 Merge "features: Properly resolve when_enabled futures on stop" from Pavel E.
If the feature service is stopped without enabling some features,
the latter may end up with "broken promise" exception on futures
attached to the _pr promise. Fix this by switching the only user
of it onto 'listener' API and remove future-based one.

Tests: unit(debug), manual start-stop and aborted-start
2020-03-10 10:09:24 +02:00
Juliusz Stasiewicz
3cc3233281 test/cdc: test that LWT generates CDC logs
Tests #5952
Refs #5869
2020-03-10 08:33:49 +01:00
Raphael S. Carvalho
899bb230e2 sstable_resharding_test: fix sstable_resharding_strategy_tests with odd smp count
leveled_compaction_strategy_strategy::get_resharding_jobs() returns compaction
jobs, each containing at most smp::count ssts, so calculation is wrong if
smp count is an odd number.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Acked-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200305161003.14424-1-raphaelsc@scylladb.com>
2020-03-09 17:52:53 +02:00
Raphael S. Carvalho
d895f5e131 sstables/stcs: kill FIXME
For the purpose of determining size tiers, it doesn't matter whether
bytes_on_disk() or data_size() is used.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200302142513.10136-1-raphaelsc@scylladb.com>
2020-03-09 15:47:48 +02:00
Avi Kivity
8af6dabbf0 Merge "Decouple cql_config from storage_service" from Pavel E
"
The cql_config is needed by storage_service to feed it to
thrift/transport servers. These servers, in turn, put the
config onto query_options. The final goal of this config
reference is the guts of query_processor (but currently it's
only used by restrictions)

This way is rather long and confusing. It seems more natural
to keep the cql_config on its main "user" -- query processor.

This patch set does so. However, in order to push the config
into its current usage places a huge refactoring is needed --
most of the classes in cql3/statements and cql3/restrictions.
It's much more handy to continue keeping it via query_options,
so the query_processor is equipped with the method to return
the reference on the config to those initializing query_options.

Tests: unit(debug)
"

* 'br-clean-client-services-from-cql-config-2' of https://github.com/xemul/scylla:
  storage_service: Forget cql_config
  transport: Forget cql_config
  thrift: Forget cql_config
  query_processor: Carry reference on cql_config
2020-03-09 15:06:59 +02:00
Calle Wilund
5c743bfd53 cdc: rename inner "process_cells" to avoid confusion
Two lambdas should not share a name in the same function.
2020-03-09 13:06:32 +00:00
Konstantin Osipov
9c009441e0 test.py: do not override environment options
Do not reset user-defined environment options for ASAN with test.py
flags.
Message-Id: <20200306135714.3380-1-kostja@scylladb.com>
2020-03-09 14:56:09 +02:00
Piotr Dulikowski
5f652e58c1 cdc: allow dropping manually created tables with cdc log suffix
The is_log_for_some_table function incorrectly assumed that
database::find_schema would return a null pointer in case the queried
schema does not exist. This patch fixes that, and now this function
checks for existence of the schema using database::has_schema.

Tests: unit(dev)
2020-03-09 12:17:13 +01:00
Asias He
6a7c3f0af0 repair: Stop the nodes that have run repair_row_level_start
It is ok to run repair_row_level_stop unconditionally. The node that
hasn't received the repair_row_level_start will simply return an error
that the repair_meta_id is not found. To avoid the unnecessary
repair_row_level_stop verb, we can stop the nodes that have run
repair_row_level_start. This also makes the error message less
confusing.

For example:

Before:

INFO 2020-03-09 15:55:43,369 [shard 0] repair - repair id 1 on shard 0
     failed: std::runtime_error (get_repair_meta: repair_meta_id 8 for
     node 127.0.0.4 does not exist)
INFO 2020-03-09 15:55:43,369 [shard 0] repair - repair id 1
     failed: std::runtime_error ({shard 0: std::runtime_error
     (get_repair_meta: repair_meta_id 8 for node 127.0.0.4 does not
     exist)})
WARN 2020-03-09 15:55:43,369 [shard 0] repair - repair id 1 to
     sync data for keyspace=ks, status=failed, keyspace does not exist
     any more, ignoring it: std::runtime_error ({shard 0:
     std::runtime_error (get_repair_meta: repair_meta_id 8 for node
     127.0.0.4 does not exist)})

After:

INFO 2020-03-09 16:09:09,217 [shard 0] repair - repair id 1 on shard 0 failed:
     std::runtime_error (Failed to repair for keyspace=ks, cf=cf,
     range=(9041860168177642466, 9044815446631222376])
INFO 2020-03-09 16:09:09,217 [shard 0] repair - repair id 1 failed:
     std::runtime_error ({shard 0: std::runtime_error (Failed to repair
     for keyspace=ks, cf=cf, range=(9041860168177642466,
     9044815446631222376])})
WARN 2020-03-09 16:09:09,217 [shard 0] repair - repair id 1 to sync data
     for keyspace=ks, status=failed, keyspace does not exist any more,
     ignoring it: std::runtime_error ({shard 0: std::runtime_error
     (Failed to repair for keyspace=ks, cf=cf,
     range=(9041860168177642466, 9044815446631222376])})

Refs #5942
2020-03-09 18:24:02 +08:00
Asias He
75cf255c67 repair: Ignore keyspace that is removed in sync_data_using_repair
When a keyspace is removed during node operations, we should not fail
the whole operation. Ignore the keyspace that is removed.

Fixes #5942
2020-03-09 18:24:02 +08:00
Pavel Emelyanov
0298a6270e storage_service: Forget cql_config
It needs the config purely to feed one into thrift/transport
server, since the latter two no longer need one, neither does
the former.

As a nice side effect -- some tests no longer have to carry
the cql_config on board.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-03-09 11:58:06 +03:00
Pavel Emelyanov
1af8ab80eb transport: Forget cql_config
The cql_server already works with query_processor from
which it can get the cql_config.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-03-09 11:57:30 +03:00
Pavel Emelyanov
d551f0323a thrift: Forget cql_config
The thrift handlers already mess with query_processor which
has the config in question.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-03-09 11:57:30 +03:00
Pavel Emelyanov
0a9a5a2dd7 query_processor: Carry reference on cql_config
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-03-09 11:57:28 +03:00
Pavel Emelyanov
7f2fc837cb config: Place timeout_config() into own .cc file
It's a generic helper that's used by transport, thrift and
redis (this guy has its own copy of the code).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200306114022.8070-1-xemul@scylladb.com>
2020-03-08 17:57:58 +02:00
Avi Kivity
de1b20ff7c Update seastar submodule
* seastar affc3a5107...5eaec672a2 (12):
  > test_thread_custom_stack_size_failure: Use a larger custom stack
  > test_thread_custom_stack_size: Use a larger custom stack
  > log: correct help message
  > perftune.py: verify NIC existence
  > Merge "Fix various memory issues in http" from Rafael
  > build: Fix IN_LIST usage
  > future: Disable -Wuninitialized on a particular memcpy
  > build: use IN_LIST for shorter cmake
  > build: check support of "-fstack-clash-protection" before using it
  > configure.py: Add "--verbose" flag
  > configure.py: Make "cmake" command line human-readable
  > net: dynamically adjust buffer sizes for posix connected_socket read operations
2020-03-08 17:34:16 +02:00
Benny Halevy
a89fb0abd9 main: log "Startup failed" message as error
To make it stand out and be detectable by dtests.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Acked-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200303160725.235959-1-bhalevy@scylladb.com>
2020-03-08 17:33:50 +02:00
Konstantin Osipov
ac6f64a885 locator: correctly select endpoints if RF=0
SimpleStrategy creates a list of endpoints by iterating over the set of
all configured endpoints for the given token, until we reach keyspace
replication factor.
There is a trivial coding bug when we first add at least one endpoint
to the list, and then compare list size and replication factor.
If RF=0 this never yields true.
Fix by moving the RF check before at least one endpoint is added to the
list.
Cassandra never had this bug since it uses a less fancy while()
loop.
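
The ordering problem can be illustrated with a simplified Python sketch. Both functions are hypothetical stand-ins for the C++ loop, and the ring is modelled as a plain list of endpoint names.

```python
def select_endpoints_buggy(ring_endpoints, replication_factor):
    """Adds an endpoint first and compares afterwards (the faulty order)."""
    selected = []
    for ep in ring_endpoints:
        if ep not in selected:
            selected.append(ep)
        # With RF=0 the size is already >= 1 here, so this is never true.
        if len(selected) == replication_factor:
            break
    return selected

def select_endpoints(ring_endpoints, replication_factor):
    """Compares against RF before adding, so RF=0 yields no endpoints."""
    selected = []
    for ep in ring_endpoints:
        if len(selected) == replication_factor:
            break
        if ep not in selected:
            selected.append(ep)
    return selected

ring = ["a", "b", "a", "c"]
print(select_endpoints_buggy(ring, 0))  # every distinct endpoint is selected
print(select_endpoints(ring, 0))        # nothing is selected
```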

Fixes #5962
Message-Id: <20200306193729.130266-1-kostja@scylladb.com>
2020-03-08 16:53:01 +02:00
Calle Wilund
0b34d88957 db::commitlog: Don't write trailing zero block unless needed
Fixes #5899

When terminating (closing) a segment, we write a trailing block
of zero so reader can have an empty region after last used chunk
as end marker. This is due to using recycled, pre-allocated
segments with potentially non-zero data extending over the point
where we are ending the segment (i.e. we are not fully filling
the segment due to a huge mutation or similar).

However, if we reach end of segment writing the final block
(typically many small mutations), the file will end naturally
after the data written, and any trailing zero block would in fact
just extend the file further. While this will only happen once per
segment recycled (independent of how many times it is recycled),
it is still both slightly breaking the disk usage contract and
also potentially causing some disk stalls due to metadata changes
(though of course very infrequent).

We should only write trailing zero if we are below the max_size
file size when terminating
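
A rough sketch of the size check, assuming a hypothetical helper and a fixed block size; capping the padding at the remaining space is an assumption made for the illustration, not something stated in the patch.

```python
def trailing_zero_length(position, max_size, block_size):
    """Bytes of zero padding to write when closing a segment at `position`."""
    if position >= max_size:
        # The file already ends right after the data; padding would only
        # extend it past max_size.
        return 0
    # Assumption: never pad beyond the segment's maximum size.
    return min(block_size, max_size - position)

print(trailing_zero_length(32, 32, 4))  # segment completely filled
print(trailing_zero_length(10, 32, 4))  # segment closed early
```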

Adds a small size check to commitlog test to verify size bounds.
(Which breaks without the patch)

Message-Id: <20200226121601.15347-2-calle@scylladb.com>
2020-03-08 16:51:53 +02:00
Konstantin Osipov
b4b08be0e1 test: add a test case for rare replication configurations
Introduce a test which checks how different CQL features (DML, LWT,
MV) work when no replicas are available (e.g. because
they are all in an unavailable data center).
Specifically the test checks that when we SELECT with IN clause
and there are no available replicas, there is no crash (#5935).

Message-Id: <20200306192521.73486-3-kostja@scylladb.com>
2020-03-08 15:11:08 +02:00
Konstantin Osipov
9827efe554 storage_proxy: do not touch all_replicas.front() if it's empty.
The list of all endpoints for a query can be empty if we have
replication_factor 0 or there are no live endpoints for this token.
Do not access all_replicas.front() in this case.

Fixes #5935.
Message-Id: <20200306192521.73486-2-kostja@scylladb.com>
2020-03-08 15:11:02 +02:00
Nadav Har'El
6febd4199e merge: cdc: on row delete, show the whole row as preimage
Merged pull request https://github.com/scylladb/scylla/pull/5980 by
Piotr Jastrzębski, based on https://github.com/scylladb/scylla/pull/5976
by Juliusz Stasiewicz:

"If base mutation has at least one row tombstone, its preimage log entry
displays all the base columns."

Fixes #5709

Tests: unit(dev)
2020-03-08 14:54:59 +02:00
Juliusz Stasiewicz
49f1a24472 tests/cdc: test preimage on row delete
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-03-08 13:27:49 +01:00
Juliusz Stasiewicz
68071d35ce cdc: on row delete display the entire row as preimage
If base mutation has at least one row tombstone, its preimage log
entry is constructed from all the base columns.

Fixes #5709

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-03-08 12:11:07 +01:00
Piotr Dulikowski
0e413efb48 cdc: correct static row preimage for case with no clustering row
In case a static and a clustering row is written at the same time, but
a clustering row with given key was not present, the preimage query was
incorrectly configured and no rows were returned. This resulted in an
empty preimage, while a preimage for static row should be present.

This patch fixes this and now the static row is correctly written to cdc
log in the case above.

Tests: unit(dev)
2020-03-08 09:25:45 +01:00
Piotr Sarna
395c7eeb98 Merge ' cdc: disallow creating nested cdc logs' from Piotr
This change disallows creating CDC log tables for already existing
CDC log tables. CDC logs nested in that way are not really useful
and do not work at the moment, therefore disallowing their creation
prevents confusion.

Fixes #5967
Tests: unit(dev)

* piodul/5967-disallow-nested-cdc-logs:
  cdc: disallow creating nested CDC logs
  cql_repl: register schema extensions
2020-03-08 09:22:59 +01:00
Juliusz Stasiewicz
e2b76fd559 cdc: move the extractor of pirow columns into separate method
Because it will be used more than once.
2020-03-06 17:54:42 +01:00
Piotr Sarna
be293523bd Merge 'Replace dht::global_partitioner() calls with...
... schema::get_partitioner and make schema::get_partitioner
return const&' from Piotr

Partitioners returned from get_partitioner are shared
and not supposed to be changed so let's use the type system
to enforce that.

dht::global_partitioner() is deprecated and will be removed
as soon as custom partitioners are implemented so it's best
to replace it with schema::get_partitioner.

Tests: unit(dev)

* hawk/global_partitioner_cleanup:
  schema: get_partitioner return const&
  compaction_manager: stop calling dht::global_partitioner()
  sstable_datafile_test: stop calling dht::global_partitioner()
2020-03-06 14:36:03 +01:00
Piotr Jastrzebski
54d24553bb schema: get_partitioner return const&
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-03-06 13:33:53 +01:00
Piotr Jastrzebski
22fac03184 compaction_manager: stop calling dht::global_partitioner()
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-03-06 13:33:53 +01:00
Piotr Jastrzebski
08ebf1f69d sstable_datafile_test: stop calling dht::global_partitioner()
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-03-06 13:33:53 +01:00
Piotr Jastrzebski
968177da04 cdc: store tokens in cdc description as longs
Previously the tokens were stored as strings
because token could have been represented in multiple ways.
Now token representation is always int64_t so we can
store them as ints in cdc description as well.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-03-06 11:59:59 +01:00
Piotr Dulikowski
f317283578 cdc: disallow creating nested CDC logs
This change disallows creating CDC log tables for already existing CDC
log tables. CDC logs nested in that way are not really useful and do not
work at the moment, therefore disallowing their creation prevents
confusion.
2020-03-06 10:47:13 +01:00
Piotr Dulikowski
75284eb2a5 cql_repl: register schema extensions
Alternator and CDC, apart from enabling their experimental features,
need to have their schema extensions registered. This patch adds missing
registration of schema extensions to cql_repl, so that cql tests written
with Alternator or CDC in mind will properly work.
2020-03-06 10:31:07 +01:00
Piotr Sarna
d1db198211 Merge ' Allow repeated LIKE on same column' from Dejan
Fixes #5902 by making the LIKE restriction keep a vector
of matchers and apply them all to the column value.

Tests: unit (dev)

* dekimir/multiple-likes:
  cql3: Allow repeated LIKE on same column
  cql3: Forbid calling LIKE::values()
  cql3: Move LIKE::_last_pattern to matcher
2020-03-06 09:55:54 +01:00
Piotr Sarna
22798f7b7b locator: fix validating replication factor
In order to properly validate not only network topology strategy,
but also other strategies, the checks are moved straight to
validate_replication_factor().
Also, the test case is extended with a too long integer
and a check for SimpleStrategy replication factor.

Fixes #3801
Tests: unit(dev)

Message-Id: <e0c3c3c36c589e1d440c9708a6dce820c111b8da.1583483602.git.sarna@scylladb.com>
2020-03-06 10:39:34 +02:00
Konstantin Osipov
848195125c test.py: check test xml output
Check that XML output of a test is valid and warn otherwise.

The following tests currently produce a warning:
boost/multishard_mutation_query_test

Message-Id: <20200305213501.52279-2-kostja@scylladb.com>
2020-03-06 10:05:28 +02:00
Piotr Sarna
6df132436f cql3: disallow range deletions for specific columns
Range deletions of specific columns are not well-defined
(range tombstones cover entire rows) and are forbidden
in Cassandra, so we follow suit.
This commit comes with a simple test.

Fixes #5728
Tests: unit(dev)
Message-Id: <896264f5f5790b9f96fcc18655ac3248a6abf37a.1583424131.git.sarna@scylladb.com>
2020-03-06 10:04:05 +02:00
Piotr Sarna
5b7a35e02b network_topology_strategy: validate integers
In order to prevent users from creating a network topology
strategy instance with invalid inputs, it's not enough to use
std::stol() on the input: a string "3abc" still returns the number '3',
but will later confuse cqlsh and other drivers, when they ask for
topology strategy details.
The error message is now more human readable, since for incorrect
numeric inputs it used to return a rather cryptic message:
    ServerError: stol()
This commit fixes the issue and comes with a simple test.
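
A minimal sketch of stricter parsing, assuming the check is done by
asking std::stol() how many characters it consumed. The helper name and
the error text are hypothetical, not the ones used in the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper, not the actual Scylla function.
// std::stol() stops at the first non-numeric character, so "3abc"
// parses as 3. Comparing the consumed length with the input length
// rejects such trailing garbage.
long parse_replication_factor(const std::string& s) {
    std::size_t pos = 0;
    long value = 0;
    try {
        value = std::stol(s, &pos);
    } catch (const std::exception&) {
        // Covers both std::invalid_argument ("abc") and
        // std::out_of_range (a too long integer).
        throw std::invalid_argument("Replication factor must be numeric; found " + s);
    }
    if (pos != s.size()) {
        throw std::invalid_argument("Replication factor must be numeric; found " + s);
    }
    return value;
}
```

Note that std::stol() skips leading whitespace, so this sketch still
accepts an input such as " 3".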

Fixes #3801
Tests: unit(dev)
Message-Id: <7aaae83d003738f047d28727430ca0a5cec6b9c6.1583478000.git.sarna@scylladb.com>
2020-03-06 09:50:33 +02:00
Piotr Sarna
30d2826358 Merge 'cdc: use cdc schema extension for storing...
... and reading cdc metadata' from Piotr

Currently, information on what cdc options are enabled
in a table - cdc metadata in short - is stored in two places:

    In cdc column of the system_schema.scylla_tables,
    In a cdc schema extension.

The former is used as a source of truth, i.e. a node reads cdc metadata
from that column, while the latter is used for cosmetic purposes
(e.g. cqlsh displays info on cdc based on this extension)
and is only written, but never read by the node.

Introducing the cdc column to scylla_tables made the logic
of schema agreement more complicated. As a first step of removing
this column, this PR makes the cdc schema extension the
"source of truth" - a node will from now on read cdc metadata
from that extension.

The cdc column will be deprecated and removed in subsequent releases,
but it is left for now and will still be written to in order not to break
the logic of schema agreement.

Acked-by: Nadav Har-El <nyh@scylladb.com>

Refs: #5737
Tests: unit(dev), 2-node cluster upgrade under write load to a cdc-enabled table

* piodul/5737-cdc-schema-extension:
  schema: get cdc options from schema extensions
  alter_table_statement: fix indentation
  cf_prop_defs: initialize schema extensions externally
  cf_prop_defs: move checking of cdc support to ::validate
  cf_prop_defs: pass database& to ::validate, not db::extensions&
  unit tests: register cdc extension before tests
  cdc: construct cdc_options directly inside cdc_extension
  db::extensions: add shorthands for add_schema_extension
2020-03-05 16:31:40 +01:00
Piotr Dulikowski
861c7b5626 schema: get cdc options from schema extensions
Removes logic responsible for setting cdc_options from dedicated column
in scylla_tables, and uses the "cdc" schema extension instead.
2020-03-05 16:11:21 +01:00
Piotr Dulikowski
e98766dd81 alter_table_statement: fix indentation 2020-03-05 16:11:21 +01:00
Piotr Dulikowski
828077be5e cf_prop_defs: initialize schema extensions externally
Moves initialization of schema extensions outside of cf_prop_defs. This
allows constructing these extensions once and using them several times in
cf_prop_defs' methods without caching or recalculating them.
2020-03-05 16:11:21 +01:00
Piotr Dulikowski
0bdc22e33b cf_prop_defs: move checking of cdc support to ::validate
Validation of CDC options fits better into the `validate` method rather
than `apply_to_builder`.
2020-03-05 16:11:21 +01:00
Piotr Dulikowski
260c47d758 cf_prop_defs: pass database& to ::validate, not db::extensions&
Changes cf_prop_defs::validate function to take database& as an argument
instead of db::extensions&. This change will allow us to move the check
which asserts that the cluster supports CDC from `apply_to_builder` to
`validate` method.
2020-03-05 16:11:21 +01:00
Piotr Dulikowski
38b7f1ad45 unit tests: register cdc extension before tests
In the following commits, using cdc in tests will require registering
cdc extension explicitly in db config.
2020-03-05 16:11:20 +01:00
Piotr Dulikowski
0f4f48ef76 cdc: construct cdc_options directly inside cdc_extension
Instead of storing a raw map of options inside `cdc_extension`, the
extension now converts them into `cdc_options` directly on construction.
This removes the need to construct `cdc_options` object multiple times.
2020-03-05 16:09:44 +01:00
Piotr Dulikowski
6895b0e395 db::extensions: add shorthands for add_schema_extension
This abstracts away a pattern used everywhere when adding a schema
extension.
2020-03-05 16:09:44 +01:00
Piotr Sarna
c35160457b Merge 'Clean up stream_id representation' from Piotr
With #5950 we changed the representation of stream_id
in CDC Log from two int columns to a single blob column.

This PR cleans up stream_id representation internally.
Now stream_id is stored as blob both in-memory and in
internal CDC tables.

Tests: unit(dev)

* hawk/stream_id_representation:
  cdc: store stream_ids as blobs in internal tables
  cdc: improve do_update_streams_description
  cdc: Fix generate_topology_description
  cdc: add stream_id::operator<
  cdc: change stream_id representation
2020-03-05 14:14:29 +01:00
Tomasz Grabiec
d5557023f6 Merge "Stop using BOOST_TEST_MESSAGE() in unit tests" from Kostja
Stop using BOOST_TEST_MESSAGE() in unit tests, it bloats test XML
output. Use Scylla logger instead.

Test: unit (debug, dev, release)
2020-03-05 13:27:30 +01:00
Calle Wilund
b48255a4cd db::commitlog: Only zero disk blocks not already allocated in segment
Fixes #5891
Refs #5899

When creating segments with the o_dsync option active, we write max_size
zeros to disk, to ensure actual disk blocks are allocated.

However, if we recycle a segment, we should, when not actually creating
a new file, check the existing size on disk, and only zero any blocks
not already allocated (i.e. if recycled file was smaller than max_size,
due to segment truncation on close).

test: unit
Message-Id: <20200226121601.15347-1-calle@scylladb.com>
2020-03-05 13:27:08 +01:00
Piotr Sarna
875d230298 Merge "CDC: use a single cdc$time value for a batch of changes"
from Kamil.

If a batch update is performed with a sequence of changes with a single
timestamp, they will now show up in CDC with a single timeuuid
in the cdc$time column, distinguished by different cdc$batch_seq_no values.

Fixes #5953.

Tests: unit(dev)

* haaawk/splitbatch:
  cdc: use a single timeuuid value for a batch of changes
  cdc: replace `split` with `for_each_change`
2020-03-05 13:17:34 +01:00
Pavel Emelyanov
7bc34c17eb range-streamer: Tune the progress message
Now it will show the full info about range being streamed, like

range_streamer - Rebuild with 127.0.0.2 for keyspace=ks2, streaming [72, 96) out of 248 ranges

The [x, y) range is semi-open one, the full streaming progress
then can be logged like

... streaming [0, 16) out of 36 ranges   <- first send
... streaming [16, 24) out of 36 ranges
... streaming [24, 36) out of 36 ranges  <- last send
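
The semi-open notation can be sketched as follows. The helper name is
hypothetical and, unlike the example above, it assumes a fixed batch
size; the real streamer may size its batches differently.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not the actual range_streamer code.
// Splits `total` ranges into batches of at most `batch` ranges and
// renders each batch as a semi-open interval [begin, end), so the end
// of one line is the beginning of the next.
std::vector<std::string> progress_lines(std::size_t total, std::size_t batch) {
    std::vector<std::string> lines;
    if (batch == 0) {
        return lines; // nothing sensible to report
    }
    for (std::size_t begin = 0; begin < total; begin += batch) {
        const std::size_t end = std::min(begin + batch, total);
        lines.push_back("streaming [" + std::to_string(begin) + ", " +
                        std::to_string(end) + ") out of " +
                        std::to_string(total) + " ranges");
    }
    return lines;
}
```

Because the upper bound is exclusive, the last line always ends at the
total number of ranges, which makes it easy to see when streaming is done.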

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200304101505.5506-1-xemul@scylladb.com>
2020-03-05 12:56:29 +01:00
Kamil Braun
3200d415da cdc: use a single timeuuid value for a batch of changes
If a batch update is performed with a sequence of changes with a single
timestamp, they will now show up in CDC with a single timeuuid in the
`time` column, distinguished by different `batch_seq_no` values.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-03-05 12:32:57 +01:00
Konstantin Osipov
94ee511f6a lwt: implement cas_failed_read_round_optimization metric
Presently lightweight transactions piggyback the old
row value on prepare round response. If one of the participants
did not provide the old value or the values from peers don't match,
we perform a full read round which will repair the Paxos table and the
base table, if necessary, at all participants.

Capture the fact that read optimization has failed in a metric.
Message-Id: <20200304192955.84208-2-kostja@scylladb.com>
2020-03-05 12:20:45 +01:00
Kamil Braun
292eba9da0 cdc: replace split with for_each_change
`for_each_change` is like `split` but it doesn't return a vector of
mutations representing each change; instead, it takes as a parameter
a function which gets called on each mutation.

This reduces memory usage and allows preserving common context
when handling each change (will be useful in next commits).
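
A minimal sketch of the callback style, assuming a change can be modeled
as a plain string. The types here are hypothetical stand-ins and not the
real mutation classes.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a single change extracted from a mutation.
using change = std::string;

// Callback style: each change is handed to `f` as it is produced, so no
// intermediate vector of changes is built, and `f` can capture state
// that is shared across all changes of one batch.
template <typename Func>
void for_each_change(const std::vector<change>& batch, Func&& f) {
    for (const auto& c : batch) {
        f(c);
    }
}
```

A caller can, for example, capture a sequence counter in the lambda and
increment it on every call, which is the kind of common context a
returned vector would not carry.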

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-03-05 12:05:08 +01:00
Pekka Enberg
0beb45faf3 build: Use reloc dynamic linker unconditionally
The relocatable package requires a magic dynamic linker path for
"patchelf" to work correctly. Therefore, use the "get-dynamic-linker.sh"
script to unconditionally define a magic dynamic linker path to ensure
that building the relocatable package with ninja build ("ninja-build
build/<mode>/scylla-package.tar.gz") is always correct. Although the
path looks odd with a lot of leading slashes, it works outside
relocatable package too.
Message-Id: <20200305091919.6315-2-penberg@scylladb.com>
2020-03-05 12:53:28 +02:00
Pekka Enberg
8a810cc41a reloc: Move dynamic linker magic to get-dynamic-linker.sh
In preparation for moving dynamic linker flags to ninja build, move the
magic dynamic linker path generation to "reloc/get-dynamic-linker.sh"
script that configure.py can call.
Message-Id: <20200305084331.5339-1-penberg@scylladb.com>
2020-03-05 12:53:22 +02:00
Konstantin Osipov
ac0717fb64 test: consistently use a global testlog object in all tests
Use test/lib/log.hh in all tests now that we have it.
2020-03-05 13:34:24 +03:00
Piotr Jastrzebski
57cfe6d0e1 cdc: store stream_ids as blobs in internal tables
In new CDC Log format stream_id is represented by a single
blob column so it makes sense to store it in the same form
everywhere - including internal CDC tables.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-03-05 11:31:22 +01:00
Piotr Jastrzebski
b2acdc9307 cdc: improve do_update_streams_description
Use std::set::insert that takes range instead of
looping through elements and adding them one by one.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-03-05 11:31:22 +01:00
Piotr Jastrzebski
446722d6ed cdc: Fix generate_topology_description
In new CDC Log format we store only a single stream_id column.
This means generate_topology_description has to use appropriate
schema for generating tokens for stream_ids.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-03-05 11:31:22 +01:00
Piotr Jastrzebski
9a212dcaef cdc: add stream_id::operator<
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-03-05 11:31:21 +01:00
Piotr Jastrzebski
f317a659d9 cdc: change stream_id representation
New CDC Log format stores stream ids as blobs.
It makes sense to keep them internally in the same form.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-03-05 11:30:10 +01:00
Piotr Sarna
f21bd57058 Merge "cdc: log static rows correctly" from Piotr
Currently, writes to a static row in a base table are not reflected
at all in the corresponding cdc log. This patch causes such writes
to be properly logged.

Fixes: #5744
Tests: unit(dev)

* piodul/5744-handle-static-row-correctly-in-cdc:
  cdc_test: add tests for handling static row
  cdc: fix indentation in transformer::transform
  cdc: handle static rows separately in transformer::transform
  cdc: move process_cells higher (and fix captured variables)
  cdc: reduce dependencies on captured variables in process_cells
  cdc: fix preimage query for static rows
2020-03-05 10:42:15 +01:00
Nadav Har'El
96ca5ac2c8 alternator: use separate smp_service_group for bouncing requests
Until this patch, we used the default_smp_service_group() when bouncing
Alternator requests between shards (which is needed for LWT).

This patch creates a new smp_service_group for this purpose, which is
limited to 5000 concurrent requests (the same limit used for CQL's
bounce_request_smp_service_group). The purpose of this limit is to avoid
many shards admitting a huge number of requests and bouncing all of them
to the same shard, which then cannot "unadmit" these requests.

Fixes #5664.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200304170825.27226-1-nyh@scylladb.com>
2020-03-05 10:17:51 +01:00
Konstantin Osipov
ff3f9cb7cf test: stop using BOOST_TEST_MESSAGE() for logging
We use boost test logging primarily to generate nice XML xunit
files used in Jenkins. These XML files can be bloated
with messages from BOOST_TEST_MESSAGE(), hundreds of megabytes
of build archives, on every build.

Let's use seastar logger for test logging instead, reserving
the use of boost log facilities for boost test markup information.
2020-03-05 11:38:11 +03:00
Juliusz Stasiewicz
c8527f20b0 CDC+LWT: fix missing CDC entries for successful LWTs
Now, if CDC is enabled, `paxos_response_handler::learn_decision()`
augments the base table mutation. The differences in logic between:
(1) `mutate_internal<std::vector<mutation>>()`
and
(2) `mutate_internal<std::vector<std::tuple<paxos::proposal, schema_ptr, ...>>>()`
make it necessary to separate "CDC mutations" from "base mutation"
and send them, respectively, to (1) and (2).

Gleb explained in #5869 why it became necessary to add CDC code to LWT
writes specifically, instead of doing it somewhere central that affects
all writes:

"All paths that do write goes through mutate_internally() eventually so it
would have been best to do augmentations there, but cdc chose to log only
certain writes and not others (unlike MV that does not care how write
happened) and mutate_internal have no idea which is which so I do not have
other choice but code duplication. ... paxos_response_handler::learn_decision
is probably the place to add cdc augmentation."

Fixes #5869
2020-03-05 09:49:19 +02:00
Piotr Dulikowski
204e204586 cdc: do not attempt to log empty mutations
It is possible to produce an empty mutation using CQL. For example, the
following query:

DELETE FROM ks.tbl WHERE pk = 0 AND ck < 1 AND ck > 2;

will attempt to delete from an empty range of rows. This is translated
to the following mutation:

{ks.tbl {key: pk{000400000000}, token:-3485513579396041028}
 {mutation_partition:
  static: cont=1 {row: },
  clustered: {}}}

Such a mutation does not contain any timestamp, therefore it is difficult
to determine what timestamp was used while making the query. This is
problematic for CDC, because an entry in CDC log should be written with
the same timestamp as a part of the mutation.

Because an empty mutation does not modify the table in any way, we can
safely skip logging such mutations in CDC and still preserve the
ability to reconstruct the current state of the base table from full
CDC log.

Tests: unit(dev)
2020-03-05 08:32:54 +01:00
Piotr Dulikowski
e6751fad62 cdc_test: add tests for handling static row 2020-03-05 00:16:17 +01:00
Piotr Dulikowski
39519ce923 cdc: fix indentation in transformer::transform 2020-03-05 00:16:17 +01:00
Piotr Dulikowski
0d05b17881 cdc: handle static rows separately in transformer::transform
Before this patch, `transform` did not generate any log rows about
static row change. This commit fixes that - now, a log row is created if
a static row is changed, and this row is separate from the rows that
describe changes to the clustering rows.
2020-03-05 00:16:17 +01:00
Piotr Dulikowski
6a0b0b5786 cdc: move process_cells higher (and fix captured variables)
The `process_cells` lambda is moved outside the loop, because it will be
used by other code in subsequent commits.
2020-03-05 00:15:57 +01:00
Piotr Dulikowski
f136f6e02c cdc: reduce dependencies on captured variables in process_cells
This is a preparation for moving the lambda outside the for loop.

- `log_ck`, `pikey`, `pirow` are now passed as arguments,
- `value` is now a variable local to the lambda,
- `ttl` is now a variable local to the lambda that is returned.
2020-03-05 00:14:05 +01:00
Piotr Dulikowski
a7f51449c3 cdc: fix preimage query for static rows
For static rows, we need to fetch at least one row from its partition in
order to compute its preimage.
2020-03-04 18:43:55 +01:00
Botond Dénes
8b908a9aba test: lib/mutation_source_test: log the name of the test-method
Most test-methods log a message with their names upon entering them.
This helps identify, from the logs, the test-method in which a failure
happened. Two methods were missing this log line, so add it.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200304155235.46170-1-bdenes@scylladb.com>
2020-03-04 18:16:21 +02:00
Pekka Enberg
7fde2e28da dist/redhat: Specify files once in scylla.spec file
Silences the following warnings when building an RPM:

  warning: File listed twice: /opt/scylladb/scripts/libexec/hex2list.py
  warning: File listed twice: /opt/scylladb/scripts/libexec/node_exporter_install
  warning: File listed twice: /opt/scylladb/scripts/libexec/perftune.py
  warning: File listed twice: /opt/scylladb/scripts/libexec/scylla-blocktune
  warning: File listed twice: /opt/scylladb/scripts/libexec/scylla-housekeeping
  warning: File listed twice: /opt/scylladb/scripts/libexec/scylla_bootparam_setup
  warning: File listed twice: /opt/scylladb/scripts/libexec/scylla_config_get.py
  warning: File listed twice: /opt/scylladb/scripts/libexec/scylla_coredump_setup
  warning: File listed twice: /opt/scylladb/scripts/libexec/scylla_cpuscaling_setup
  warning: File listed twice: /opt/scylladb/scripts/libexec/scylla_cpuset_setup
  warning: File listed twice: /opt/scylladb/scripts/libexec/scylla_dev_mode_setup
  warning: File listed twice: /opt/scylladb/scripts/libexec/scylla_ec2_check
  warning: File listed twice: /opt/scylladb/scripts/libexec/scylla_fstrim
  warning: File listed twice: /opt/scylladb/scripts/libexec/scylla_fstrim_setup
  warning: File listed twice: /opt/scylladb/scripts/libexec/scylla_io_setup
  warning: File listed twice: /opt/scylladb/scripts/libexec/scylla_kernel_check
  warning: File listed twice: /opt/scylladb/scripts/libexec/scylla_ntp_setup
  warning: File listed twice: /opt/scylladb/scripts/libexec/scylla_prepare
  warning: File listed twice: /opt/scylladb/scripts/libexec/scylla_raid_setup
  warning: File listed twice: /opt/scylladb/scripts/libexec/scylla_selinux_setup
  warning: File listed twice: /opt/scylladb/scripts/libexec/scylla_setup
  warning: File listed twice: /opt/scylladb/scripts/libexec/scylla_stop
  warning: File listed twice: /opt/scylladb/scripts/libexec/scylla_sysconfig_setup
  warning: File listed twice: /opt/scylladb/scripts/libexec/seastar-addr2line
  warning: File listed twice: /opt/scylladb/share/doc/scylla/licenses
  warning: File listed twice: /opt/scylladb/share/doc/scylla/licenses/LICENSE-crc32-vpmsum.TXT
  warning: File listed twice: /opt/scylladb/share/doc/scylla/licenses/README.md
  warning: File listed twice: /opt/scylladb/share/doc/scylla/licenses/apache-license-2.0.txt
  warning: File listed twice: /opt/scylladb/share/doc/scylla/licenses/boost-license-1.0.txt
  warning: File listed twice: /opt/scylladb/share/doc/scylla/licenses/date-license.txt
  warning: File listed twice: /opt/scylladb/share/doc/scylla/licenses/git-archive-all-license.txt
  warning: File listed twice: /opt/scylladb/share/doc/scylla/licenses/libdeflate-license.txt
  warning: File listed twice: /opt/scylladb/share/doc/scylla/licenses/xxhash-license.txt
  warning: File listed twice: /opt/scylladb/share/doc/scylla/licenses/zstd-license.txt

I verified that the files are in the generated RPMs after the change:

  [penberg@nero scylla]$ rpm -ql build/dist/dev/redhat/RPMS/x86_64/scylla-server-666.development-0.20200304.2bc700b008.x86_64.rpm | grep scripts.*libexec
  /opt/scylladb/scripts/libexec
  /opt/scylladb/scripts/libexec/hex2list.py
  /opt/scylladb/scripts/libexec/node_exporter_install
  /opt/scylladb/scripts/libexec/perftune.py
  /opt/scylladb/scripts/libexec/scylla-blocktune
  /opt/scylladb/scripts/libexec/scylla-housekeeping
  /opt/scylladb/scripts/libexec/scylla_bootparam_setup
  /opt/scylladb/scripts/libexec/scylla_config_get.py
  /opt/scylladb/scripts/libexec/scylla_coredump_setup
  /opt/scylladb/scripts/libexec/scylla_cpuscaling_setup
  /opt/scylladb/scripts/libexec/scylla_cpuset_setup
  /opt/scylladb/scripts/libexec/scylla_dev_mode_setup
  /opt/scylladb/scripts/libexec/scylla_ec2_check
  /opt/scylladb/scripts/libexec/scylla_fstrim
  /opt/scylladb/scripts/libexec/scylla_fstrim_setup
  /opt/scylladb/scripts/libexec/scylla_io_setup
  /opt/scylladb/scripts/libexec/scylla_kernel_check
  /opt/scylladb/scripts/libexec/scylla_ntp_setup
  /opt/scylladb/scripts/libexec/scylla_prepare
  /opt/scylladb/scripts/libexec/scylla_raid_setup
  /opt/scylladb/scripts/libexec/scylla_selinux_setup
  /opt/scylladb/scripts/libexec/scylla_setup
  /opt/scylladb/scripts/libexec/scylla_stop
  /opt/scylladb/scripts/libexec/scylla_sysconfig_setup
  /opt/scylladb/scripts/libexec/seastar-addr2line
  [penberg@nero scylla]$ rpm -ql build/dist/dev/redhat/RPMS/x86_64/scylla-server-666.development-0.20200304.2bc700b008.x86_64.rpm | grep license
  /opt/scylladb/share/doc/scylla/licenses
  /opt/scylladb/share/doc/scylla/licenses/LICENSE-crc32-vpmsum.TXT
  /opt/scylladb/share/doc/scylla/licenses/README.md
  /opt/scylladb/share/doc/scylla/licenses/apache-license-2.0.txt
  /opt/scylladb/share/doc/scylla/licenses/boost-license-1.0.txt
  /opt/scylladb/share/doc/scylla/licenses/date-license.txt
  /opt/scylladb/share/doc/scylla/licenses/git-archive-all-license.txt
  /opt/scylladb/share/doc/scylla/licenses/libdeflate-license.txt
  /opt/scylladb/share/doc/scylla/licenses/xxhash-license.txt
  /opt/scylladb/share/doc/scylla/licenses/zstd-license.txt

Message-Id: <20200304150057.2621-1-penberg@scylladb.com>
2020-03-04 17:25:53 +02:00
Tomasz Grabiec
da4bd3d2e6 Merge "Clean cql3 usage of storage_proxy and _service" from Pavel E.
This set removes _all_ mentions of storage_service and _all_ calls
for global storage_proxy instances from cql3/ code.

Tests: unit(dev)
2020-03-04 15:20:24 +01:00
Raphael S. Carvalho
3ba3ee2a7b distributed_loader: trigger regular compaction on resharding completion
Regular compaction relies on compaction manager to run compaction jobs
until compaction strategy is satisfied. Resharding, on the other hand,
is a one-off operation which runs only once in compaction manager,
and leaves the sstable set in such a way that the strategy is very
likely unsatisfied. We need to trigger regular compaction whenever
a resharding job replaces a shared sstable by an unshared sstable,
so that compaction will not fall way behind due to lots of new sstables
created by resharding process.

Fixes #5262.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200217144946.20338-1-raphaelsc@scylladb.com>
2020-03-04 16:08:13 +02:00
Nadav Har'El
f67a402c48 merge: Remove treewide dependency on boost/multiprecision
Merged patch series from Avi Kivity:

boost/multiprecision is a heavyweight library, pulling in 20,000 lines of code into
each header that depends on it. It is used by converting_mutation_partition_applier
and types.hh. While the former is easy to put out-of-line, the latter is not.

All we really need is to forward-declare boost::multiprecision::cpp_int, but that
is not easy - it is a template taking several parameters, among which are non-type
template parameters also defined in that header. So it's quite difficult to
disentangle, and fragile wrt boost changes.

This patchset introduces a wrapper type utils::multiprecision_int which _can_
be forward declared, and together with a few other small fixes, manages to
uninclude boost/multiprecision from most of the source files. The total reduction
in number of lines compiled over a full build is 324 * 23,227 or around 7.5
million.

Tests: unit (dev)
Ref #1

https://github.com/avikivity/scylla uninclude-boost-multiprecision/v1

Avi Kivity (5):
  converting_mutation_partition_applier: move to .cc file
  utils: introduce multiprecision_int
  tests: cdc_test: explicitly convert from cdc::operation to uint8_t
  treewide: use utils::multiprecision_int for varint implementation
  types: forward-declare multiprecision_int

 configure.py                             |   2 +
 concrete_types.hh                        |   2 +-
 converting_mutation_partition_applier.hh | 163 ++-------------
 types.hh                                 |  12 +-
 utils/big_decimal.hh                     |   3 +-
 utils/multiprecision_int.hh              | 256 +++++++++++++++++++++++
 converting_mutation_partition_applier.cc | 188 +++++++++++++++++
 cql3/functions/aggregate_fcts.cc         |  10 +-
 cql3/functions/castas_fcts.cc            |  28 +--
 cql3/type_json.cc                        |   2 +-
 lua.cc                                   |  38 ++--
 mutation_partition_view.cc               |   2 +
 test/boost/cdc_test.cc                   |   6 +-
 test/boost/cql_query_test.cc             |  16 +-
 test/boost/json_cql_query_test.cc        |  12 +-
 test/boost/types_test.cc                 |  58 ++---
 test/boost/user_function_test.cc         |   2 +-
 test/lib/random_schema.cc                |  14 +-
 types.cc                                 |  20 +-
 utils/big_decimal.cc                     |   4 +-
 utils/multiprecision_int.cc              |  37 ++++
 21 files changed, 627 insertions(+), 248 deletions(-)
 create mode 100644 utils/multiprecision_int.hh
 create mode 100644 converting_mutation_partition_applier.cc
 create mode 100644 utils/multiprecision_int.cc
2020-03-04 15:13:42 +02:00
Avi Kivity
5dee627f73 types: forward-declare multiprecision_int
This reduces the number of translation units that depend on
boost/multiprecision from 354 to 30, and reduces the size of
database.i (as an example) from 406160 to 382933 (smaller
files will benefit more, relatively).

Ref #1
2020-03-04 13:28:16 +02:00
Avi Kivity
3c772757c0 treewide: use utils::multiprecision_int for varint implementation
The goal is to forward-declare utils::multiprecision_int, something
beyond my capabilities for boost::multiprecision::cpp_int, to reduce
compile time bloat.

The patch is mostly search-and-replace, with a few casts added to
disambiguate conversions the compiler had trouble with.
2020-03-04 13:28:16 +02:00
Avi Kivity
874f65c58c tests: cdc_test: explicitly convert from cdc::operation to uint8_t
After the varint data type starts using the new multiprecision_int type,
this code fails to compile. I expect that somehow the conversion from enum
class to cpp_int was allowed to succeed, and we ended up with a data_value
of type varint. The tests succeeded because the serialized representation
happened to be the same.
2020-03-04 13:28:16 +02:00
Piotr Jastrzebski
354e3c34c8 cdc log: merge stream_id columns into a single column
Previously we had stream_id_1 and stream_id_2 columns
of type long each. They were forming a partition key.

In a new format we want a single stream_id column that
forms a partition key. To be able to still store two
longs, the new column will have type blob and its value
will be concatenated bytes of two longs that
partition key is composed of.

We still want partition key to logically be two longs
because those two values will be used by a custom partitioner
later once we implement it.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-03-04 13:27:48 +02:00
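The packing described in the commit above (two longs concatenated into one blob that can still be split back into two longs) could look roughly like the sketch below. The helper names and the big-endian byte order are assumptions made for illustration; the commit message does not state the byte order and this is not the code from the patch.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <utility>

// 16 bytes: the concatenated bytes of two 64-bit values.
using stream_id_blob = std::array<std::uint8_t, 16>;

// Writes v as 8 bytes, most significant byte first (assumed byte order).
inline void put_be64(std::uint8_t* out, std::int64_t v) {
    const auto u = static_cast<std::uint64_t>(v);
    for (int i = 0; i < 8; ++i) {
        out[i] = static_cast<std::uint8_t>(u >> (56 - 8 * i));
    }
}

// Reads 8 bytes, most significant byte first, back into a 64-bit value.
inline std::int64_t get_be64(const std::uint8_t* in) {
    std::uint64_t u = 0;
    for (int i = 0; i < 8; ++i) {
        u = (u << 8) | in[i];
    }
    return static_cast<std::int64_t>(u);
}

// Hypothetical helper: builds the single-column value from the two longs.
inline stream_id_blob make_stream_id(std::int64_t first, std::int64_t second) {
    stream_id_blob blob{};
    put_be64(blob.data(), first);
    put_be64(blob.data() + 8, second);
    return blob;
}

// Hypothetical helper: recovers the two logical longs from the blob.
inline std::pair<std::int64_t, std::int64_t> split_stream_id(const stream_id_blob& blob) {
    return {get_be64(blob.data()), get_be64(blob.data() + 8)};
}
```

Keeping a cheap split operation is what lets the key remain "logically two longs" for a later custom partitioner while the schema exposes only one column.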
Avi Kivity
7434c81a29 utils: introduce multiprecision_int
multiprecision_int is a wrapper around boost::multiprecision::cpp_int that adds
no functionality. The intent is to allow forward declaration; cpp_int is so
complicated that just finding out what its true type is is a difficult exercise, as
it depends on many internal declarations.

Because cpp_int uses expression templates, the implementation has to explicitly
cast to the desired type in many places, otherwise the C++ compiler is presented
with too many choices, especially in conjunction with data_value (which can
convert from many different types too).
2020-03-04 12:42:57 +02:00
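A minimal sketch of the wrapper idea from the commit above: a plain class can be forward-declared in widely included headers, while the complicated wrapped type is visible only where the full definition is needed. The `detail::heavy_int` stand-in, `as_long()` and `to_string()` are hypothetical; the real wrapper holds boost::multiprecision::cpp_int and its interface is not reproduced here.

```cpp
#include <cassert>
#include <string>

// --- What a widely included header would contain ---
// Only a forward declaration: no heavy template machinery is pulled in.
namespace utils { class multiprecision_int; }

// Functions can be declared in terms of the incomplete type.
std::string to_string(const utils::multiprecision_int& v);

// --- What the few translation units needing the full type would see ---
namespace detail {
// Stand-in for a complicated library type (hypothetical placeholder).
template <typename Limb, unsigned MinBits>
struct heavy_int {
    Limb value;
};
using wrapped_int = heavy_int<long long, 64>;
}

namespace utils {
// A named class, unlike an alias to a template instantiation,
// is trivial to forward-declare.
class multiprecision_int {
    detail::wrapped_int _v;
public:
    explicit multiprecision_int(long long v) : _v{v} {}
    long long as_long() const { return _v.value; }
    friend multiprecision_int operator+(const multiprecision_int& a, const multiprecision_int& b) {
        return multiprecision_int(a._v.value + b._v.value);
    }
};
}

std::string to_string(const utils::multiprecision_int& v) {
    return std::to_string(v.as_long());
}
```

The wrapper adds no functionality of its own; its value is that headers which only pass the type by reference no longer need the definition of the wrapped type.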
Avi Kivity
414ec8c68e converting_mutation_partition_applier: move to .cc file
converting_mutation_partition_applier is a heavyweight class that is not
used in the hot path, so it can be safely out-of-lined. This moves
some includes to boost/multiprecision out of header files, where they
can infect a lot of code.

mutation_partition_view.cc's includes were adjusted to recover
missing dependencies.
2020-03-04 12:42:57 +02:00
Pavel Emelyanov
35b0e6dd7f repair_writer: Use db from repair_meta (2nd try)
The previous version erroneously used a local db reference
which was propagated into another shard. This time carry
the sharded instance and use .local() as before.

tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200303221729.31261-1-xemul@scylladb.com>
2020-03-04 11:31:52 +01:00
Tomasz Grabiec
477dadc062 Merge "cql_test_env: Drop a few shared_ptr<sharded<...>>" from Rafael
I found that a few variables in cql_test_env were wrapping sharded in
shared_ptr for no apparent reason. These patches convert them to plain
sharded<...>.
2020-03-04 11:31:52 +01:00
Yaron Kaikov
de19496ff7 dist/docker: Add VERSION argument to Dockerfile (#5845)
Currently, the Dockerfile installs the latest version of Scylla. Let's
add a VERSION argument to Dockerfile, which explicitly specifies the
version to ensure scripts, for example, always build the expected
version. If no VERSION is specified for "docker build", use the default
value of "666.development", which is the version number for latest
nightly.
2020-03-04 12:20:24 +02:00
Pekka Enberg
e76b5bdf7b Merge 'Cleanup test.py output' from Kostja
"These two patches were suspected of failing next promotion and
 excluded from the original series."

* 'test.py.log' of https://github.com/kostja/scylla:
  test.py: remove log output on success unless -s is specified
  test.py: do not store entire log output in junit report.
2020-03-04 11:58:46 +02:00
Eliran Sinvani
99cedf737c docker: rsyslog configuration fixes
The introduction of rsyslog had two errors in it.
Both errors are non fatal and the docker still works,
however, the system is left in a wrong state in which
supervisord marks rsyslogd service as failed (after several
failed retry attempts). Another bug in the configuration
causes rsyslog to output an error.

1) An inclusion command from a newer version was used
in rsyslog's main configuration file. This caused rsyslog
to complain during startup but it didn't do much damage since
rsyslog converts every unrecognised command to a message command.
2) In the supervisord definition of the service, rsyslogd is run
without the -n option, which means it defaults to automatically
switching to the background. Supervisord interprets this as an unexpected
process termination and retries starting the process (unsuccessfully,
because rsyslog protects itself from having multiple processes of
itself) and eventually marks it as down although it is fully up and
running.
This commit fixes both configuration problems.

Tests: Build and run docker and validate the errors are gone.
Fixes #5937
2020-03-04 11:56:30 +02:00
Pekka Enberg
325c3e13eb build: Switch to SHA1 build IDs
Currently, you have to build the relocatable package tarball with
./reloc/build_reloc.sh to be able to build an RPM out of it. You need to
do this because RPMs require SHA1 build-ids, but the build system does
not enforce that.

To prepare for adding RPM target to the ninja build, let's switch to
SHA1 build ID conditionally, because the performance difference between
xxhash and SHA1 is negligible. Rafael Avila de Espindola writes:

  [...] the sha1 implementation in current lld is pretty fast. Linking
  release scylla the times I get are

  lld in fedora
    fast 2.83739
    sha1 3.51990

  current lld
    fast 2.6936
    sha1 2.90250

  And the sha1 implementation might get even faster:

  https://bugs.llvm.org/show_bug.cgi?id=44138.
Message-Id: <20200303131806.22422-1-penberg@scylladb.com>
2020-03-04 11:00:43 +02:00
Tomasz Grabiec
82b76163e3 utils/small_vector: Add missing include
Needed for std::uninitialized_move() et al

Message-Id: <20200303191148.11716-1-tgrabiec@scylladb.com>
2020-03-03 21:23:40 +02:00
Tomasz Grabiec
5dfefc0a85 Revert "repair_writer: Use db from repair_meta"
This reverts commit c6ddd21c50.

Uses database& instance across shards, which causes repair writer to
use the table object from the wrong shard.

Fixes #5907
2020-03-03 19:50:53 +01:00
Avi Kivity
906784639d Merge "Clean sstables from using global objects" from Pavel E
"
This set cleans sstable_writer_config and surrounding sstables
code from using global storage_ and feature_ service-s and database
by moving the configuration logic onto sstables_manager (that
was supposed to do it since eebc3701a5).

Most of the complexity is hidden around sstable_writer_config
creation, this set makes the sstables_manager create this object
with an explicit call. All the rest are consequences of this change.

Tests: unit(debug), manual start-stop
"

* 'br-clean-sstables-manager-2' of https://github.com/xemul/scylla:
  sstables: Move get_highest_supported_format
  sstables: Remove global get_config() helper
  sstables: Use manager's config() in .new_sstable_component_file()
  sstable_writer_config: Extend with more db::config stuff
  sstables_manager: Don't use global helper to generate writer config
  sstable_writer_config: Sanitize out some features fields initialization
  sstable_writer_config: Factor out some field initialization
  sstables: Generate writer config via manager only
  sstables: Keep reference on manager
  test: Re-use existing global sstables_manager
  table: Pass sstable_writer_config into write_memtable_to_sstable
2020-03-03 18:33:01 +02:00
Nadav Har'El
750fe9585a alternator: change rjson::get() to take std::string_view
Change rjson::get() to take std::string_view, instead of RapidJson's
version of that type, "StringRef". We already did the same change for
rjson::find() in a previous patch.

Not only is std::string_view more convenient for potential callers in Scylla,
this change also avoids a bug in FindMember() on StringRef where the length
is ignored (and instead, null-termination of the string is assumed).

This patch doesn't require any changes to callers, because we actually
had just a handful of remaining callers (most call sites switched to
rjson::find()), and all of them used string constants which could be
implicitly converted to StringRef or std::string_view just the same.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200303161019.1456-1-nyh@scylladb.com>
2020-03-03 17:13:40 +01:00
Nadav Har'El
91d9632909 alternator: add rjson::remove_member() convenience function
This patch adds a rjson::remove_member() wrapper to the RemoveMember
method, which takes a std::string_view. But beyond the convenience, this
actually works around a subtle bug in RemoveMember where, if given a
StringRef parameter, ignores its length (see upstream issue
https://github.com/Tencent/rapidjson/issues/1649).

In the one place we used RemoveMember, it forced us to copy the string
because it wasn't null-terminated. The solution proposed here involves
wrapping the string view in a GenericValue - which no longer needs to copy
the string, but still works around the bug.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200303143524.28300-1-nyh@scylladb.com>
2020-03-03 16:35:41 +01:00
Nadav Har'El
0fcb226412 alternator: switch rjson::find() to use std::string_view
Our rjson::find() convenience function used RapidJson's "StringRef" type,
which is almost exactly like std::string_view. If we switch to use
string_view as we do in this patch, a lot of call sites become much simpler.

Moreover, there was an even more important motivation for this patch:
the RapidJson FindMember() function we used in rjson::find() has a bug when
given a StringRef - although a StringRef contains a length, the FindMember()
code ignores it and expects the string to be null-terminated (see:
https://github.com/Tencent/rapidjson/issues/1649). In this patch, we wrap
the pointer and length of a std::string_view in an rjson::value, a code path
which bypasses the FindMember bug, and yet does not require copying the
string.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200303141814.26929-1-nyh@scylladb.com>
2020-03-03 16:35:41 +01:00
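The bug class described in the three alternator commits above (a lookup that ignores the length of a string view and assumes null termination) can be illustrated without RapidJSON. The sketch below uses only the standard library; `object`, `find_member` and `find_member_assuming_nul` are hypothetical names and do not mirror the rjson API.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <string_view>

// std::less<> enables lookup by std::string_view without building a std::string.
using object = std::map<std::string, int, std::less<>>;

// Length-aware lookup: compares exactly key.size() characters, no copy.
inline const int* find_member(const object& obj, std::string_view key) {
    auto it = obj.find(key);
    return it == obj.end() ? nullptr : &it->second;
}

// Illustration of the bug class: the length is ignored and the key is read
// up to the next '\0'. Only safe to call when the underlying buffer is
// null-terminated, as with the string literal used in the usage note.
inline const int* find_member_assuming_nul(const object& obj, std::string_view key) {
    auto it = obj.find(std::string(key.data()));
    return it == obj.end() ? nullptr : &it->second;
}
```

With an object holding both `"name"` and `"name_extra"`, and a key that is the first four characters of the literal `"name_extra"`, the length-aware lookup returns the `"name"` entry while the null-termination variant silently returns the `"name_extra"` entry.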
Nadav Har'El
2ea0b9d226 Merge branch 'split-mutations' of github.com:haaawk/scylla into next
Merged pull request https://github.com/scylladb/scylla/pull/5940 from
Kamil Braun:

Add a bunch of new structs describing a change made
to a table, and an extract_changes function which takes a mutation and
returns the set of changes contained in this mutation, separated by
timestamp and ttl.

Add a split function which uses extract_changes to split a mutation into separate mutations, each describing a single change.

Static rows are put into separate changes now.

The pre_image_select function was fixed to select pre_image data always when
there is a static row/clustered row change, even if there were e.g. additional
range tombstones.

Fixes: #5719.

Tests: unit(dev)
2020-03-03 17:27:21 +02:00
Botond Dénes
103bf50e18 storage_proxy: add timeouts to smp calls on the write path
When a node is overloaded requests usually start to queue up. Timeouts
are supposed to prevent queues from exploding and causing an OOM. One
prominent queue that tends to explode is the smp queue as it didn't
support timeouts and so requests would sit in the queue until the target
shard would process them. If the target shard is heavily overloaded
requests might accumulate faster than they are processed, surely leading
to an OOM.

To prevent this, use the timeout recently introduced to
`seastar::smp::submit_to()` and derived APIs to time out write requests
sitting in the smp queue. We simply use the request's own timeout
for this purpose.

Fixes: #5055
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200303131658.741720-1-bdenes@scylladb.com>
2020-03-03 15:39:58 +02:00
Kamil Braun
5de9b5b566 cdc: add change splitting test
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-03-03 13:31:19 +01:00
Kamil Braun
5c4a237c12 cdc: split the mutation before passing it into transform
If the mutation contains separate logical changes (e.g. with different
timestamps and/or ttls), it will be split into multiple mutations, each
passed into transform.
2020-03-03 13:17:51 +01:00
Kamil Braun
9924e3aa34 cdc: reduce code duplication in augment_mutation_call
Now there's only one call to `transform`.
2020-03-03 13:17:51 +01:00
Kamil Braun
24a32a13b5 cdc: retrieve preimage anytime there are static/clustered row updates
Previously we wouldn't retrieve the preimage if the mutation contained
something different than static/clustered row updates, e.g. if it
contained a partition deletion.

However, there are mutations created from batch statements which can
contain both a partition deletion and a set of row updates with a later
timestamp. We want to retrieve the preimage too in this case.
2020-03-03 13:17:51 +01:00
Kamil Braun
529d30ef66 cdc: add split function
This function takes a mutation and returns a set of mutations, each
representing a separate change with a single timestamp and ttl.
2020-03-03 13:17:51 +01:00
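As a rough illustration of the splitting described in these cdc commits, the sketch below groups the updates of one mutation by their (timestamp, ttl) pair. The `cell_update` struct is a heavily simplified, hypothetical stand-in for a real mutation, and the three functions only borrow the names used in the commit titles; they are not the implementation from the patches.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified model of one write inside a mutation.
struct cell_update {
    std::string column;
    std::int64_t timestamp;
    std::int32_t ttl;
};

using change_key = std::pair<std::int64_t, std::int32_t>;  // (timestamp, ttl)
using change_set = std::map<change_key, std::vector<cell_update>>;

// Groups the updates into logical changes, one per distinct (timestamp, ttl).
inline change_set extract_changes(const std::vector<cell_update>& mutation) {
    change_set changes;
    for (const auto& u : mutation) {
        changes[change_key{u.timestamp, u.ttl}].push_back(u);
    }
    return changes;
}

// True when the mutation carries more than one logical change.
inline bool should_split(const std::vector<cell_update>& mutation) {
    return extract_changes(mutation).size() > 1;
}

// Returns one "mutation" per logical change, ordered by (timestamp, ttl).
inline std::vector<std::vector<cell_update>> split(const std::vector<cell_update>& mutation) {
    std::vector<std::vector<cell_update>> out;
    for (auto& entry : extract_changes(mutation)) {
        out.push_back(std::move(entry.second));
    }
    return out;
}
```

For example, four updates with keys (10, 0), (10, 0), (20, 0) and (20, 60) would be split into three parts, the first of which holds two updates.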
Kamil Braun
132ea89c32 cdc: add extract_changes function
This commit introduces a bunch of new structs describing a change made
to a table, and an `extract_changes` function which takes a mutation and
returns the set of changes contained in this mutation, separated by
timestamp and ttl.
2020-03-03 13:17:51 +01:00
Kamil Braun
b5c944370e cdc: add should_split function
The function checks if there are multiple timestamps and/or ttls inside
a mutation, which means separate changes should be created for this
mutation in CDC.
2020-03-03 13:17:50 +01:00
Konstantin Osipov
48f09b95d0 test.py: remove log output on success unless -s is specified
Log output is saved by the build system and can take a lot of
space. Remove it unless -s is specified.
2020-03-03 13:59:14 +03:00
Konstantin Osipov
ae2820a1c7 test.py: do not store entire log output in junit report.
This makes report very heavy and is suspected to corrupt
XML output.
2020-03-03 13:59:14 +03:00
Nadav Har'El
359b32fb63 merge: CDC: implement new column format and naming
Merged pull request https://github.com/scylladb/scylla/pull/5910 by
Calle Wilund:

Rename metadata and data columns according to new spec

Also use transformation methods for names in all code + tests
to make switching again easier

Break up data column tuple

Data column is now pure frozen original type.

    If column is deleted (set to null), a metadata column cdc$deleted_<name> is set to true, to distinguish null column == not involved in row operation
    For non-atomic collections, a cdc$deleted_elements_<name> column is added, and when removing elements from collection this is where they are shown.
    For non-atomic assign, the "cdc$deleted_<name>" is true, and <name> is set to new value.

column_op removed.
2020-03-03 12:36:16 +02:00
Pavel Emelyanov
4fa12f2fb8 header: De-bloat schema.hh
The header sits in many other headers, but there's a handy
schema_fwd.hh that's tiny and contains needed declarations
for other headers. So replace schema.hh with schema_fwd.hh
in most of the headers (and remove completely from some).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200303102050.18462-1-xemul@scylladb.com>
2020-03-03 11:34:00 +01:00
Piotr Jastrzebski
f105f43008 commitlog: remove FIXME
In segment_manager::on_timer() there's a FIXME
to stop discarding future returned from sync()
but sync() does not return any future so it's safe
to remove the FIXME and stop casting to (void).

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <6d6d819cb2972e47e5f3fbe7b896499c64b09e53.1583230579.git.piotr@scylladb.com>
2020-03-03 12:21:56 +02:00
Calle Wilund
ed0d1c5fe2 cdc: Break up data column tuple
According to "new" spec:

Data column is now pure frozen original type.

If column is deleted (set to null), a metadata column
cdc$deleted_<name> is set to true, to distinguish
null column == not involved in row operation

For non-atomic collections, a cdc$deleted_elements_<name>
column is added, and when removing elements from collection
this is where they are shown.

For non-atomic assign, the "cdc$deleted_<name>" is true,
and <name> is set to new value.

column_op removed.
2020-03-03 08:52:20 +00:00
Rafael Ávila de Espíndola
28e59566a8 cql_test_env: Don't use a shared_ptr for token_metadata
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-03-02 13:52:23 -08:00
Rafael Ávila de Espíndola
47f8a63279 cql_test_env: Don't use a shared_ptr for migration_notifier
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-03-02 13:51:45 -08:00
Rafael Ávila de Espíndola
ed0c4d2801 cql_test_env: Don't use a shared_ptr for view_update_generator
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-03-02 13:51:25 -08:00
Rafael Ávila de Espíndola
ff2edd15d4 cql_test_env: Don't use a shared_ptr for view_builder
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-03-02 13:50:48 -08:00
Rafael Ávila de Espíndola
9375478803 cql_test_env: Don't use a shared_ptr for feature_service
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-03-02 13:50:25 -08:00
Rafael Ávila de Espíndola
5e87562f33 cql_test_env: Don't use a shared_ptr for database
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-03-02 13:50:08 -08:00
Rafael Ávila de Espíndola
a4b7de4d5d cql_test_env: Don't use a shared_ptr for auth::service
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-03-02 13:49:46 -08:00
Botond Dénes
8a1c8ce8a6 mutation_partition: make query_result_builder safely movable
`query_result_builder` is movable but if you actually try to move it
after it having consumed some fragments it will blow up in your face
when you try to use it again. This is because its `mutation_querier`
member received a reference to its `query::result::partition_writer`. Of
course the reference to the latter was invalidated on move so the former
accessed invalid memory. Since `query::result::partition_writer` wasn't
actually used for anything else, just move it into the
`mutation_querier`, making `query_result_builder` actually safe to move.

Fixes: #3158
Message-Id: <20190830142601.51488-1-bdenes@scylladb.com>
2020-03-02 18:46:59 +01:00
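The shape of this fix can be sketched as follows: instead of one member holding a reference to a sibling member (which dangles after a move), the referenced object is owned by the member that uses it. The classes below are simplified, hypothetical stand-ins, not the real `query_result_builder`, `mutation_querier` or `query::result::partition_writer`.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

struct partition_writer {
    std::vector<int> rows;
};

// Fragile layout, shown only for contrast (not compiled):
//   struct fragile_builder {
//       partition_writer writer;
//       querier_with_ref q{writer};  // q keeps a reference to a sibling member
//   };
// After a move, the new object's q would still refer to the writer inside
// the moved-from object.

// Safe layout: the querier owns its writer, so moving the querier moves
// the writer with it and nothing is left pointing at the old object.
class querier {
    partition_writer _writer;
public:
    void consume(int row) { _writer.rows.push_back(row); }
    std::size_t row_count() const { return _writer.rows.size(); }
};

class result_builder {
    querier _q;
public:
    void consume(int row) { _q.consume(row); }
    std::size_t row_count() const { return _q.row_count(); }
};
```

With this layout a builder that has consumed one row can be moved and then consume a second row, after which it reports two rows, because no member refers back into the moved-from object.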
Botond Dénes
4da0a1d397 docs/debugging.md: mention another method of helping gdb find sources
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200225124701.80706-1-bdenes@scylladb.com>
2020-03-02 18:26:29 +01:00
Pavel Emelyanov
86ca4b83d0 Revert "Revert "features: Stop on shutdown""
This reverts commit 165913598b.
2020-03-02 19:56:18 +03:00
Pavel Emelyanov
0a10e9787e features: Remove future-based when_enabled()
This API is considered to be error-prone, all users of it
are reworked, so let's drop it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-03-02 19:55:52 +03:00
Pavel Emelyanov
e63f5187b2 system_keyspace: Rework migrate_truncation_records feature subscription
The function in question uses future-based .when_enabled() subscription
on cluster_supports_truncation_table feature. This method is considered
to be unsafe, so here's the patch that changes it onto feature::listener.

The completion of the migration is only awaited by a single test, so
this waiting mechanism is also slightly simplified.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-03-02 19:55:28 +03:00
Tomasz Grabiec
e17db536fd Merge "lwt: support LIKE operator in conditional expressions" from Alejo
Support LIKE operator condition on column expressions.

NOTE: following the existing code, the LIKE pattern value is converted
      to raw bytes and passed straight as bytes_view to like_matcher
      without type checking; it should be checked/sanitized by caller.

Refs: #5777

Branch URL: https://github.com/alecco/scylla/tree/as_like_condition_2

Tests: unit ({dev}), unit ({debug})

NOTE: fail for unrelated test test_null_value_tuple_floating_types_and_uuids
2020-03-02 17:36:57 +01:00
Botond Dénes
6218153543 scylla-gdb.py: introduce collection_element()
Extracting a certain element from a collection is a common task I have
to do while debugging cores. For certain collections (c-array,
std::array) this is trivial, for others it is easy enough (std::vector),
but for some (std::list) this is a tiresome work-intensive process.
This convenience function allows getting a reference to any element of
the supported container types, returning them for further use in the
interactive session.
Currently only `std::list` and `std::vector` are supported.
2020-03-02 16:28:49 +01:00
Botond Dénes
94352b3426 scylla-gdb.py: generalize dereference_lw_shared_ptr()
To be a generic convenience function for dereferencing all sorts of
smart pointers. For now `std::unique_ptr`, `seastar::lw_shared_ptr` and
`seastar::foreign_ptr` are supported.
2020-03-02 16:28:04 +01:00
Botond Dénes
b6f8a6fbd3 test/boost: sstable_datafile_test: sstable_scrub_test: stop table
`table` is not registered with the database, and hence will not be
waited on during shutdown.
Stop it explicitly to prevent any asynchronous operation on it racing
with shutdown.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200302142845.569638-1-bdenes@scylladb.com>
2020-03-02 16:20:00 +01:00
Calle Wilund
b6443e44b9 set: Make set_type_impl::serialize_partially_deserialized_form static
Conform with map + does not require any instance info.
2020-03-02 14:43:34 +00:00
Pavel Solodovnikov
64451e5f51 cql3: minor cleanups regarding cql3::attributes::raw class
* Mark cql3::attributes::raw class as final
 * Change every occurrence of ::shared_ptr<attributes::raw>
   to std::unique_ptr<...>
 * Mark all methods in cql3::attributes::raw as const
 * Remove redundant "_attrs" ptr copy in insert_json_statement,
   use one from raw::modification_statement
 * Fix odd indentation in cql3/statements/update_statement.cc

Tests: unit-tests (dev, debug)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200301223708.99883-1-pa.solodovnikov@scylladb.com>
2020-03-02 13:26:01 +01:00
Tomasz Grabiec
51cfd13f8c gdb: Fix get_local_tasks()
chunked_vector holds task* directly after seastar commit
bcb5cf3a8dca19be0e577ee4e3bcd246f949dce6.

Reviewed-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200227171722.7189-1-tgrabiec@scylladb.com>
2020-03-02 12:02:19 +02:00
Tomasz Grabiec
57a3f3e36b gdb: Fix std_variant::get() when index is > 0
_get_next() was recursively calling itself with index - 1 if index was
> 0. When we reached the desired element we always tried to use
member_types[0] as the type, which is incorrect since member_types
contains all types and doesn't change in get().

Fix by replacing recursion with iteration so that we keep the original
index.

Reviewed-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <1582900804-18681-1-git-send-email-tgrabiec@scylladb.com>
2020-03-02 11:59:19 +02:00
Tomasz Grabiec
4942c4c22b gdb: Drop class keyword when constructing type name in seastar_lw_shared_ptr
I encountered a case when template type name is not resolved when
"class " is present.

Reviewed-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <1582900998-19267-1-git-send-email-tgrabiec@scylladb.com>
2020-03-02 11:58:44 +02:00
Calle Wilund
1085860c62 cdc: Rename metadata and data columns according to new spec
Also use transformation methods for names in all code + tests
to make switching again easier
2020-03-02 09:34:51 +00:00
Piotr Sarna
c62863cf69 alternator: restore verbose parsing error messages
When wrapping rapidjson routines with safer, yieldable code,
parsing information was lost, because the JSON reader was not
checked for parsing errors before further processing.
That resulted in nearly all parsing errors being reduced to
"Assertion failed: StackSize() != 1". After this patch,
all various errors (missing quotations, colons, object names,
etc.) are properly returned to the user.

Message-Id: <968ce2f7539bf33d3eb829f0ab431b788d291602.1583134221.git.sarna@scylladb.com>
2020-03-02 11:29:09 +02:00
Tomasz Grabiec
4c0ddf3a2d gdb: Introduce 'scylla features' command
Reviewed-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <1582901194-19903-1-git-send-email-tgrabiec@scylladb.com>
2020-03-02 11:28:13 +02:00
Nadav Har'El
ba536dbc95 alternator-test: don't warn about not verifying SSL certificates
When running the Alternator tests, we don't care about verifying the
pedigree of the SSL certificates - we actually know the ones we use
in our test setups are fake, and not signed by any respectable certificate
authority.

We already use "verify=False" in many requests to avoid the certificate
checking, but then we start getting scary-looking warning messages about
an "Unverified HTTPS request is being made.". There's a way to disable
these warnings, but we only did so in some cases, and there were still some
tests that show these warnings. Let's do it once, in a way that affects
all tests.
Message-Id: <20200301175607.8841-1-nyh@scylladb.com>
2020-03-01 22:59:20 +01:00
Juliusz Stasiewicz
cf24ae86f3 cdc: distinguishing update from insert
When an incoming mutation contains a live row marker, the `operation` is
described as "insert", not as an "update".

Also, I extended the test case "test_row_delete" with one insert,
which is expected to log different value of `operation` than update
or delete. Renamed the test case accordingly.

Test cases that relied on "update" being the same as "insert" are
updated accordingly (`test_pre_image_logging`, `test_cdc_across_shards`,
`test_add_columns`).

Fixes #5723
2020-03-01 17:50:08 +02:00
Avi Kivity
157fe4bd19 Merge "Remove default timeouts" from Botond
"
Timeouts defaulted to `db::no_timeout` are dangerous. They allow any
modifications to the code to drop timeouts and introduce a source of
unbounded request queue to the system.

This series removes the last such default timeouts from the code. No
problems were found, only test code had to be updated.

tests: unit(dev)
"

* 'no-default-timeouts/v1' of https://github.com/denesb/scylla:
  database: database::query*(), database::apply*(): remove default timeouts
  database: table::query(): remove default timeout
  mutation_query: data_query(): remove default timeout
  mutation_query: mutation_query(): remove default timeout
  multishard_mutation_query: query_mutations_on_all_shards(): remove default timeout
  reader_concurrency_semaphore: wait_admission(): remove default timeout
  utils/logalloc: run_when_memory_available(): remove default timeout
2020-03-01 17:29:17 +02:00
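The effect of dropping a defaulted timeout parameter can be sketched with the standard library alone. In the sketch below `no_timeout` is a hypothetical analogue of `db::no_timeout`, and `query()` is an illustrative placeholder, not the signature of any of the functions listed in the merge above.

```cpp
#include <cassert>
#include <chrono>

using clock_type = std::chrono::steady_clock;
using timeout_point = clock_type::time_point;

// Hypothetical analogue of db::no_timeout: a deadline that is never reached.
inline constexpr timeout_point no_timeout = timeout_point::max();

// Before (error-prone): a caller could silently omit the deadline.
//   bool query(int key, timeout_point timeout = no_timeout);

// After: every call site has to state its deadline explicitly.
// Returns false when the deadline has already passed.
inline bool query(int key, timeout_point timeout) {
    (void)key;  // the key is irrelevant to this sketch
    return clock_type::now() < timeout;
}
```

A caller that really wants an unbounded wait can still write `query(1, no_timeout)`, but now that choice is visible at the call site and cannot be introduced by accident when code is refactored.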
Alejo Sanchez
c3b157a80b lwt: support LIKE operator in conditional expressions
Adds support of LIKE operator in conditional column expressions.

Refs: #5777

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-03-01 14:22:10 +01:00
Piotr Sarna
2137017bc3 alternator: revert to ValidationException for JSON errors
Both rapidjson library and DynamoDB induce enough corner cases
for incorrect JSON, that the simplest way out is to simply
conform back to ValidationException in all cases.
This commit comes with an updated test, which is now aware
of 3 possible outcomes for an incorrect JSON:
a ValidationException, a SerializationException and HTTP 404.
Message-Id: <5e39d2dc077f4ea5ce360035a4adcddaf3a342a0.1582876734.git.sarna@scylladb.com>
2020-03-01 14:35:20 +02:00
Avi Kivity
1ed06cdb7c Revert "dist/common/scripts/scylla_coredump_setup: bind-mount coredump directory, add coredump test"
This reverts commit 65aadad9a6. It causes
crashes (due to the coredump test) during package install, since scylla_coredump_setup
is called from rpm postinstall. The test should be done only from scylla_setup (and
the user should be warned).

Fixes #5916.
2020-03-01 14:32:31 +02:00
Avi Kivity
db544db5e2 Merge "Convert a few APIs to std::string_view" from Rafael
"
As part of avoiding static initialization order problems I want to
switch a few global sstring to constexpr std::string_view. The
advantage being that a constexpr variable doesn't need runtime
initialization and therefore cannot be part of a static initialization
order problem.

In order to do the conversion I needed to convert a few APIs to use
std::string_view instead of sstring and const sstring&.

These patches are the simple cases that are also an improvement in
their own right.
"

* 'espindola/string_view' of https://github.com/espindola/scylla: (22 commits)
  test: Pass a string_view to create_table's callback
  Pass string_view to the schema constructor
  cql3: Pass string_view to the column_specification constructor
  Pass string_view to keyspace_metadata::new_keyspace
  Pass string_view to the keyspace_metadata constructor
  utils: Use std::string as keys in nonstatic_class_registry
  utils: Pass a string_view to class_registry::to_qualified_class_name
  auth: Return a string_view from authorizer::qualified_java_name
  auth: Return a string_view from authenticator::qualified_java_name
  utils: Pass string_view to is_class_name_qualified
  test: Pass a string_view to create_keyspace
  Pass string_view to no_such_column_family's constructor
  perf_simple_query: Pass a string_view to make_counter_schema
  Pass string_view to the schema_builder constructor
  types: Add more data_value constructors
  transport: Pass a string_view to cql_server::connection::make_autheticate
  transport: Pass a string_view to cql_server::response::write_string
  cql3: Pass std::string_view to query_processor::compute_id
  cql3: Remove unused variable
  cql3: Pass a string_view to cf_statement::prepare_keyspace
  ...
2020-03-01 14:22:28 +02:00
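The motivation given in this merge (a constexpr `std::string_view` needs no runtime initialization, so it cannot take part in a static initialization order problem) can be sketched as below. The names `system_keyspace_name` and `is_system_keyspace` are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Constant-initialized: no constructor runs at program startup, so other
// globals can safely use it during their own initialization.
inline constexpr std::string_view system_keyspace_name = "system";
static_assert(system_keyspace_name.size() == 6);

// Taking std::string_view accepts string literals, std::string and other
// views without forcing the caller to construct an owning string.
inline bool is_system_keyspace(std::string_view name) {
    return name == system_keyspace_name;
}
```

Both `is_system_keyspace("system")` and a call with a `std::string` argument compile unchanged, which is why converting the APIs to `std::string_view` is a prerequisite for converting the globals.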
Rafael Ávila de Espíndola
b3d396ea1f utils: Use on_internal_error from seastar
With this change abort_on_internal_error is enabled on every
SEASTAR_TEST_CASE.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200227164823.21021-1-espindola@scylladb.com>
2020-02-29 19:28:57 +02:00
Pavel Emelyanov
3ab43eba01 validation: Cleanup validate_keyspace helpers
One of them uses global storage_proxy instance, but since
it is not used -- remove it not to encourage anybody to start
calling one.

Another call uses the db.find_keyspace to check if a keyspace
exists, while there's a nicer db.has_keyspace helper (which
doesn't throw exceptions) so use it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200228123644.13931-1-xemul@scylladb.com>
2020-02-29 19:28:57 +02:00
Rafael Ávila de Espíndola
80bfe91a20 test: Pass a string_view to create_table's callback
This gives more flexibility to the create_table implementation.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-02-28 17:04:12 -08:00
Rafael Ávila de Espíndola
151f5e723f Pass string_view to the schema constructor
This moves string copies from the callers of the constructor to the
implementation.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-02-28 17:04:12 -08:00
Rafael Ávila de Espíndola
fba071163e cql3: Pass string_view to the column_specification constructor
This moves sstring copies from the callers to the constructor
implementation.

While at it, move the implementation out-of-line.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-02-28 17:04:12 -08:00
Rafael Ávila de Espíndola
ba453d832b Pass string_view to keyspace_metadata::new_keyspace
This avoids a few sstring copies.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-02-28 17:04:12 -08:00
Rafael Ávila de Espíndola
94d07fba07 Pass string_view to the keyspace_metadata constructor
This avoids a few sstring copies when constructing keyspace_metadata.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-02-28 17:04:12 -08:00
Rafael Ávila de Espíndola
01fe766f1f utils: Use std::string as keys in nonstatic_class_registry
The sstring::compare function was never updated to work with
std::string_view. We could fix that, but it seems better to just
switch to std::string.

With a working compare function we can avoid copying the argument
passed to to_qualified_class_name when an entry is found in the map.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-02-28 17:04:08 -08:00
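Avoiding the key copy on lookup can be sketched with a transparent comparator. This is an assumption about the general technique, not the registry's real layout: `class_registry_map` and `lookup_class` are hypothetical names, and the real container may differ.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <string_view>

// Hypothetical registry. std::less<> is a transparent comparator, so
// find() accepts a std::string_view without constructing a std::string.
using class_registry_map = std::map<std::string, int, std::less<>>;

// Returns the registered id, or -1 when the name is unknown.
inline int lookup_class(const class_registry_map& registry, std::string_view name) {
    auto it = registry.find(name);  // heterogeneous lookup, no key copy
    return it == registry.end() ? -1 : it->second;
}
```

With a non-transparent comparator the same `find()` call would first have to materialize a `std::string` from the view.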
Rafael Ávila de Espíndola
31985d3c28 utils: Pass a string_view to class_registry::to_qualified_class_name
This just moves a string copy from the caller to the implementation.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-02-28 13:30:00 -08:00
Rafael Ávila de Espíndola
df4f1a3bc3 auth: Return a string_view from authorizer::qualified_java_name
This gives more flexibility to the implementations as they now don't
need to construct a sstring.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-02-28 11:45:22 -08:00
Rafael Ávila de Espíndola
c29f8caafc auth: Return a string_view from authenticator::qualified_java_name
This gives more flexibility to the implementations as they now don't
need to construct a sstring.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-02-28 11:32:36 -08:00
Rafael Ávila de Espíndola
fae05e9268 utils: Pass string_view to is_class_name_qualified
With this we don't need to construct a sstring just to call
is_class_name_qualified.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-02-28 08:36:27 -08:00
Rafael Ávila de Espíndola
0b57bddb3e test: Pass a string_view to create_keyspace
With this we don't need to construct a sstring just to call
create_keyspace.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-02-28 08:36:27 -08:00
Rafael Ávila de Espíndola
2b96abcece Pass string_view to no_such_column_family's constructor
With this we don't have to construct a sstring to construct a
no_such_column_family.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-02-28 08:36:27 -08:00
Rafael Ávila de Espíndola
2679c0cc87 perf_simple_query: Pass a string_view to make_counter_schema
With this we don't need to construct a sstring just to call
make_counter_schema.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-02-28 08:36:27 -08:00
Rafael Ávila de Espíndola
9ab2346e7f Pass string_view to the schema_builder constructor
With this we don't need to construct a sstring just to construct a
schema_builder.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-02-28 08:36:27 -08:00
Rafael Ávila de Espíndola
93de9597bf types: Add more data_value constructors
With this we can construct a data_value from any string type. This
also avoids a few sstring copies.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-02-28 08:36:27 -08:00
Rafael Ávila de Espíndola
c51d81341b transport: Pass a string_view to cql_server::connection::make_autheticate
With this we don't need to construct a sstring just to call
make_autheticate.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-02-28 08:36:27 -08:00
Rafael Ávila de Espíndola
c2c44f4778 transport: Pass a string_view to cql_server::response::write_string
With this we don't need to construct a sstring just to call
write_string.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-02-28 08:36:27 -08:00
Rafael Ávila de Espíndola
4adefd9a76 cql3: Pass std::string_view to query_processor::compute_id
With this we don't need to construct a sstring just to call
compute_id.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-02-28 08:36:27 -08:00
Rafael Ávila de Espíndola
f44a5255da cql3: Remove unused variable
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-02-28 08:36:27 -08:00
Rafael Ávila de Espíndola
9e00f1e23b cql3: Pass a string_view to cf_statement::prepare_keyspace
This avoids a copy in the callers. While at it, also make this
function non-virtual since it is never overridden.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-02-28 08:36:27 -08:00
Rafael Ávila de Espíndola
2fd3ec8d6f cql3: Pass a string_view to keyspace_element_name::set_keyspace
With this we don't need to construct a sstring just to call
set_keyspace.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-02-28 08:36:27 -08:00
Rafael Ávila de Espíndola
35089447cd cql3: Pass a string_view to keyspace_element_name::to_internal_name
This moves the string copy from the callers to the implementation of
to_internal_name.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-02-28 08:36:27 -08:00
Botond Dénes
5b0cfbb51f test/boost/mutation_reader_test: test_multishard_streaming_reader: use caller's priority class
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200228073239.475778-1-bdenes@scylladb.com>
2020-02-28 16:39:30 +01:00
Avi Kivity
134d5a5f75 Merge "flat_mutation_reader: abort reverse reads when size of mutation exceeds limit" from Botond
"
Reverse queries work by reading an entire partition into memory, then
emitting its rows in reverse order. It is easy to see how this can
lead to disasters combined with large partitions. In fact a handful of
such reverse queries on large partitions is enough to bring a node down.
To prevent this, abort reverse queries, when we find out that the size
of the partition is larger than a limit. This might be annoying to users,
but I'm sure it is not as annoying as their nodes going down.

The limit is configurable via `max_memory_for_unlimited_query`
configuration option, which is 1MB by default. This limit is propagated
to each table, system tables having no limit. This limit is planned to
be used by other queries capable of consuming unlimited amount of
memory, like unpaged queries. Not in this series.

The proper solution would be to read the data in reverse (#1413), but
that is a major effort. In the meantime, make sure the unsuspecting user
won't bring their nodes down with an innocent looking ordering
directive.

Note that for calculating the memory footprint of the
partition-in-question, only the clustering rows are used. This should be
fine, the 1MB limit is conservative enough that an eventual overshoot
caused by the omitted range tombstones and the static row would not make
a big difference.

Fixes: #5804
"

* 'limit-reverse-query-memory-consumption/v3' of https://github.com/denesb/scylla:
  flat_mutation_reader: make_reversing_reader(): add memory limit
  db/config: add config memory limit of otherwise unlimited queries
  utils::updateable_value: add operator=(T)
  flat_mutation_reader: expose reverse reader as a standalone reader
2020-02-28 07:57:13 +02:00
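The idea of aborting a reverse read once the buffered partition exceeds a limit can be sketched as below. This is a simplified model under stated assumptions: rows are plain strings, their footprint is approximated by their byte length, and `reverse_partition` is a hypothetical name, not the `make_reversing_reader()` interface.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical simplification: buffer the rows of one partition, track
// the memory they use, and give up as soon as the limit is exceeded.
inline std::vector<std::string> reverse_partition(
        const std::vector<std::string>& rows, std::size_t max_memory) {
    std::vector<std::string> buffered;
    std::size_t used = 0;
    for (const auto& row : rows) {
        used += row.size();
        if (used > max_memory) {
            // Abort the read instead of buffering without bound.
            throw std::runtime_error("reverse read aborted: partition exceeds memory limit");
        }
        buffered.push_back(row);
    }
    // Only now can the rows be emitted in reverse order.
    return std::vector<std::string>(buffered.rbegin(), buffered.rend());
}
```

The check runs while buffering, so an oversized partition is rejected before it is fully read into memory.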
Rafael Ávila de Espíndola
e670dfc0cd auth: Fix static initialization order problem
A static constructor was used to initialize update_row_query. That
constructor would call meta::roles_table::qualified_name() which would
access AUTH_KS which is also initialized by a static constructor in
another file, so the construction order is not guaranteed.

This change turns update_row_query into a function with a static local
variable in it. The static local is initialized at first use, fixing
the problem.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200227163916.19761-1-espindola@scylladb.com>
2020-02-28 07:57:13 +02:00
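The fix described above relies on function-local statics being initialized on first use. A minimal sketch, with hypothetical keyspace and query strings standing in for the real ones:

```cpp
#include <cassert>
#include <string>

// Hypothetical value. The static local is initialized the first time
// control passes through its declaration, not at program start-up.
inline const std::string& auth_keyspace() {
    static const std::string name = "example_ks";
    return name;
}

// Safe even when called from another translation unit's static
// initializer: auth_keyspace() builds its value on demand, so the
// order of static constructors across files no longer matters.
inline const std::string& update_row_query() {
    static const std::string query =
        "UPDATE " + auth_keyspace() + ".roles SET can_login = ? WHERE role = ?";
    return query;
}
```

Since C++11 the initialization of a static local is also thread-safe, so no extra locking is needed for the first call.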
Nadav Har'El
7953f7c65f merge "alternator: Make parsing yieldable"
Merged patch series by Piotr Sarna:

This series makes json parsing yieldable in order
to prevent reactor stalls. It's done by:
 1. Extracting the parsing stage out of alternator executor
 2. Moving the parsing stage to a separate service,
    which uses a static seastar thread (parallelism: 1)
 3. Wrapping rjson parsing routines with a yieldable parser,
    which takes advantage of running in a seastar thread
    and occasionally performs maybe_yield()

Step 2 above is only used for JSON documents big enough to potentially
create stalls - small requests will be parsed immediately,
without being redirected to a static thread.

Handling a PutItem operation with large JSONs
on my machine takes approximately:
 1MB doc:  ~30ms
 3MB doc:  ~90ms
 12MB doc: ~350ms

out of which parsing itself is around:
 1MB doc:  ~7ms
 3MB doc:  ~20ms
 12MB doc: ~80ms
 (bonus: 400KiB doc: ~2ms)

; the document was a single object full of small items,
which triggers many allocations during parsing.
The above numbers were roughly the same before and after
the series, but the 12MB document did not cause reactor
stalls after the patch.
Note: writing the JSON can still be a source of stalls,
especially for large documents.
Note2: DynamoDB limits single value size to 400KiB,
       but for batches it will be 16MiB total request size
Note3: If parallelism ever proves to be an issue,
       it's easily increasable by spawning more static threads.

Refs: #5742
Tests: alternator(local)
       manual

Piotr Sarna (12):
  alternator: break lines in server callbacks
  alternator: allow moving the request from rmw operation
  alternator: move parsing in front of executor
  alternator: convert parse to std::string_view
  alternator: implement json parser inside the server
  alternator: remove rjson::parse_raw
  alternator: make rjson yieldable in thread context
  alternator: fix returning raw JSON errors
  alternator: change json errors class to SerializationException
  alternator-test: rename large requests test to 'manual requests'
  alternator-test: extract getting signed request helper
  alternator-test: add tests for incorrect JSON documents

 ...ge_requests.py => test_manual_requests.py} |  53 +++--
 alternator/executor.cc                        | 203 ++++++++----------
 alternator/executor.hh                        |  33 +--
 alternator/rjson.cc                           |  47 +++-
 alternator/rjson.hh                           |   7 +-
 alternator/rmw_operation.hh                   |   1 +
 alternator/serialization.cc                   |   9 +-
 alternator/server.cc                          | 111 ++++++++--
 alternator/server.hh                          |  20 +-
 9 files changed, 310 insertions(+), 174 deletions(-)
 rename alternator-test/{test_large_requests.py => test_manual_requests.py} (70%)
2020-02-28 07:57:13 +02:00
Benny Halevy
b31867eafa types: tri_compare: turn marshal_exception to on_internal_error
We see this exception on gemini testing with large number of pk, ck, columns, for example:
  2020-02-19T17:52:54+00:00  gemini-8h-large-num-columns-GeminiL-db-node-f2d6a8e0-3 !ERR     | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.207.169: std::runtime_error (marshaling error: read_simple_exactly - size mismatch (expected 4, got 1) Backtrace:   0x2c4f08d#012  0x9fcd3e#012  0x444b28#012  0x4d8fe5#012  0xa78e8b#012  0xeab269#012  0xc27a67#012  0xc28239#012  0xc600e3#012  0xadebf3#012  0xae14c1#012  0x29ff291#012  0x29ff49f#012  0x2a3fc65#012  0x29a5d6f#012  0x29a6e9e#012  0x72a4e3#012  /opt/scylladb/libreloc/libc.so.6+0x271a2#012  0x77548d#012)

Decoded backtrace:
  seastar::current_backtrace() at crtstuff.c:?
  seastar::internal::backtraced<marshal_exception>::backtraced<seastar::basic_sstring<char, unsigned int, 15u, true> >(seastar::basic_sstring<char, unsigned int, 15u, true>&&) at crtstuff.c:?
  void seastar::throw_with_backtrace<marshal_exception, seastar::basic_sstring<char, unsigned int, 15u, true> >(seastar::basic_sstring<char, unsigned int, 15u, true>&&) at crtstuff.c:?
  abstract_type::compare(std::basic_string_view<signed char, std::char_traits<signed char> >, std::basic_string_view<signed char, std::char_traits<signed char> >) const [clone .cold] at types.cc:?
  bound_view::tri_compare::operator()(clustering_key_prefix const&, int, clustering_key_prefix const&, int) const at crtstuff.c:?
  sstables::sstable_mutation_reader<sstables::data_consume_rows_context_m, sstables::mp_row_consumer_m>::fast_forward_to(position_range, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at crtstuff.c:?
  mutation_reader_merger::fast_forward_to(position_range, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at crtstuff.c:?
  combined_mutation_reader::fast_forward_to(position_range, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at crtstuff.c:?
  restricting_mutation_reader::fast_forward_to(position_range, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at crtstuff.c:?
  cache::cache_flat_mutation_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at crtstuff.c:?

This patch should help us get a core dump if this happens again.

Ref #5856

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200227131939.388770-1-bhalevy@scylladb.com>
2020-02-28 07:57:13 +02:00
Piotr Sarna
b461750ae3 alternator-test: add tests for incorrect JSON documents
The test case sends incorrectly formed JSON documents to alternator,
expecting a serialization exception as a response.
2020-02-28 07:57:12 +02:00
Raphael S. Carvalho
40e75fb109 streaming/stream_transfer_task: avoid pointless iterations in has_relevant_range_on_this_shard()
When has_relevant_range_on_this_shard() finds a relevant range, it unnecessarily
keeps iterating to the end. Verified manually that this could be thousands of pointless
iterations when streaming data to a node just added. The relevant code could be
simplified by de-futurizing it but I think it remains so to allow task scheduler
to preempt it if necessary.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200220224048.28804-2-raphaelsc@scylladb.com>
2020-02-28 07:57:12 +02:00
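The early exit can be illustrated with a synchronous sketch. The real function is futurized, so this is only an approximation of the control flow; `token_range`, `has_relevant_range` and the `visited` counter are hypothetical and exist purely for illustration.

```cpp
#include <cassert>
#include <vector>

struct token_range { int start; int end; };  // hypothetical, inclusive bounds

// Returns as soon as one range contains the token, so the remaining
// ranges are never examined. `visited` counts the ranges inspected.
inline bool has_relevant_range(const std::vector<token_range>& ranges,
                               int shard_token, int& visited) {
    for (const auto& r : ranges) {
        ++visited;
        if (r.start <= shard_token && shard_token <= r.end) {
            return true;  // stop here instead of iterating to the end
        }
    }
    return false;
}
```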
Piotr Sarna
79b04aeba9 alternator-test: extract getting signed request helper
A helper function for getting custom requests is extracted
to top-level, in order to be used later by other test cases.
2020-02-28 07:57:12 +02:00
Raphael S. Carvalho
8a986bc23b streaming/stream_transfer_task: avoid unnecessary copies of ranges
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200220224048.28804-1-raphaelsc@scylladb.com>
2020-02-28 07:57:12 +02:00
Piotr Sarna
ad48328407 alternator-test: rename large requests test to 'manual requests'
This test suite can then be the parent of tests which use custom,
potentially not validated input in order to test alternator
against data not easy to push via boto3 or Python, due to their
implementation details.
2020-02-28 07:57:12 +02:00
Piotr Sarna
ccdf519829 alternator: make alternator server sharded
Previously, alternator server was not directly sharded - and instead
kept a helper http server control class, which stored sharded http
server inside. That design is confusing and makes it hard to expand
alternator server with new sharded attributes, so from now on
the alternator server is itself sharded<>.

Tests: alternator-test(local, smp==1&smp==4)
Fixes #5913
Message-Id: <b50e0e29610c0dfea61f3a1571f8ca3640356782.1582788575.git.sarna@scylladb.com>
2020-02-28 07:57:12 +02:00
Piotr Sarna
c370586189 alternator: change json errors class to SerializationException
In order to be consistent with DynamoDB - a parsing error on incorrect
JSON input is reported as SerializationException instead of
ValidationException.
2020-02-28 07:57:12 +02:00
Piotr Sarna
6f8c70d54b alternator: fix returning raw JSON errors
A couple of places in executor code leaked raw JSON errors to the user
instead of formulating a proper ValidationException message.
These places are now fixed, and the next patch in this series will
act as a regression checker, since all JSON errors will be returned
as SerializationException, not ValidationException instances.
2020-02-28 07:57:12 +02:00
Piotr Sarna
1be1cfc5d8 alternator: make rjson yieldable in thread context
In order to fight reactor stalls, rjson parsing and writing
routines can now yield if they run in seastar thread context.
In order to run a yieldable version of the parser which needs
to be run in seastar thread context, use parse_yieldable()
instead of parse().
2020-02-28 07:57:12 +02:00
Piotr Sarna
0af8516675 alternator: remove rjson::parse_raw
With parse() being based on std::string_view, there's not much
sense in keeping a separate parse_raw function, so it's deleted.
2020-02-28 07:57:12 +02:00
Piotr Sarna
aad6c01b98 alternator: implement json parser inside the server
The json parser runs in a static thread which accepts and parses
documents. Documents smaller than a parsing threshold
(currently: 16KiB) will be parsed in place without yielding.
The assumption is that most alternator requests are small
and there's no need to parse them in a yieldable way,
which also induces overhead. For reference, parsing a 128KiB
document made of many small objects with rapidjson takes
around 0.5 milliseconds, and a 16KiB document is parsed
in around 0.06ms - a value small enough not to disturb
Seastar's current task quota of 0.5ms too much.
2020-02-28 07:57:12 +02:00
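The size-based choice between the two parsing paths can be sketched as follows. Only the 16KiB figure comes from the commit message; `parse_path`, `yieldable_parsing_threshold` and `choose_parse_path` are hypothetical names, and the sketch does not touch Seastar or rapidjson.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

enum class parse_path { in_place, yieldable_thread };

// 16KiB, as quoted in the commit message above.
constexpr std::size_t yieldable_parsing_threshold = 16 * 1024;

// Small documents are parsed immediately; larger ones would be handed
// to the dedicated parsing thread, where the parser can yield.
inline parse_path choose_parse_path(std::string_view document) {
    return document.size() < yieldable_parsing_threshold
        ? parse_path::in_place
        : parse_path::yieldable_thread;
}
```

Keeping the decision in one small function makes the threshold easy to tune if the measured parse times change.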
Piotr Sarna
ffdbbc0ad0 alternator: convert parse to std::string_view
The original implementation used const std::string&,
which is less versatile.
2020-02-28 07:57:12 +02:00
Piotr Sarna
2402955d45 alternator: move parsing in front of executor
Parsing a request string into JSON happens as a first thing
in every request, so it can be performed before calling
any executor callbacks. The most important thing however,
is that making parsing a separate stage allows certain optimizations,
e.g. running all parsing in a single seastar thread, which allows
adding yields to rjson parsing later.
2020-02-28 07:57:12 +02:00
Piotr Sarna
c20432bcac alternator: allow moving the request from rmw operation
In order to elide copying the JSON value when rerouting
the operation to another shard - a way to move the parsed
request from the operation is added.
2020-02-28 07:57:12 +02:00
Piotr Sarna
c7a8549270 alternator: break lines in server callbacks
The lines are about to get longer, so they are broken
as a first step, to make the next commits more clear.
2020-02-28 07:57:12 +02:00
Botond Dénes
1073094f04 database: database::query*(), database::apply*(): remove default timeouts 2020-02-27 19:14:12 +02:00
Botond Dénes
2c1ee7b9cd database: table::query(): remove default timeout 2020-02-27 19:14:09 +02:00
Botond Dénes
8da88e6cb9 mutation_query: data_query(): remove default timeout 2020-02-27 19:02:40 +02:00
Botond Dénes
fdb45d16de mutation_query: mutation_query(): remove default timeout 2020-02-27 18:56:30 +02:00
Botond Dénes
72509911d9 multishard_mutation_query: query_mutations_on_all_shards(): remove default timeout 2020-02-27 18:45:15 +02:00
Botond Dénes
f6013a39ec reader_concurrency_semaphore: wait_admission(): remove default timeout 2020-02-27 18:43:12 +02:00
Botond Dénes
93039a085d utils/logalloc: run_when_memory_available(): remove default timeout 2020-02-27 18:36:32 +02:00
Botond Dénes
7bdeec4b00 flat_mutation_reader: make_reversing_reader(): add memory limit
If the reversing requires more memory than the limit, the read is
aborted. All users are updated to get a meaningful limit, from the
respective table object, with the exception of tests of course.
2020-02-27 18:11:54 +02:00
Botond Dénes
75efa707ce db/config: add config memory limit of otherwise unlimited queries
We have a few kinds of queries whose memory consumption is not limited at
all. One of these is reverse queries, which read entire partitions into
memory, before reversing them. These partitions can be larger than
memory and thus such a query can single-handedly cause OOM.
This patch introduces a configuration for a memory limit for such
queries. This will serve as a hard limit and queries which attempt to
use more memory than this, will be aborted.
The limit is propagated to table objects, with the intention of keeping
system tables unlimited. These tables are usually small and initiators
of system queries are not prepared for failures.
2020-02-27 18:11:54 +02:00
Botond Dénes
d1194da98d utils::updateable_value: add operator=(T)
Allow assigning a const value.
2020-02-27 18:11:54 +02:00
Botond Dénes
091d80e8c3 flat_mutation_reader: expose reverse reader as a standalone reader
Currently reverse reads just pass a flag to
`flat_mutation_reader::consume()` to make the read happen in reverse.
This is deceptively simple and streamlined -- while in fact behind the
scenes a reversing reader is created to wrap the reader in question to
reverse partitions, one-by-one.

This patch makes this apparent by exposing the reversing reader via
`make_reversing_reader()`. It also allows for more configuration to be
passed to the reversing reader (in the next patches).

This change is forward compatible, as in time we plan to add reversing
support to the sstable layer, in which case the reversing reader will
go.
2020-02-27 18:11:54 +02:00
Dejan Mircevski
0d7457946f cql3: Allow repeated LIKE on same column
No reason to disallow this.  We still forbid mixing LIKE and non-LIKE
relations on the same column.

Fixes #5902.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-02-27 09:34:51 -05:00
Pekka Enberg
109bb1baa6 cql3: Switch from distributed<> to seastar::sharded<>
Convert the last instance of "distributed<>" in cql3 to seastar::sharded<>.

Message-Id: <20200227092804.27374-1-penberg@scylladb.com>
2020-02-27 12:09:59 +02:00
Pekka Enberg
123b50cdb9 configure.py: Disable package registry when building Seastar
The CMake build system in seastar.git exports the package to CMake
package registry. However, we don't use it when building from scylla.git
(we link to seastar directly) and get the following warning when
building with "dbuild" (that does not bind mount $HOME/.cmake):

  CMake Warning at CMakeLists.txt:1180 (export):
    Cannot create package registry file:
      /home/penberg/.cmake/packages/Seastar/3b6ede62290636bbf1ab4f0e4e6a9e0b
    No such file or directory

Let's just disable the package registry for our builds by setting the
CMAKE_EXPORT_NO_PACKAGE_REGISTRY CMake option as discussed here to make
the warning go away:

  https://cmake.org/cmake/help/v3.4/variable/CMAKE_EXPORT_NO_PACKAGE_REGISTRY.html

Message-Id: <20200227092743.27320-1-penberg@scylladb.com>
2020-02-27 12:09:59 +02:00
Takuya ASADA
01a03c4d69 install.sh: run post-install script just like .rpm/.deb package
To install scylla using install.sh easily, we need to run the following things:
 - add scylla user/group
 - configure scylla.yaml
 - run scylla_post_install.sh

But we don't want to run them when we build a .rpm/.deb package,
so we also need to add a --packaging option to skip them.

Fixes #5830
2020-02-27 11:17:24 +02:00
Dejan Mircevski
acccab31f7 cql3: Forbid calling LIKE::values()
We were incorrectly returning the LIKE pattern as if it were a column
value.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-02-26 14:07:46 -05:00
Dejan Mircevski
fd583196ce cql3: Move LIKE::_last_pattern to matcher
Instead of keeping the LIKE pattern in a restriction object (as we
currently do), keep it in like_matcher.  Also move the
pattern-idempotence check from the restriction to the matcher.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-02-26 14:00:04 -05:00
Avi Kivity
956b092012 Merge "Repair based node operation" from Asias
"
Here is a simple introduction to the node operations scylla supports and
some of the issues.

 - Replace operation

It is used to replace a dead node. The token ring does not change. It
pulls data from only one of the replicas which might not be the
latest copy.

 - Rebuild operation

It is used to get all the data this node owns from other nodes. It
pulls data from only one of the replicas which might not be the
latest copy.

 - Bootstrap operation

It is used to add a new node into the cluster. The token ring
changes. It does not suffer from the "not the latest replica" issue. The
new node pulls data from existing nodes that are losing the token range.

It suffers from failed streaming. We split the ranges into 10 groups and
stream one group at a time. If a group fails it is restreamed, causing
unnecessary data transmission on the wire.

Bootstrap is not resumable. Consider a failure after 99.99% of the data
is streamed: if we restart the node, we need to stream all the data
again even though the node already has 99.99% of it.

 - Decommission operation

It is used to remove a live node from the cluster. The token ring
changes. It does not suffer from the "not the latest replica" issue. The
leaving node pushes data to existing nodes.

Like the bootstrap operation, it is not resumable.

 - Removenode operation

It is used to remove a dead node from the cluster. Existing nodes
pull data from other existing nodes for the new ranges they own. Each
pulls from one of the replicas, which might not be the latest copy.

To solve all the issues above, we could use repair based node operations.
The idea behind repair based node operations is simple: use repair to
sync data between replicas instead of streaming.

The benefits:

 - Latest copy is guaranteed

 - Resumable in nature

 - No extra data is streamed on wire
   E.g., rebuild twice, will not stream the same data twice

 - Unified code path for all the node operations

 - Free repair operation during bootstrap, replace operation and so on.

Fixes: #3003
Fixes: #4208
Tests: update_cluster_layout_tests.py + replace_address_test.py + manual test
"

* 'repair_for_node_ops' of https://github.com/asias/scylla:
  docs: Add doc for repair_based_node_ops
  storage_service: Enable node repair based ops for bootstrap
  storage_service: Enable node repair based ops for decommission
  storage_service: Enable node repair based ops for replace
  storage_service: Enable node repair based ops for removenode
  storage_service: Enable node repair based ops for rebuild
  storage_service: Use the same tokens as previous bootstrap
  storage_service: Add is_repair_based_node_ops_enabled helper
  config: Add enable_repair_based_node_ops
  repair: Add replace_with_repair
  repair: Add rebuild_with_repair
  repair: Add do_rebuild_replace_with_repair
  repair: Add removenode_with_repair
  repair: Add decommission_with_repair
  repair: Add do_decommission_removenode_with_repair
  repair: Add bootstrap_with_repair
  repair: Introduce sync_data_using_repair
  repair: Propagate exception in tracker::run
2020-02-26 20:37:25 +02:00
Avi Kivity
35e5772b94 Update seastar submodule
* seastar 7a3b4b4e4e...affc3a5107 (6):
  > Merge "Add the possibility to remove rules from routes" from Pavel
  > stall_detector: expose correct clock type to use
  > queue: add has_blocked_consumer() function
  > Merge "core: reduce memory use for idle connections" from Avi
  > testing: Enable abort_on_internal_error on tests
  > core: Add a on_internal_error helper
2020-02-26 19:21:24 +02:00
Rafael Ávila de Espíndola
17f12a8197 perf_simple_query: Call set_abort_on_internal_error(true)
We should never ignore an internal error in a perf test.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200225055745.321086-2-espindola@scylladb.com>
2020-02-26 18:22:05 +02:00
Rafael Ávila de Espíndola
c6897dcbea perf_simple_query: Simplify with seastar::thread
There is no reason not to use a seastar::thread in setup code.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200225055745.321086-1-espindola@scylladb.com>
2020-02-26 18:22:04 +02:00
Nadav Har'El
3e44356c9f alternator-test: fix tests failing with HTTPS
When we test Alternator on its HTTPS port (i.e., pytest --https),
we don't want requests to verify the pedigree of the SSL certificate.
Our "dynamodb" fixture (conftest.py) takes care of this for most of
the tests, but a few tests create their own requests and need to pass the
"verify=False" option on their own. In some tests, we forgot to do
this, and this patch fixes three tests which failed with "pytest --https".

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200226142330.27846-1-nyh@scylladb.com>
2020-02-26 15:29:24 +01:00
Nadav Har'El
cf8354f703 merge "cdc: Fix operation value for row deletes"
Merged pull request https://github.com/scylladb/scylla/pull/5897
from Juliusz Stasiewicz:

Column operation now contains operation::row_delete (== 2)
after queries like delete from tbl where pk=x and ck=y;. Before
this patch row deletes were treated as updates, which was incorrect
because updates do not contain row tombstones (and row deletes do).

Refs #5709
2020-02-26 16:26:34 +02:00
Juliusz Stasiewicz
f425f7d217 tests/cdc: added test for row delete <-> update differentiation 2020-02-26 12:32:16 +01:00
Juliusz Stasiewicz
836183b847 cdc: fix operation value for row deletes
Column `operation` now contains `operation::row_delete` (== 2)
after queries like `delete from tbl where pk=x AND ck=y;`. Before
this patch row deletes were treated as updates, which was incorrect
because updates do not contain row tombstones (and row deletes do).

Refs #5709
2020-02-26 11:58:50 +01:00
Nadav Har'El
6da4d65f12 merge: Fix alternator decommission/shutdown
Merged patch series from Piotr Sarna:

Alternator shutdown routines were only registered in main.cc,
but that is not enough - other operations, like decommission,
also rely on shutting down client servers.
In order to remedy the situation, a notion of client shutdown
listeners is introduced to storage service.
A shutdown listener implements a callback used by the storage
service when client servers need to shut down, and at the same
time it does not force storage service to keep a reference
for the client service itself.
NOTE: the interface can also be used later to provide
proper shutdown routines for redis and any other future APIs.

Fixes #5886
Tests: alternator-test(local, including a shutdown during the run)

Piotr Sarna (4):
  storage_service: make shutdown_client_servers() thread-only
  storage_service: add client shutdown hook
  main: make alternator shutdown hook-based
  main: reduce scope of alternator services

 main.cc                    | 18 +++++++++---------
 service/storage_service.cc | 22 +++++++++++++++++-----
 service/storage_service.hh | 15 ++++++++++++++-
 3 files changed, 40 insertions(+), 15 deletions(-)
2020-02-26 12:45:30 +02:00
Botond Dénes
a83cca93ff scylla-gdb.py: introduce std_deque
A python read-only container wrapper for std::deque.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200225184951.125129-1-bdenes@scylladb.com>
2020-02-26 11:20:50 +01:00
Takuya ASADA
65aadad9a6 dist/common/scripts/scylla_coredump_setup: bind-mount coredump directory, add coredump test
In some environments systemd-coredump does not work with a symlinked directory;
we can use a bind mount instead.
Also, it's better to check that systemd-coredump is working by generating a coredump.

Fixes #5753
2020-02-26 11:21:48 +02:00
Takuya ASADA
8e901636fc scylla_setup: fix --nic option on non-interactive mode
scylla_setup should not show the NIC selection prompt in non-interactive mode.

Fixes #5725

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2020-02-26 11:13:53 +02:00
Piotr Sarna
148456a741 main: reduce scope of alternator services
With the new shutdown routines in place, alternator executor
and server do not need to be declared outside of the `if` clause
which conditionally sets up alternator.
2020-02-26 08:45:07 +01:00
Piotr Sarna
33ce8379ba main: make alternator shutdown hook-based
In order to properly handle not only shutdown, but also
decommission, drain and similar operations, alternator
shutdown is now registered as a client shutdown hook,
which allows storage service to trigger its shutdown routines.

Fixes #5886
2020-02-26 08:44:56 +01:00
Piotr Sarna
8d499603aa storage_service: add client shutdown hook
The shutdown hook interface can be used later by additional
client interfaces (e.g. alternator, redis) to register
shutdown routines for various operations: Scylla shutdown,
node decommission, drain, etc. It also decouples
the services themselves from being part of the storage
service, since it's huge enough as it is.
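
The hook interface described above can be pictured with a minimal sketch. It
is written in Python for brevity; the class and method names are hypothetical
and do not mirror the actual C++ interface of storage_service.

```python
# Hypothetical sketch: names are illustrative and do not mirror Scylla's C++ API.
from typing import Callable, List, Tuple


class StorageService:
    """Keeps a list of client shutdown hooks instead of owning the clients."""

    def __init__(self) -> None:
        self._client_shutdown_hooks: List[Tuple[str, Callable[[], None]]] = []

    def register_client_shutdown_hook(self, name: str, hook: Callable[[], None]) -> None:
        # A client interface (e.g. alternator, redis) registers its own routine.
        self._client_shutdown_hooks.append((name, hook))

    def shutdown_client_servers(self) -> List[str]:
        # Called on shutdown, decommission, drain and similar operations.
        stopped = []
        for name, hook in self._client_shutdown_hooks:
            hook()
            stopped.append(name)
        return stopped
```

The point of the design is that the storage service only sees opaque
callables, so a new client interface does not have to become part of it.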
2020-02-26 08:44:35 +01:00
Piotr Sarna
171bc9a3df storage_service: make shutdown_client_servers() thread-only
The function is only ever called in thread context, so it's moved
from being future<>-based in order to ease future changes.
2020-02-26 08:18:42 +01:00
Nadav Har'El
0ab6c7fcef alternator: stricter checks for user-supplied attribute values
Until now, PutItem or UpdateItem could be used to insert almost any JSON
as an attribute's value - even those that do not match DynamoDB's typed
value specification.

Among other things, the new validation allows us to reject empty sets,
strings or byte arrays - which are (somewhat artificially) forbidden in
DynamoDB.

Also added tests for the empty sets, strings and byte arrays that should
be rejected.

Fixes #5896

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200225150525.4926-1-nyh@scylladb.com>
2020-02-26 08:12:26 +01:00
Nadav Har'El
6339f419ac alternator: removing all elements from a set should delete it
DynamoDB does not support empty sets. Operations which remove elements
from a set attribute should remove the attribute when the last item is
removed - not leave an empty set as it incorrectly does now.

Incidentally, the same patch fixes another bug - deleting elements from
a non-existent set attribute should be allowed (and do nothing), not fail
as it does now.

This patch also includes tests for both bugs.
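
The two rules above can be sketched as follows. This is a simplified Python
model over a plain dict, not Alternator's code; the function name
delete_from_set is hypothetical.

```python
# Simplified model: an item is a dict, a set attribute is a Python set.
def delete_from_set(item: dict, attr: str, to_remove: set) -> None:
    current = item.get(attr)
    if current is None:
        # Deleting from a non-existent set attribute is allowed and does nothing.
        return
    remaining = current - to_remove
    if remaining:
        item[attr] = remaining
    else:
        # Empty sets are not supported, so the attribute itself is removed.
        del item[attr]
```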

Fixes #5895

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200225125343.31629-1-nyh@scylladb.com>
2020-02-26 08:12:19 +01:00
Nadav Har'El
acb7f45ca7 alternator-test: add tests for UpdateItem's AttributeUpdates DELETE and ADD
We have not yet implemented the DELETE-with-value and ADD operations in
UpdateItem's old-style "AttributeUpdates" parameter - see issue #5864
and issue #5893, respectively

This patch includes comprehensive tests for both features. The new tests
pass on DynamoDB, but currently xfail on Alternator - until these
features are implemented.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200225105546.25651-1-nyh@scylladb.com>
2020-02-26 08:12:10 +01:00
Botond Dénes
ea08d7a0df scylla-gdb.py: make get_text_range() more reliable
Currently `get_text_range()` uses heuristics about which ELF section
actually contains the text for the main executable. It appears that this
fails from time to time and we have to adjust the heuristics.
We don't really have to guess, however: a much better method of
determining the section hosting text is to find a vtable pointer and
locate the section it resides in. For this, we use the
`reactor::_backend` as a canary. When this is not available, we fall
back to the pre-existing heuristics.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200225164719.114500-1-bdenes@scylladb.com>
2020-02-25 19:02:26 +01:00
Calle Wilund
a3a764fd10 cdc: Handle non-atomic columns
Fixes #5669

This implements non-atomic collection and UDT handling for
both cdc preimage + delta.

To be able to express deltas in a meaningful way (and reconstruct
using it), non-atomic values are represented somewhat
differently from regular values:

* maps - stored as is (frozen)
* sets - stored as is (frozen)
* lists - stored as map<timeuuid, value> (frozen)
  this allows reconstructing the list, as otherwise
  things like list[0] = value cannot be represented
  in a meaningful way
* udt - stored as tuple<tuple<field0>, tuple<field1>...> (frozen)
  UDTs are normally just tuples + metadata, but we need to
  distinguish the case of outer tuple element == null, meaning
  "no info/does not partake in mutation" from tuple element
  being a tuple(null) (i.e. empty tuple), meaning "set field to
  null"
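
A minimal Python sketch of the list and UDT encodings described above. The
helper names are hypothetical, and plain integers stand in for timeuuid keys;
this models the idea only, not the CDC log format itself.

```python
# Hypothetical model of the encodings; integers stand in for timeuuid keys.
from typing import Any, Dict, List, Optional, Sequence, Tuple


def rebuild_list(cells: Dict[int, Any]) -> List[Any]:
    """A list stored as map<timeuuid, value> is rebuilt by ordering on the key."""
    return [cells[key] for key in sorted(cells)]


def apply_udt_delta(current: Dict[str, Any],
                    field_names: Sequence[str],
                    delta: Sequence[Optional[Tuple[Any]]]) -> Dict[str, Any]:
    """Apply a UDT delta stored as tuple<tuple<field0>, tuple<field1>...>."""
    result = dict(current)
    for name, cell in zip(field_names, delta):
        if cell is None:
            # Outer element is null: the field does not take part in the mutation.
            continue
        (value,) = cell
        # Inner tuple present: set the field, possibly to null (None).
        result[name] = value
    return result
```

For example, a delta of `(None, (None,))` leaves the first field untouched and
sets the second field to null, which a flat tuple could not express.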
2020-02-25 19:34:54 +02:00
Avi Kivity
d17ebde46b Update seastar submodule
* seastar 8b6bc659c7...7a3b4b4e4e (3):
  > Merge "Add custom stack size to seastar threads" from Piotr
Ref #5742.
  > expiring_fifo: Optimize memory usage for single-element lists
Ref #4235.
  > Close connection, when reach to max retransmits
2020-02-25 18:02:25 +02:00
Pavel Emelyanov
7363d56946 sstables: Move get_highest_supported_format
The global get_highest_supported_format helper and its declaration
are scattered all over the code, so clean this up and prepare the
ground for moving _sstables_format from the storage_service onto
the sstables_manager (not this set).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-25 14:31:45 +03:00
Pavel Emelyanov
792cec39df sstables: Remove global get_config() helper
Finally, the thing is not used by anyone and can be removed.
This greatly relaxes the sstables -> storage_service dependency.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Applauded-by: Benny Halevy <bhalevy@scylladb.com>
2020-02-25 14:31:45 +03:00
Pavel Emelyanov
1af065296e sstables: Use manager's config() in .new_sstable_component_file()
This is the last place left that calls for global get_config(),
switch it onto _sst_manager.config().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-25 14:31:43 +03:00
Pavel Emelyanov
5dea657991 sstable_writer_config: Extend with more db::config stuff
The enable_sstable_key_validation and summary_bytes_cost are used
in sstables writing code, keeping them on sstable_writer_config
removes more calls to global get_config().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-25 14:31:34 +03:00
Pavel Emelyanov
85d9326d70 sstables_manager: Don't use global helper to generate writer config
The main goal of this patch is to stop using the global get_config()
when creating the sstable_writer_config instance.

Other than being global the existing get_config() is also confusing
as it effectively generates 3 (three) sorts of configs -- one for
scylla, when db config and features are ready, the other one for
tests, when no storage service is at hand, and the third one for
tests as well, when the storage service is created by test env
(likely intentionally, but maybe by coincidence the resulting config
is the same as for no-storage-service case).

With this patch it's now 100% clear which one is used when. Also
this makes half the work of removing get_config() helper.

The db::config and feature_service used to initialize the managers
are referenced by the database, which creates and owns the managers,
so the references are safe.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-25 14:31:04 +03:00
Pavel Emelyanov
3a603729d4 sstable_writer_config: Sanitize out some features fields initialization
Similar to previous patch -- initialize config fields from features
in configurator, not in default initializers.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-25 14:31:04 +03:00
Pavel Emelyanov
34302a3e1c sstable_writer_config: Factor out some field initialization
The promoted_index_block_size is taken from db config in two places.
Factor this out and, at the same time, stop keeping it as std::optional.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-25 14:31:04 +03:00
Pavel Emelyanov
5adce3390c sstables: Generate writer config via manager only
The sstable_writer_config creation looks simple (just declare
the struct instance) but behind the scenes references storage
and feature services, messes with database config, etc.

This patch teaches the sstables_manager to generate the writer
config and makes the rest of the code use it. For future
safety, manual creation of the sstable_writer_config is
prohibited.

The manager is referenced through table-s and sstable-s, but
the two existing sstables_managers live on the database object, and
table-s and sstable-s both live shorter than the database, so
this reference is safe.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-25 14:31:04 +03:00
Pavel Emelyanov
f289da1e3b sstables: Keep reference on manager
This is needed for further patching. The sstables_manager outlives
all sstables objects, so it's safe.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-25 14:31:03 +03:00
Pavel Emelyanov
e73e923e95 test: Re-use existing global sstables_manager
The sstables_manager in scylla binary outlives the sstables objects
created by it, this makes it possible to add sstable->manager reference
and use it. In unit tests there are cases when sstables::test_env that
keeps manager in _mgr field is destroyed right after sstable creation
(e.g. -- in the boost/sstable_mutation_test.cc ka_sst() helper).

Fix this by changing _mgr to be a reference to the manager and
initializing it with the already existing global manager. The few exceptions
to this rule, which need to set their own large data handler, will create
their own sstables_manager.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-25 13:54:41 +03:00
Pavel Emelyanov
961f1642c7 table: Pass sstable_writer_config into write_memtable_to_sstable
The latter creates the config by hand, but the plan is to
create it via sstables_manager. Callers of this helper are the
final frontiers where the manager will be safely accessible.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-25 13:54:40 +03:00
Asias He
aaa1f3ce7b docs: Add doc for repair_based_node_ops
This patch adds a doc for the repair based node operations.
2020-02-25 08:54:35 +08:00
Asias He
ac90c1c184 storage_service: Enable node repair based ops for bootstrap
- Bootstrap operation

It is used to add a new node into the cluster. The token ring changes.
It does not suffer from the "not the latest replica" issue: the new node
pulls data from the existing nodes that are losing the token range.

It suffers from failed streaming. We split the ranges into 10 groups and
stream one group at a time. A failed group is restreamed, causing
unnecessary data transmission on the wire.

Bootstrap is not resumable. If it fails after 99.99% of the data is
streamed and we restart the node, we need to stream all the data again
even though the node already has 99.99% of the data.

Fixes: #3003
Fixes: #4208
Tests: update_cluster_layout_tests.py + replace_address_test.py + manual test
2020-02-25 08:54:33 +08:00
Asias He
62f056c022 storage_service: Enable node repair based ops for decommission
- Decommission operation

It is used to remove a live node from the cluster. The token ring
changes. It does not suffer from the "not the latest replica" issue. The leaving
node pushes data to existing nodes.

Fixes: #3003
Fixes: #4208
Tests: update_cluster_layout_tests.py + replace_address_test.py + manual test
2020-02-25 08:53:37 +08:00
Asias He
a38916121c storage_service: Enable node repair based ops for replace
- Replace operation

It is used to replace a dead node. The token ring does not change. It
pulls data from only one of the replicas which might not be the
latest copy.

Fixes: #3003
Fixes: #4208
Tests: update_cluster_layout_tests.py + replace_address_test.py + manual test
2020-02-25 08:53:36 +08:00
Glauber Costa
628dd16519 compaction: deprecate DTCS. Step 1.
This patch adds a warning of deprecation to DTCS. In a follow up step,
we will start requiring a flag for it to be enabled to make sure users
notice.

For now we'll just be nice and add a warning for the log watchers.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20200224164405.9656-1-glauber@scylladb.com>
2020-02-24 20:26:24 +02:00
Takuya ASADA
5a7beef6a0 dist/common/scripts/scylla_coredump_setup: don't create /etc/sysctl.d/99-scylla-coredump.conf on CentOS8
We don't need to create 99-scylla-coredump.conf on CentOS8, the file is only
needed for CentOS7.

Fixes #5818
2020-02-24 17:38:47 +02:00
Takuya ASADA
fa423e25d4 scylla_setup: shows up usage when --nic is not specified & eth0 is not available
Since we set 'eth0' as the default NIC name, we get the following error when running scylla_setup in non-interactive mode without the --nic parameter:

$ sudo scylla_setup --setup-nic-and-disks --no-raid-setup --no-verify-package --no-io-setup
NIC eth0 doesn't exist.

This looks strange since the user did not actually specify 'eth0'; they may have forgotten to specify --nic.
I think we should show the usage when eth0 is not available on the system.

Fixes #5828
2020-02-24 17:35:40 +02:00
Piotr Dulikowski
41d82e39ea storage proxy: rename mutate_hint_from_scratch
Changes the name of storage_proxy::mutate_hint_from_scratch function to
another name, whose meaning is more clear: send_hint_to_all_replicas.

Tests: unit(dev)
2020-02-24 17:30:22 +02:00
Takuya ASADA
29285b28e2 dist/debian: fix "unable to open node-exporter.service.dpkg-new" error
It seems that *.service conflicts at install time because the file is
installed twice, via both debian/*.service and debian/scylla-server.install.

We don't need to use *.install, so we can just drop the line.

Fixes #5640
2020-02-24 17:28:14 +02:00
Juliusz Stasiewicz
127e258ade cql3: Fix missing aggregate functions for counters
Aggregate functions on counters do not exist. Until now counters
could, at best, fall back to blob->blob overloads, e.g.:
```
cqlsh> select max(cnt) from ks.tbl;

 system.max(cnt)
----------------------
   0x000000000000000a
(1 rows)
cqlsh> select sum(entities) from ks.tbl;
InvalidRequest: Error from server: code=2200 [Invalid query]
message="Invalid call to function sum, none of its type signatures match
[...]
```
Meanwhile, counters are compatible with bigints (aka. `long_type'),
so bigint overloads can be used on them (e.g. sum(bigint)->bigint).
This is achieved here by a special rule in overload resolution, which
makes `selector' perceive counters as an `EXACT_MATCH' to counter's
underlying type (`long_type', aka. bigint).
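
The special rule can be modelled with a small Python sketch. The overload
tables and the resolve() helper are hypothetical and only illustrate the idea
of mapping a counter to its underlying type before matching; they are not the
cql3 selector code.

```python
# Hypothetical model of overload resolution; the tables below are illustrative only.
UNDERLYING_TYPE = {"counter": "bigint"}

# function name -> {argument type: return type}
OVERLOADS = {
    "sum": {"int": "int", "bigint": "bigint"},
    "max": {"int": "int", "bigint": "bigint", "blob": "blob"},
}


def resolve(function: str, arg_type: str) -> str:
    """Return the chosen signature, treating a counter as an exact match for bigint."""
    effective = UNDERLYING_TYPE.get(arg_type, arg_type)
    overloads = OVERLOADS[function]
    if effective not in overloads:
        raise LookupError(f"no overload of {function} accepts {arg_type}")
    return f"{function}({effective}) -> {overloads[effective]}"
```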
2020-02-24 17:14:44 +02:00
Juliusz Stasiewicz
0ea17216fe atomic_cell: special rule for printing counter cells
Until now, attempts to print counter update cell would end up
calling abort() because `atomic_cell_view::value()` has no
specialized visitor for `imr::pod<int64_t>::basic_view<is_mutable>`,
i.e. counter update IMR type. Such visitor is not easy to write
if we want to intercept counters only (and not all int64_t values).

Anyway, linearized byte representation of counter cell would not
be helpful without knowing if it consists of counter shards or
counter update (delta) - and this must be known upon `deserialize`.

This commit introduces simple approach: it determines cell type on
high level (from `atomic_cell_view`) and prints counter contents by
`counter_cell_view` or `atomic_cell_view::counter_update_value()`.

Fixes #5616
2020-02-24 17:11:34 +02:00
Benny Halevy
25a763a187 dist/redhat: scylla.spec.mustache: set _no_recompute_build_ids
By default, `/usr/lib/rpm/find-debuginfo.sh` will tamper with
the binary's build-id when stripping its debug info as it is passed
the `--build-id-seed <version>.<release>` option.

To prevent that we need to set the following macros as follows:
  unset `_unique_build_ids`
  set `_no_recompute_build_ids` to 1

Fixes #5881

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-02-24 11:50:20 +02:00
Nadav Har'El
4b7577e429 alternator-test: correct typo "existant"
The official documentation language of Scylla is English, not French.
So correct the word "existant", which appeared several times throughout
Alternator's tests, to "existent".

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200221224221.31237-6-nyh@scylladb.com>
2020-02-24 10:40:53 +01:00
Nadav Har'El
e075eff915 alternator: complete implementation of ReturnValues parameter
This patch completes the support for the ReturnValues parameter for
the UpdateItem operation. This parameter has five settings - NONE, ALL_OLD,
ALL_NEW, UPDATED_OLD and UPDATED_NEW. Before this patch we already
supported NONE and ALL_OLD - and this patch completes the support for the
three remaining modes: ALL_NEW, UPDATED_OLD and UPDATED_NEW.

The patch also continues to improve test_returnvalues.py with additional
corner cases discovered during the development. After this patch, only
one xfailing test remains - testing updates to nested document paths,
which we do not yet support (even without the ReturnValues parameter).

After this patch, the support of ReturnValues is complete - for all
operations (UpdateItem, PutItem and DeleteItem) and all of its possible
settings.
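
The five settings can be summarised with a short Python sketch over plain
dicts. The function and parameter names are hypothetical; the sketch only
models which attributes each mode returns, not how Alternator reads the old
item.

```python
# Hypothetical model: old/new are the item before and after the update,
# updated_attrs names the attributes that the update touched.
from typing import Any, Dict, Iterable, Optional


def return_values(mode: str,
                  old: Dict[str, Any],
                  new: Dict[str, Any],
                  updated_attrs: Iterable[str]) -> Optional[Dict[str, Any]]:
    if mode == "NONE":
        return None
    if mode == "ALL_OLD":
        return dict(old)
    if mode == "ALL_NEW":
        return dict(new)
    if mode == "UPDATED_OLD":
        # Only the touched attributes, as they were before the update.
        return {k: old[k] for k in updated_attrs if k in old}
    if mode == "UPDATED_NEW":
        # Only the touched attributes, as they are after the update.
        return {k: new[k] for k in updated_attrs if k in new}
    raise ValueError(f"unknown ReturnValues setting: {mode}")
```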

Fixes #5053

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200221224221.31237-5-nyh@scylladb.com>
2020-02-24 10:40:53 +01:00
Nadav Har'El
1e500a2a34 alternator: rjson: another variant of set_with_string_name() utility
The rjson::set_with_string_name() utility function copies the given
string into the JSON key. The existing implementation required that this
input string be an std::string&, but a std::string_view would be fine too,
and I want to use it in new code to avoid yet another unnecessary copy.

Adding the overloads also exposes a few places where things were
implicitly converted to std::string and now cause an ambiguity - and
clearing up this ambiguity also allowed me to find places where this
conversion was unnecessary.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200221224221.31237-4-nyh@scylladb.com>
2020-02-24 10:38:54 +01:00
Nadav Har'El
fa5c2a4f58 alternator: UpdateItem only deleting attribute shouldn't create item
UpdateItem operations usually need to add a row marker:

 * An empty UpdateItem is supposed to create a new empty item (row).
   Such an empty item needs to have a row marker.

 * An UpdateItem to add an attribute x and then later an UpdateItem
   to remove this attribute x should leave an empty item behind.
   This means the first UpdateItem needed to add a row marker, so
   it will be left behind after the second UpdateItem.

So the existing code always added a row marker in UpdateItem.

However, there is one case where we should NOT create the row marker:
When the UpdateItem operation only has attribute deletions, and nothing
else, and it is applied to a key with no pre-existing item, DynamoDB
does not create this item. So neither should we.

This patch includes a new test for this test_update_item_non_existent,
which passes on DynamoDB, failed on Alternator before this patch, and
passes after the patch.
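
The rule above can be sketched in Python as follows. The representation of an
update as a list of (kind, attribute) pairs and the name needs_row_marker are
assumptions made for illustration; the sketch does not consult whether the
item already exists, since adding a marker to an existing item changes
nothing.

```python
# Hypothetical model: an update is a list of ("put" | "delete", attribute) pairs.
from typing import List, Tuple


def needs_row_marker(actions: List[Tuple[str, str]]) -> bool:
    if not actions:
        # An empty UpdateItem creates an empty item, which needs a marker.
        return True
    # An update made only of deletions must not create the item.
    return any(kind != "delete" for kind, _ in actions)
```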

Fixes #5862.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200221224221.31237-3-nyh@scylladb.com>
2020-02-24 10:38:10 +01:00
Nadav Har'El
3cde949980 alternator-test: test for BatchWriteItem same key in two tables
In issue #5698 I raised a theory that we might have a bug when
BatchWriteItem is given two writes to the *same* key but in two different
tables. The test added here verifies that this theory was wrong, and
this case already works correctly.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200221224221.31237-2-nyh@scylladb.com>
2020-02-24 10:37:23 +01:00
Piotr Sarna
5e07c00eeb Merge 'Delete table snapshot' from Amnon
This series adds an option to the API that supports deleting
a specific table from a snapshot.
The implementation works in a similar way to the option
to specify specific keyspaces when deleting a snapshot.
The motivation is to allow reducing disk-space when using
the snapshot for backup. A dtest PR is sent to the dtest
repository.

Fixes #5658

Original PR #5805

Tests: (database_test) (dtest snapshot_test.py:TestSnapshot.test_cleaning_snapshot_by_cf)

* amnonh/delete_table_snapshot:
  test/boost/database_test: adopt new clear_snapshot signature
  api/storage_service: Support specifying a table when deleting a snapshot
  storage_service: Add optional table name to clear snapshot
2020-02-24 09:38:57 +01:00
Pekka Enberg
263261fa15 README: Remove out-of-date package build instructions
The package build instructions in README.md are out-of-date so let's
remove them.

Message-Id: <20200224064632.3285-1-penberg@scylladb.com>
2020-02-24 10:25:07 +02:00
Pekka Enberg
684e4602dc redis: Fix DB index error message
The error message (silently) changed to "DB index is out of range" in the
following commit:

 c7a4e694ad

The new error message is part of Redis 4.0, released in 2017, so let's
switch Scylla to use the new one.

Message-Id: <20200211133946.746-1-penberg@scylladb.com>
2020-02-24 10:22:27 +02:00
Pavel Emelyanov
60bdf0685c cql3: Clean cql3/ from remaining storage_service mentionings
These are several #include-s and the no longer valid comment.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-24 11:17:47 +03:00
Pavel Emelyanov
d639d4ed5f cql3: Parse cf name in drop_index_satement::validate
The patch 759752947b explains why the .column_family method
of this statement implementation must be tuned to calculate
the column_family in some cases. However, to do this the global
storage_proxy is needed.

The proposal is to calculate the column_family in the .validate
method, as is done e.g. for function_statement-s, which
has a storage_proxy reference at hand.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-24 11:17:47 +03:00
Pavel Emelyanov
a0a0d40267 cql3: Use proxy arg in batch_statement::verify_batch_size
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-24 11:17:47 +03:00
Pavel Emelyanov
bf7004326e cql3: Use proxy arg in drop_index_statement::lookup_indexed_table
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-24 11:17:47 +03:00
Pavel Emelyanov
9bb67b5771 cql3: Don't get global storage_proxy
Get rid of numerous calls to get_local_storage_proxy().get_db()
and use the storage proxy argument that's already available in
most of them.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-24 11:17:47 +03:00
Pavel Emelyanov
6892dbdde7 cql3: Add storage_proxy argument to .check_access method
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-24 11:17:19 +03:00
Asias He
f4b4192c91 storage_service: Enable node repair based ops for removenode
- Removenode operation

It is used to remove a dead node from the cluster. Existing nodes
pull data from other existing nodes for the new ranges they own. They
pull from one of the replicas, which might not be the latest copy.

Fixes: #3003
Fixes: #4208
Tests: update_cluster_layout_tests.py + replace_address_test.py + manual test
2020-02-24 11:11:41 +08:00
Asias He
cf0601735e storage_service: Enable node repair based ops for rebuild
- Rebuild operation

It is used to get all the data this node owns from other nodes. It
pulls data from only one of the replicas which might not be the
latest copy.

Fixes: #3003
Fixes: #4208
Tests: update_cluster_layout_tests.py + replace_address_test.py + manual test
2020-02-24 11:11:41 +08:00
Asias He
3b64b4bb17 storage_service: Use the same tokens as previous bootstrap
With repair based node operations, we can resume a previously failed
bootstrap. In order to do that, the bootstrapping node needs to use the same
tokens as the previous bootstrap.

Currently, we always use new tokens when we bootstrap, because we need
to stream all the ranges anyway. It does not matter if we use the same
tokens or not.
2020-02-24 11:11:41 +08:00
Asias He
a4c614914a storage_service: Add is_repair_based_node_ops_enabled helper
It is used to check if repair based node operations are enabled or not.
2020-02-24 11:11:40 +08:00
Asias He
cb4045e11d config: Add enable_repair_based_node_ops
An option to enable the repair based node operations.
2020-02-24 11:11:40 +08:00
Asias He
1672f64add repair: Add replace_with_repair
It is used to replace a dead node using repair instead of using
stream_plan.
2020-02-24 11:11:40 +08:00
Asias He
960ce7ab54 repair: Add rebuild_with_repair
It is used to rebuild a node using repair instead of using
stream_plan.
2020-02-24 11:11:40 +08:00
Asias He
b488ab7d11 repair: Add do_rebuild_replace_with_repair
The rebuild and replace operations are similar because the token ring
does not change for both of them. Add a common helper to do rebuild and
replace with repair. It will be used by rebuild and replace operation
shortly.
2020-02-24 11:11:40 +08:00
Asias He
b18e078ca2 repair: Add removenode_with_repair
It is used to remove a dead node from a cluster using repair instead of
using stream_plan.
2020-02-24 11:11:40 +08:00
Asias He
e9a9fde1f7 repair: Add decommission_with_repair
It is used to decommission a node using repair instead of using
stream_plan.
2020-02-24 11:11:40 +08:00
Asias He
569c126a84 repair: Add do_decommission_removenode_with_repair
It will be used by decommission and removenode operation shortly.
2020-02-24 11:11:40 +08:00
Asias He
9c67389cc8 repair: Add bootstrap_with_repair
It is used to bootstrap a node using repair instead of using
stream_plan.
2020-02-24 11:11:40 +08:00
Asias He
198cad6179 repair: Introduce sync_data_using_repair
It is used to sync data for node operations like bootstrap, decommission
and so on.

Unlike a plain repair operation, the user of sync_data_using_repair() can
pass a repair_neighbors object to specify the pre-calculated neighbors for
a range. If a mandatory neighbor is not available, the repair will fail
so that the upper layer can fail the node operation.
2020-02-24 11:11:40 +08:00
Asias He
1038e375af repair: Propagate exception in tracker::run
In sync_data_using_repair, we depend on the future returned by tracker::run to
tell whether the repair was successful.
2020-02-24 11:11:40 +08:00
Piotr Sarna
14dfa3c0c3 alternator: change keyspace prefix to alternator_
The original idea of prefixing alternator keyspace names with 'a#'
leveraged the fact that '#' is not a legal CQL character for keyspace
names. The idea is flawed though, since '#' proved to confuse
existing Scylla tools (e.g. nodetool).
Thus, the prefix is changed to more orthodox 'alternator_'.
It is possible to create such keyspaces with CQL as well, but then
the alternator CreateTable request would simply fail, because
the keyspace already exists, which is graceful enough.
Hiding alternator keyspaces and tables from CQL is another issue,
but there are other ways to distinguish them than a non-standard
prefix, e.g. tags.

Fixes #5883
2020-02-23 23:32:29 +02:00
Pavel Emelyanov
049b549fdc api: Register /v2/config stuff after database is started
The set_config registers lambdas that need db.local(), so
these routes must be registered after database is started.

Fixes: #5849
Tests: unit(dev), manual wget on API

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200219130654.24259-1-xemul@scylladb.com>
2020-02-23 17:09:03 +02:00
Takuya ASADA
3d1154272f dist/debian: remove unused dependencies
Since we moved to the relocatable package, almost all dependencies are no longer needed.
2020-02-23 15:36:13 +02:00
Takuya ASADA
98c182ec67 dist/redhat: align dependencies with debian
On Debian, we don't add xfsprogs/mdadm as package dependencies; the
scylla_raid_setup script installs them instead.
Since xfsprogs/mdadm are only needed for constructing RAID, we can move
the dependencies to scylla_raid_setup here too.
2020-02-23 15:34:35 +02:00
Piotr Sarna
4ad577b40c alternator: add content length limit to alternator servers
This patch adds a 16MB content length limit to alternator
HTTP(S) servers. It also comes with a test, which verifies
that larger requests are refused.

Fixes #5832

Tests: alternator-test(local,remote)

Message-Id: <29d5708f4bf9f41883d33d21b9cca72b05170e6c.1582285070.git.sarna@scylladb.com>
2020-02-23 14:34:20 +02:00
Piotr Sarna
085cd857ab alternator-test: limit the number of retries to 3
In order to decrease the developer's time spent on waiting
for boto3 to retry the request many times, the retry count
is configured to be 3.
Two major benefits:
 - vastly decrease wait time when debugging a failing test
 - for requests which are expected to fail, but return results
   not compatible with boto3, execution time is decreased

Tests: alternator-test(local,remote)

Message-Id: <46a3a9344d9427df7ea55c855f32b8f0e39c9b79.1582285070.git.sarna@scylladb.com>
2020-02-23 14:19:38 +02:00
Pavel Emelyanov
f4e789a9c2 range_streamer: Fix off-by-size in stream progress log
The nr_ranges_streamed denotes the number of ranges streamed
so far, but by the time the sending lambda is called this
counter is already incremented by the number of ranges to be
streamed in this call. And the variable is not used for
anything else but logging.

Fix this by swapping logging with incrementing.
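
The fixed ordering can be sketched in Python. The function and its parameters
are hypothetical stand-ins for the range_streamer code; only the order of
logging and incrementing matters here.

```python
# Hypothetical stand-in for the streaming loop; only the ordering is the point.
from typing import Callable, List, Sequence


def stream_in_batches(ranges: Sequence[int],
                      batch_size: int,
                      log: Callable[[str], None]) -> int:
    nr_ranges_streamed = 0
    total = len(ranges)
    for start in range(0, total, batch_size):
        batch = ranges[start:start + batch_size]
        # Log first, so the counter still means "streamed so far" ...
        log(f"streaming {len(batch)} ranges, {nr_ranges_streamed}/{total} done so far")
        # ... and only then account for the batch being sent in this call.
        nr_ranges_streamed += len(batch)
    return nr_ranges_streamed
```

With the two statements in the opposite order, the first log line for five
ranges in batches of two would already claim 2/5 instead of 0/5.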

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200221101601.18779-1-xemul@scylladb.com>
2020-02-23 11:20:17 +02:00
Tomasz Grabiec
3e83d30daf gdb: scylla sstables: Fix for older versions of GDB
Some GDB versions complain about subscript being a gdb.Value

Reviewed-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <1582308177-24893-1-git-send-email-tgrabiec@scylladb.com>
2020-02-23 11:17:20 +02:00
Tomasz Grabiec
e7dece7f1e gdb: scylla sstables: Allow locating sstables attached to tables
This patch adds an alternative way to locate sstables by looking at
sstable sets in table objects:

  scylla sstables -t

This may be useful for several things. One is to identify sstables
which are not attached to tables.

Another use case is to be able to use the command on older versions of
scylla which don't have sstable tracking.

Message-Id: <1582308099-24563-1-git-send-email-tgrabiec@scylladb.com>
2020-02-23 11:16:20 +02:00
Piotr Sarna
e1ecd0d637 doc: refer to dev build mode instead of release
The paragraph about adding the `Tests:` footer implies that it's preferred
to run tests in release mode, while dev is equally good and compiles
faster.

Message-Id: <9e1ad1a4e1529d30abb3adb1923b007c52ccf955.1582282066.git.sarna@scylladb.com>
2020-02-23 11:11:44 +02:00
Rafael Ávila de Espíndola
fc018a73bb build: Add the --enable-stack-guards and --disable-stack-guards options
If neither is used, we get the default behavior: only release is built
without stack guards.

With --disable-stack-guards all modes are built without stack guards.

With --enable-stack-guards all modes are built with stack guards.
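
The three cases can be sketched with argparse. The flag names come from this
commit; the `stack_guards` destination and the two helper functions are
assumptions made for illustration, not the actual build script.

```python
# Sketch only: the dest name and helper functions are hypothetical.
import argparse
from typing import List, Optional


def parse_stack_guards(argv: List[str]) -> Optional[bool]:
    parser = argparse.ArgumentParser()
    parser.add_argument("--enable-stack-guards", dest="stack_guards",
                        action="store_const", const=True, default=None)
    parser.add_argument("--disable-stack-guards", dest="stack_guards",
                        action="store_const", const=False, default=None)
    return parser.parse_args(argv).stack_guards


def use_stack_guards(mode: str, stack_guards: Optional[bool]) -> bool:
    if stack_guards is None:
        # Default: every mode except release is built with stack guards.
        return mode != "release"
    return stack_guards
```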

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200222012732.992380-1-espindola@scylladb.com>
2020-02-23 11:05:13 +02:00
Avi Kivity
197adf4c0d Update seastar submodule
* seastar cdda3051e3...8b6bc659c7 (2):
  > core/file-types.hh: Fix missing header
  > cmake: Add a Seastar_STACK_GUARDS cmake option
2020-02-23 11:03:59 +02:00
Tomasz Grabiec
3a4597f8f3 Merge remote-tracking branch 'xemul/br-repair-remove-storage-service' into next 2020-02-23 10:29:34 +02:00
Pavel Emelyanov
897bbeabea storage_service: Relax _is_bootstrap_mode
The variable in question was used to check that the bootstrap mode
finishes correctly, but it was removed, because this check was for
self-evident code and thus useless (dbca327b)

Later, the patch was reverted to keep track of the bootstrap mode for
the API is_cleanup_allowed call (a39c8d0e)

This patch is a reworked combination of both -- the variable is
kept for API sake, but in a much simpler manner.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200221101813.18945-1-xemul@scylladb.com>
2020-02-23 10:26:50 +02:00
Pavel Emelyanov
a364190700 storage_service: Remove if-0-ed-out Java code
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200221101704.18868-1-xemul@scylladb.com>
2020-02-23 10:26:50 +02:00
Pavel Emelyanov
38143a76c7 main: Register stop_gossiping earlier
The _scheduled_gossip_task timer needs token_metadata and thus should
be stopped before. However, this is not always the case.

The timer is armed in start_gossiping, which is called by storage_service
init_server_without_the_messaging_service_part, and is canceled inside
stop_gossiping, which in turn is called by drain_on_shutdown, which in
turn is registered too late.

If something fails between the internals of the init_server_... and
deferred registration of drain_on_shutdown (lots of reasons) the timer is
not stopped and may run, thus accessing the freed token_metadata.

Bandaid this by scheduling stop_gossiping right after the gossiper
instances are created. This can be too early (before storage_service
starts gossiping) or too late (after drain_on_shutdown stops it), but
this function is re-entrant.
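
The ordering argument can be modelled in Python with contextlib.ExitStack
playing the role of the deferred actions in main. The Gossiper class and the
boot() function are hypothetical; they only show why registering the stop
early is safe when the stop can be called more than once.

```python
# Hypothetical model: ExitStack stands in for the deferred actions in main.
from contextlib import ExitStack


class Gossiper:
    def __init__(self) -> None:
        self.timer_armed = False
        self.stop_calls = 0

    def start_gossiping(self) -> None:
        self.timer_armed = True

    def stop_gossiping(self) -> None:
        # Safe to call before the timer is armed and safe to call twice.
        self.stop_calls += 1
        self.timer_armed = False


def boot(gossiper: Gossiper, fail_after_start: bool) -> None:
    with ExitStack() as deferred:
        # Registered right after the gossiper is created.
        deferred.callback(gossiper.stop_gossiping)
        gossiper.start_gossiping()
        if fail_after_start:
            raise RuntimeError("init failed before drain_on_shutdown was registered")
        gossiper.stop_gossiping()  # stands in for drain_on_shutdown
```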

Fixes #5844

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200221085226.16494-1-xemul@scylladb.com>
2020-02-23 10:26:50 +02:00
Pavel Emelyanov
72a6d38e6c storage_service: Merge identical branches
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200210185011.25244-1-xemul@scylladb.com>
2020-02-23 10:26:49 +02:00
Piotr Sarna
dae86849a2 Update seastar submodule
* seastar 2b510220...cdda3051 (10):
  > core: discard unused variable / function
  > pollable_fd: use boost::intrusive_ptr rather than std::unique_ptr for lifecycle management
  > build: check for pthread_setname_np()
  > build: link against Threads::Threads
  > future: Avoid recursion in do_for_each
  > future: Expand description of parallel_for_each
  > merge: Add content length limit to httpd
  > tests/scheduling_group_test: verify current scheduling group is inherited as expected
  > net: return future<> instead of subscription<>
  > cmake: be more verbose when looking for libraries
2020-02-23 10:26:49 +02:00
guy9
a7586c6f7d added training section to readme file 2020-02-21 11:36:18 +01:00
Nadav Har'El
e8cbbba653 alternator: partial implementation of ReturnValues parameter
Before this patch, we only supported the ReturnValues=NONE setting of the
PutItem, UpdateItem and DeleteItem operations.

This patch also adds full support for the ReturnValues=ALL_OLD option
in all three operations. This option directs Alternator to return the full
old (i.e., pre-modification) contents of the item.

We implement this as a RMW (read-modify-write) operation just as we do
other RMW operations - i.e., by default we use LWT, to ensure that we really
return the value of the item directly before the modification, the same
value that would have been used in a conditional expression if there was one.

NOTE: This implementation means one cannot use ReturnValues=ALL_OLD in
forbid_rmw write isolation mode. One may theorize that if we only need the
read-before-write for ReturnValues and not for a conditional expression,
it should have been enough to use a separate read (as we do in unsafe_rmw
isolation mode) before the write. But we don't have this "optimization" yet
and I'm not sure it's a valid optimization at all - see discussion in
a new issue #5851.

This patch completes the ReturnValues support for the PutItem and DeleteItem
operations. However, the third operation, UpdateItem, supports three more
ReturnValues modes: UPDATED_OLD, ALL_NEW and UPDATED_NEW. We do not yet
support those in this patch. If a user tries to use one of these three modes,
an informative error message will be returned. The three tests for these
three unimplemented settings continue to xfail, but the rest of the tests
in test_returnvalues.py (except one test of nested attribute paths) now
pass so their xfail flag is dropped.

Refs #5053

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200219135658.7158-1-nyh@scylladb.com>
2020-02-21 08:32:47 +01:00
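The ReturnValues=ALL_OLD behaviour described in the commit above can be illustrated with a toy in-memory model. This is only a sketch under assumptions: `put_item` and the dict-backed `table` are hypothetical stand-ins, not Alternator code, and the sketch has none of the LWT-based isolation the commit relies on.

```python
def put_item(table, key, item, return_values="NONE"):
    """Toy model of PutItem: overwrite the item and optionally return the old one.

    `table` is a plain dict standing in for a table (hypothetical helper).
    """
    old = table.get(key)          # read the pre-modification contents
    table[key] = dict(item)       # then write the new item
    if return_values == "NONE":
        return {}
    if return_values == "ALL_OLD":
        # The old item is returned only if one existed before the write.
        return {"Attributes": old} if old is not None else {}
    raise ValueError(f"unsupported ReturnValues mode: {return_values}")


table = {}
first = put_item(table, "p1", {"a": 1}, return_values="ALL_OLD")
second = put_item(table, "p1", {"a": 2}, return_values="ALL_OLD")
```

In a real server the read and the write must be atomic with respect to concurrent writers, which is why the commit performs them as a single read-modify-write operation.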
Tomasz Grabiec
d0b6be0820 Merge "Don't return stale data by properly invalidating row cache after cleanup" from Raphael
Row cache needs to be invalidated whenever data in sstables
changes. Cleanup removes data from sstables which doesn't belong to
the node anymore, which means cache must be invalidated on cleanup.
Currently, stale data can be returned when a node re-owns ranges whose
data is still stored in the node's row cache, because cleanup didn't
invalidate the cache.

Fixes #4446.

tests:
- unit tests (dev mode)
- dtests:
    update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_decommission_node_2_test
    cleanup_test.py
2020-02-20 18:20:56 +01:00
Pavel Solodovnikov
8efb02146f cql3: const cleanups and API de-pointerization
* Pass raw::select_statement::parameters as lw_shared_ptr
* Some more const cleanups here and there
* lists,maps,sets::equals now accept const-ref to *_type_impl
  instead of shared_ptr
* Remove unused `get_column_for_condition` from modification_statement.hh
* More methods now accept const-refs instead of shared_ptr

Every call site where a shared_ptr was required as an argument
has been inspected to be sure that no dangling references are
possible.

Tests: unit(dev, debug)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200220153204.279940-1-pa.solodovnikov@scylladb.com>
2020-02-20 18:14:49 +02:00
Gleb Natapov
df2f67626b commitlog: fix size of a write used to zero a segment
Due to a bug, the entire segment is written in one huge write of 32MB.
The idea was to split it into writes of 128K, so fix it.

Fixes #5857

Message-Id: <20200220102939.30769-1-gleb@scylladb.com>
2020-02-20 17:22:21 +02:00
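The intended behaviour of the commit above (zeroing a segment in 128K writes rather than one 32MB write) can be sketched as follows. The helper name `chunk_writes` is hypothetical and the sketch only computes the (offset, length) pairs; it is not the commitlog code.

```python
def chunk_writes(segment_size, chunk_size):
    """Yield (offset, length) pairs that cover segment_size in chunk_size pieces."""
    offset = 0
    while offset < segment_size:
        # The last chunk may be shorter if the sizes do not divide evenly.
        length = min(chunk_size, segment_size - offset)
        yield offset, length
        offset += length


SEGMENT_SIZE = 32 * 1024 * 1024  # 32MB segment, as in the commit message
CHUNK_SIZE = 128 * 1024          # 128K per write, as in the commit message

writes = list(chunk_writes(SEGMENT_SIZE, CHUNK_SIZE))
```

With these sizes the segment is covered by 256 writes instead of a single one.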
Gleb Natapov
6a78cc9e31 commitlog: use commitlog IO scheduling class for segment zeroing
There may be other commitlog writes waiting for zeroing to complete, so
not using the proper scheduling class causes priority inversion.

Fixes #5858.

Message-Id: <20200220102939.30769-2-gleb@scylladb.com>
2020-02-20 17:15:13 +02:00
Raphael S. Carvalho
f93912f344 Revert "Revert "streaming: Do not invalidate cache if no sstable is added in flush_streaming_mutations""
With #4446 fixed, this commit can be reverted.

This reverts commit 454e7e0109.
2020-02-20 10:55:50 -03:00
Raphael S. Carvalho
fb81f2aa7c table: Fix stale data being returned due to lack of cache invalidation
Row cache needs to be invalidated whenever data in sstables changes. Cleanup removes
data from sstables which doesn't belong to the node anymore, which means cache must
be invalidated on cleanup.
Currently, stale data can be returned when a node re-owns ranges whose data is still
stored in the node's row cache, because cleanup didn't invalidate the cache.

To prevent data that belongs to the node from being purged from the row cache, cleanup
will only invalidate the cache with a set of token ranges that will not overlap with
any of the ranges owned by the node.

update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_decommission_node_2_test
now passes.

Fixes #4446.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2020-02-20 10:55:50 -03:00
Raphael S. Carvalho
e81076b01c compaction: Implement ranges for cache invalidation on behalf of cleanup
This procedure will calculate ranges for cache invalidation by subtracting
all owned ranges from the sstables' partition ranges. That's done so as
to reduce the size of invalidated ranges.

Refs #4446.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2020-02-20 10:55:49 -03:00
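The range calculation described in the commit above (sstable partition ranges minus owned ranges) can be sketched with plain integers. This is a simplified model under assumptions: ranges are half-open `(start, end)` integer pairs rather than token ranges, and `subtract_ranges` is a hypothetical name.

```python
def subtract_ranges(sstable_ranges, owned_ranges):
    """Return the parts of sstable_ranges not covered by any owned range.

    All ranges are half-open (start, end) integer pairs.
    """
    result = []
    owned = sorted(owned_ranges)
    for start, end in sstable_ranges:
        cursor = start  # left edge of the part not yet classified
        for o_start, o_end in owned:
            if o_end <= cursor or o_start >= end:
                continue  # no overlap with the remaining part
            if o_start > cursor:
                result.append((cursor, o_start))  # gap before this owned range
            cursor = max(cursor, o_end)
            if cursor >= end:
                break
        if cursor < end:
            result.append((cursor, end))  # tail after the last owned range
    return result


to_invalidate = subtract_ranges([(0, 100)], [(20, 40), (60, 80)])
```

Only the returned ranges would need cache invalidation; data in owned ranges stays cached.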
Raphael S. Carvalho
56f66cff9f dht: Extract to_partition_ranges() from streaming to allow reuse
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2020-02-20 10:53:01 -03:00
Piotr Sarna
cbe6f260ef alternator: add guarding stack height for JSON parsing
In order to avoid stack overflow issues represented by the attached
test case, rapidjson's parser now has a limit on nesting level.
Previous iterations of this patch used iterative parsing
provided by rapidjson, but that solution has two main flaws:
1. While parsing can be done iteratively, printing the document
   is based on a recursive algorithm, which makes the iteratively
   parsed JSON still prone to stack overflow on reads.
   Documents with depth 35k were already prone to that.
2. Even if reading the document would have been performed iteratively,
   its destruction is stack-based as well - the chain of C++ destructors
   is called. This error is sneaky, because it only shows with depths
   around 100k with my local configuration, but it's just as dangerous.

Long story short, capping the depth of the object to an arguably large
value (39) was introduced to prevent stack overflows. Real life
objects are expected to rarely have a depth of 10, so 39 sounds like
a safe value both for the clients and for the stack.
DynamoDB has a nesting limit of 32.

Fixes #5842
Tests: alternator-test(local,remote)
Message-Id: <b083bacf9df091cc97e4a9569aad415cf6560daa.1582194420.git.sarna@scylladb.com>
2020-02-20 13:05:58 +02:00
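The idea of capping nesting depth before any recursive processing, as in the commit above, can be sketched with an iterative scan. This is an assumption-laden illustration: `exceeds_depth` is a hypothetical helper, and exactly how the rapidjson-based guard counts levels may differ.

```python
MAX_DEPTH = 39  # value quoted in the commit message


def exceeds_depth(text, max_depth=MAX_DEPTH):
    """Return True if the JSON text nests arrays/objects deeper than max_depth.

    The scan is iterative, so it uses constant stack regardless of input depth.
    """
    depth = 0
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "[{":
            depth += 1
            if depth > max_depth:
                return True
        elif ch in "]}":
            depth -= 1
    return False


ok_doc = "[" * 39 + "]" * 39
deep_doc = "[" * 40 + "]" * 40
```

Brackets inside string values are ignored, so only structural nesting counts toward the limit.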
Piotr Dulikowski
82a2bdf39f cdc: distinguish open and closed ranges for range delete
This patch causes inclusive and exclusive range deletes to be
distinguished in cdc log. Previously, operations `range_delete_start`
and `range_delete_end` were used for both inclusive and exclusive bounds
in range deletes. Now, old operations were renamed to
`range_delete_*_inclusive`, and for exclusive deletes, new operations
`range_delete_*_exclusive` are used.

Tests: unit(dev)
2020-02-20 11:39:06 +01:00
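The naming scheme from the commit above can be sketched as a small mapping from bound inclusiveness to operation names. The four expanded names are an assumption derived from the `range_delete_*_inclusive` and `range_delete_*_exclusive` patterns in the message, and `range_delete_operations` is a hypothetical helper.

```python
def range_delete_operations(start_inclusive, end_inclusive):
    """Pick the cdc log operation names for the two bounds of a range delete."""
    start = "range_delete_start_" + ("inclusive" if start_inclusive else "exclusive")
    end = "range_delete_end_" + ("inclusive" if end_inclusive else "exclusive")
    return start, end


# A delete such as "ck >= 1 AND ck < 5": inclusive start, exclusive end.
ops = range_delete_operations(start_inclusive=True, end_inclusive=False)
```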
Asias He
62774ff882 gossiper: Always use the new generation number
User reported an issue that after a node restart, the restarted node
is marked as DOWN by other nodes in the cluster while the node is up
and running normally.

Consider the following:

- n1, n2, n3 in the cluster
- n3 shutdown itself
- n3 send shutdown verb to n1 and n2
- n1 and n2 set n3 in SHUTDOWN status and force the heartbeat version to
  INT_MAX
- n3 restarts
- n3 sends gossip shadow rounds to n1 and n2, in
  storage_service::prepare_to_join,
- n3 receives response from n1, in gossiper::handle_ack_msg, since
  _enabled = false and _in_shadow_round == false, n3 will apply the
  application state in fiber 1; fiber 1 finishes faster than fiber 2 and
  sets _in_shadow_round = false
- n3 receives response from n2, in gossiper::handle_ack_msg, since
  _enabled = false and _in_shadow_round == false, n3 will apply the
  application state in fiber 2; fiber 2 yields
- n3 finishes the shadow round and continues
- n3 resets gossip endpoint_state_map with
  gossiper.reset_endpoint_state_map()
- n3 resumes fiber 2 and applies application state about n3 into
  endpoint_state_map, at this point endpoint_state_map contains
  information including n3 itself from n2.
- n3 calls gossiper.start_gossiping(generation_number, app_states, ...)
  with new generation number generated correctly in
  storage_service::prepare_to_join, but in
  maybe_initialize_local_state(generation_nbr), it will not set new
  generation and heartbeat if the endpoint_state_map contains itself
- n3 continues with the old generation and heartbeat learned in fiber 2
- n3 continues the gossip loop, in gossiper::run,
  hbs.update_heart_beat() the heartbeat is set to the number starting
  from 0.
- n1 and n2 will not get updates from n3 because n3 reuses the old
  generation number while n1 and n2 hold a larger heartbeat version
- n1 and n2 will mark n3 as down even if n3 is alive.

To fix, always use the new generation number.

Fixes: #5800
Backports: 3.0 3.1 3.2
2020-02-20 11:20:20 +01:00
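The last steps of the scenario above come down to how peers order gossip state. A minimal sketch, assuming state is compared as a (generation, heartbeat version) pair with the generation taking precedence; `is_newer` and the concrete numbers are hypothetical and only model the comparison, not the gossiper.

```python
INT_MAX = 2**31 - 1


def is_newer(remote, local):
    """Compare (generation, heartbeat_version) pairs: generation first, then version."""
    return remote > local


# What n1 and n2 hold about n3 after the shutdown verb forced the version to INT_MAX.
known_on_peers = (100, INT_MAX)

stale_restart = (100, 1)  # n3 kept the old generation, heartbeat restarted from 0
fresh_restart = (101, 1)  # n3 used the newly generated generation number
```

With the old generation no heartbeat from n3 can exceed INT_MAX, so peers ignore it; a new generation wins regardless of the heartbeat value.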
Dejan Mircevski
8393ee2e54 cql3: Permit views sync when a table is modified
Previously we required MODIFY permissions on all materialized views in
order to modify a table.  This is wrong, because the views should be
synced to the table unconditionally.  For the same reason,
users *shouldn't* be granted MODIFY on views, to prevent them manually
changing (and breaking) a view.

This patch removes an explicit permissions check in
modification_statement introduced by 65535b3.  It also tests that a
user can indeed modify a table they are allowed to modify, regardless
of lacking permissions on the table's views and indices.

Fixes #5205.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-02-20 10:43:41 +01:00
Avi Kivity
4cc7f7e2af Merge "Log CQL queries under "trace" level" from Kostja
"
This series ensures the server more often than not initializes
raw_cql_statement, a variable responsible for holding the original
CQL query, and adds logging events to all places executing CQL,
and logs CQL text in them.

A prepared statement object is the third incarnation of
parser output in Scylla:
- first, we create a parsed_statement descendent.
This has ~20 call sites inside Cql.g
- then, we create a cql_statement descendent, at ~another 20 call sites
- finally, in ~5 call sites we create a prepared statement object,
wrapping cql_statement. Sometimes we use cql_statement object
without a prepared statement object (e.g. BATCHes).

Ideally we'd want to capture the CQL text right in the parser, but
due to complicated transformations above that would require
patching dozens of call sites.

This series moves raw_cql_statement from class prepared_statement
to its nested object, cql_statement, batches, and initializes this
variable in all major call sites. View prepared statements and
some internal DDL statements still skip setting it.
"

* 'query_processor_trace_cql_v2' of https://github.com/kostja/scylla:
  query_processor: add CQL logging to all major execute call sites.
  query_processor: move raw_cql_statement to cql_statement
  query_processor: set raw_cql_statement consistently
2020-02-20 11:07:52 +02:00
Nadav Har'El
7d545078ca docs/alternator: remove incorrect comment on BatchWriteItem
In the state of Alternator in docs/alternator/alternator.md, we said that
BatchWriteItem doesn't check for duplicate entries. That is not true -
we do - and we even have tests (test_batch_write_duplicate*) to verify that.
So drop that comment.

Refs #5698. (there is still a small bug in the duplicate checking, so still
leaving that issue open).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200219164107.14716-1-nyh@scylladb.com>
2020-02-20 08:11:31 +01:00
Nadav Har'El
b8aed18a24 alternator: unzero "scylla_alternator_total_operations" metric
In commit 388b492040, which was only supposed
to move around code, we accidentally lost the line which does

    _executor.local()._stats.total_operations++;

So after this commit this counter was always zero...
This patch returns the line incrementing this counter.

Arguably, this counter is not very important - a user can also calculate
this number by summing up all the counters in the scylla_alternator_operation
array (these are counters for individual types of operations). Nevertheless,
as long as we do export a "scylla_alternator_total_operations" metric,
we need to correctly calculate it and can't leave it zero :-)

Fixes #5836

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200219162820.14205-1-nyh@scylladb.com>
2020-02-20 08:11:15 +01:00
Raphael S. Carvalho
db4c3230f7 compaction: Add ranges for cache invalidation to compaction_completion_desc
It will store the ranges to be invalidated in row cache on compaction
completion. Intended to be used by cleanup compaction.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2020-02-19 19:30:35 -03:00
Raphael S. Carvalho
51532b84f8 compaction: Make it possible for a compaction type to customize compaction_completion_desc
compaction_completion_desc will eventually store more information that can be
customized by the compaction type.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2020-02-19 19:30:35 -03:00
Raphael S. Carvalho
fa16845353 database: Fix on_compaction_completion doc
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2020-02-19 19:30:34 -03:00
Raphael S. Carvalho
65b4fc8bcd sstables/compaction: Introduce compaction_completion_desc
This descriptor contains all information needed for the table to be properly
updated on compaction completion. A new member will be added to it soon,
which will store ranges to be invalidated in row cache on behalf of
cleanup compaction.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2020-02-19 19:29:32 -03:00
Piotr Sarna
4e95b67501 Merge 'cql3: do_execute_base_query: fix null deref ...
... when clustering key is unavailable' from Benny

This series fixes null pointer dereference seen in #5794

efd7efe cql3: generate_base_key_from_index_pk; support optional index_ck
7af1f9e cql3: do_execute_base_query: generate open-ended slice when clustering key is unavailable
7fe1a9e cql3: do_execute_base_query: fixup indentation

Fixes #5794

Branches: 3.3

Test: unit(dev) secondary_indexes_test:TestSecondaryIndexes.test_truncate_base(debug)

* bhalevy/fix-5794-generate_base_key_from_index_pk:
  cql3: do_execute_base_query: fixup indentation
  cql3: do_execute_base_query: generate open-ended slice when clustering key is unavailable
  cql3: generate_base_key_from_index_pk; support optional index_ck
2020-02-19 13:30:30 +01:00
Tomasz Grabiec
884d5e2bcb Merge "Fix use-after-frees in migration_manager and feature_service" from Pavel
Several problems with stopping the migration manager and
features were discussed recently.

The first issue is with migration manager's schema pull sleeping
and potentially using freed migration manager instances.

Two others are with freeing database and migration manager before
features they wait for are enabled.
2020-02-19 13:02:35 +01:00
Piotr Sarna
3315220aea alternator: fix server when no authorization header is found
A typo caused the code to check for wrong header and assume
that Authorization header exists, even if it was not the case.
The fix comes with a regression test.
Message-Id: <58070abddae6359212aa399688e3e2704d52f419.1582108625.git.sarna@scylladb.com>
2020-02-19 13:39:50 +02:00
Benny Halevy
7fe1a9ec4a cql3: do_execute_base_query: fixup indentation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-02-19 13:31:18 +02:00
Benny Halevy
7af1f9e26a cql3: do_execute_base_query: generate open-ended slice when clustering key is unavailable
1. Only call base_ck = generate_base_key_from_index_pk<...
   if the base schema has a clustering key.
2. Only call command->slice.set_range(*_schema, base_pk, ...
   if the base schema has a clustering key,
   otherwise just create an open ended range.

Proposed-by: Piotr Sarna <sarna@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-02-19 13:30:37 +02:00
Piotr Sarna
5f0d77b9a4 Merge 'mv: drop materialized views before its table' from Eliran
When dropping a table, the table and its views are dropped
in parallel. This is not a problem in itself, but we
have a mechanism to snapshot a deleted table before the
actual delete. When a secondary index is removed, the
snapshot process looks for its schema to create the
schema part of the snapshot, but if the main table is already
gone it will not find it.
This commit serializes view and main table removals and
removes the views prior to the tables.

See discussion on #5713

Tests:
Unit tests (dev)
dtest - A test that failed on "can't find schema" error

Fixes #5614

* eliran/serialize_table_views_deletion:
  Materialized Views: serialize tables and views creation
  Materialized Views: drop materialized views before tables
2020-02-19 12:20:20 +01:00
Pavel Emelyanov
8435e93549 db: Move unbounded_range_tombstones listening from storage_service
Now the database keeps a reference to the feature service, so we
can listen on the feature in it directly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-19 14:08:24 +03:00
Pavel Emelyanov
7aa7e4f550 migration_manager: Abort and wait cluster upgrade waiters
The maybe_schedule_schema_pull waits for schema_tables_v3 to
become available. This is unsafe in case migration manager
goes away before the feature is enabled.

Fix this by subscribing on feature with feature::listener and
waiting for condition variable in maybe_schedule_schema_pull.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-19 14:08:24 +03:00
Nadav Har'El
405115fa5f alternator: cleanup of get_string_attribute() function
The get_string_attribute() function used attribute_value->GetString()
to return an std::string. But this function does not actually return a
std::string - it returns a char*, which gets implicitly converted to
an std::string by looking for the first null character. This lookup is
unnecessary, because rjson already knows the length of the string, and
we can use it.

This patch is just a cleanup and a very small performance improvement -
I do not expect it fixes any bugs or changes anything functional, because
JSON strings anyway cannot contain verbatim nulls.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200219101159.26717-1-nyh@scylladb.com>
2020-02-19 11:59:54 +01:00
Benny Halevy
efd7efe41e cql3: generate_base_key_from_index_pk; support optional index_ck
When called from indexed_table_select_statement::do_execute_base_query,
old_paging_state->get_clustering_key() may return un-engaged
optional<clustering_key>. Dereferencing it unconditionally crashes
scylla as seen in https://github.com/scylladb/scylla/issues/5794

Fixes #5794

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-02-19 12:13:08 +02:00
Pavel Emelyanov
08363e5034 migration_manager: Abort and wait delayed schema pulls
The sleep is interrupted with the abort source, the "wait" part
is done with the existing _background_tasks gate. Also we need
to make sure the gate stays alive till the end of the function,
so make use of the async_sharded_service (migration manager is
already such).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-19 11:55:27 +03:00
Eliran Sinvani
95724e1a66 Materialized Views: serialize tables and views creation
This change serializes tables and views creation. The
change's purpose is to avoid possible future races due to
a view searching for its base table information while the
latter hasn't been created yet.
2020-02-19 10:51:49 +02:00
Eliran Sinvani
923a46030b Materialized Views: drop materialized views before tables
When dropping a table, the table and its views are dropped
in parallel. This is not a problem in itself, but we
have a mechanism to snapshot a deleted table before the
actual delete. When a secondary index is removed, the
snapshot process looks for its schema to create the
schema part of the snapshot, but if the main table is already
gone it will not find it.
This commit serializes view and main table removals and
removes the views prior to the tables.

See discussion on https://github.com/scylladb/scylla/pull/5713

Tests:
Unit tests (dev)
dtest - A test that failed on "can't find schema" error

Fixes #5614
2020-02-19 10:48:11 +02:00
Pavel Solodovnikov
a46f235092 cql3: prefer passing schema as const ref instead of shared_ptr
De-pointerize cql3 code APIs further: change some call sites
to pass `schema` as const-ref instead of `shared_ptr`.

The affected functions are known to always expect a non-null
pointer to schema and don't store or pass the pointer anywhere
else, so it's assumed safe to give them just a reference.

Tests: unit(dev, debug)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200218142338.69824-1-pa.solodovnikov@scylladb.com>
2020-02-18 20:13:10 +02:00
Piotr Dulikowski
4343471954 hh: handle counter update hints correctly
This patch fixes a bug that appears because of an incorrect interaction
between counters and hinted handoff.

When a counter is updated on the leader, it sends mutations to other
replicas that contain all counter shards from the leader. If
consistency level is achieved but some replicas are unavailable, a hint
with mutation containing counter shards is stored.

When a hint's destination node is no longer its replica, we attempt to
send it to all its current replicas. Previously, if the
cluster did not have the feature HINTED_HANDOFF_SEPARATE_CONNECTION
enabled, storage_proxy::mutate function would be used for the purpose of
sending the hint. It was incorrect because that function treats
mutations for counter tables as mutations containing only a delta (by
how much to increase/decrease the counter). These two types of mutations
have different serialization formats, so in this case a "shards" mutation
is reinterpreted as "delta" mutation, which can cause data corruption to
occur.

This patch fixes the case when HINTED_HANDOFF_SEPARATE_CONNECTION is
disabled, and uses storage_proxy::mutate_internal, which treats "shards"
mutations as regular mutations - which is the correct behavior.

Refs #5833.
Tests: unit(dev)
2020-02-18 20:13:10 +02:00
Avi Kivity
454e7e0109 Revert "streaming: Do not invalidate cache if no sstable is added in flush_streaming_mutations"
This reverts commit 5e9925b9f0. It causes
data resurrection in simple_decommission_node_2_test.

Fixes #5838.
2020-02-18 20:13:10 +02:00
Calle Wilund
d7a9fc3611 db::config: Adjust truncation timeout to match value in yaml example
Refs #817

Truncation is potentially long. It has its own timeout in storage
proxy/rpc. This value should probably also be higher than default
timeout.

Message-Id: <20200218135926.26522-1-calle@scylladb.com>
2020-02-18 20:13:10 +02:00
Amnon Heiman
30a7587963 test/boost/database_test: adopt new clear_snapshot signature
The clear_snapshot method signature was modified to accept a table name
parameter.

This patch adds an empty table name to the clear_snapshot test so it
would compile and pass.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2020-02-18 16:50:58 +02:00
Amnon Heiman
6b020e67ce api/storage_service: Support specifying a table when deleting a snapshot
This patch adds an optional parameter to DELETE /storage_service/snapshots

After this patch, the following is supported, assuming a keyspace called
keyspace1 and a table called standard1 exist:

curl -X POST 'http://localhost:10000/storage_service/snapshots?tag=am1&kn=keyspace1'

curl -X DELETE --header 'Accept: application/json' 'http://localhost:10000/storage_service/snapshots?tag=am1&kn=keyspace1&cf=standard1'

Fixes #5658

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2020-02-18 16:34:10 +02:00
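The request in the commit above can also be composed programmatically. A small sketch that only builds the URL: the endpoint path and the `tag`, `kn` and `cf` parameters come from the curl examples, while `snapshot_url` is a hypothetical helper and no request is actually sent.

```python
from urllib.parse import urlencode


def snapshot_url(base, tag, keyspace, table=None):
    """Build the /storage_service/snapshots URL; `cf` is added only when a table is given."""
    params = {"tag": tag, "kn": keyspace}
    if table is not None:
        params["cf"] = table  # the optional table filter added by this patch
    return f"{base}/storage_service/snapshots?{urlencode(params)}"


whole_keyspace = snapshot_url("http://localhost:10000", "am1", "keyspace1")
single_table = snapshot_url("http://localhost:10000", "am1", "keyspace1", "standard1")
```

The second URL would be used with the DELETE method to clear only the standard1 part of the snapshot.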
Amnon Heiman
c3260bad25 storage_service: Add optional table name to clear snapshot
There are cases when it is useful to delete a specific table from a
snapshot.

An example is when a snapshot is used for backup. Backup can take a long
time; during that time, each table can be deleted once it has been
backed up, without waiting for the entire backup process to
complete.

This patch adds such an option to the database and to the storage_service
wrapping method that calls it.

If a table is specified, a filter function is created that selects only
the column family with the given name.

This is similar to the filtering at the keyspace level.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2020-02-18 16:34:10 +02:00
Nadav Har'El
e50e8a8432 alternator-test: improve ReturnValues tests
This patch adds additional tests for the ReturnValues feature to make the
test even more comprehensive. As this feature is not yet implemented in
Alternator (see issue #5053), all tests XFAIL on Alternator - except two
tests for the trivial "NONE" mode which is already supported. As usual
all tests pass on DynamoDB.

This patch also splits the tests for the ReturnValues parameter in the
UpdateItem operation into multiple tests, each testing one of the different
modes which DynamoDB supports - NONE, ALL_OLD, UPDATED_OLD, ALL_NEW and
UPDATED_NEW. The separate tests will be useful if we implement this feature
incrementally - so the separate modes can be tested separately.

Refs #5053.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200218085618.5584-1-nyh@scylladb.com>
2020-02-18 16:16:20 +02:00
Alejo Sanchez
45a6cc5d53 cql3: single metric for range scan and full scan
Combining both range and full table scans in a single metric as
"partition range scans are used to implement full scans in scylla deployments."
Requested by @bdenes and @avi

Refs: #5209

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Message-Id: <20200211101221.690031-2-alejo.sanchez@scylladb.com>
2020-02-18 16:16:20 +02:00
Nadav Har'El
c8348bccc9 docs: new document about protocols and ports in Scylla
This patch adds a new document, docs/protocols.md, about all the different
protocols which Scylla supports - and the different ports which they use.
This includes Scylla's internal protocol, user-facing protocols (CQL, Thrift,
DynamoDB, Redis, JMX) and things in between (REST API, Prometheus).

I wrote this document after being frustrated that when I see a port number
(e.g., "7000") or a port option name (e.g., "storage_port") it's hard to
figure out what they actually are - or why they are given such strange
names. The intention is that this file can easily be searched for option
names, for more familiar names (e.g., "CQL"), and a reader can get the
whole story - including some pointers to relevant parts of the code (this
part of the document can be improved further - in this version this only
exists for the internal protocol).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200217172049.25510-1-nyh@scylladb.com>
2020-02-18 16:16:20 +02:00
Avi Kivity
fe71ed5f82 Update seastar submodule
* seastar c7c249f67d...2b51022073 (8):
  > dns_test: Test with seastar.io instead of www.google.com
  > sharded: fix move constructor for peering_sharded_service
Fixes #5814.
  > tests: Delete Seastar.dist
  > reactor: distinguish structs from classes when befriending
  > util/tuple_utils.hh: avoid redundant move
  > io_request: do not include fmt/format.h
  > reactor: cleanup write_some leftover
  > posix: change the signature of accept/try_accept
2020-02-18 16:16:19 +02:00
Avi Kivity
6c7aa18238 Merge "Introduce schema::get_partitioner" from Piotr
"
Introduce schema::get_partitioner and use it instead of dht::global_partitioner.

Fixes #5493

Tests: unit(dev, release, debug)
"

* 'per_table_partitioner_prep' of https://github.com/haaawk/scylla: (35 commits)
  cdc: stop using partitioners
  partitioner_test: stop calling set_global_partitioner
  storage_service: stop calling global_partitioner()
  mutation_writer_test: stop calling global_partitioner()
  schema: reduce number of global_partitioner() calls
  test_services: stop calling global_partitioner()
  sstable_utils: stop calling global_partitioner()
  sstable_resharding_test: stop depending on global partitioner
  sstable_mutation_test: stop calling global_partitioner()
  sstable_data_file_test: stop calling global_partitioner()
  random_schema: stop taking partitioner in constructor
  mutation_reader_test: stop calling global_partitioner()
  multishard_mutation_query_test: stop calling global_partitioner()
  row_level repair: stop calling global_partitioner()
  distribute_reader_and_consume_on_shards: don't take partitioner
  thrift: reduce global_partitioner() calls
  binary_search: stop calling global_partitioner()
  index_entry: stop calling global_partitioner()
  mc writer: stop calling global_partitioner()
  sstable: stop calling global_partitioner()
  ...
2020-02-17 18:12:53 +02:00
Avi Kivity
06c16108df Merge "cql3: minor cleanups (de-pointerize APIs)" from Pavel
"
This change set is comprised of several unrelated patches regarding
some cleanups in cql3 layer code.

Most of the changes are aimed at eliminating superfluous `shared_ptr`
usages. In places where it can be safely assumed that objects passed
to the function are considered non-null and constant, these places
were adjusted to use passing as const ref instead.

Other changes include eliminating unused arguments in some functions
and replacing usages of `shared_ptr<service::pager::paging_state>`
with `lw_shared_ptr`, since `pager::paging_state` is final.

Tests: unit(dev, debug)
"

* 'feature/cql_cleanups_4' of https://github.com/ManManson/scylla:
  cql3: minor sweeps through the cql layer code to reduce shared_ptrs count
  cql3: change some function signatures to accept const references
  cql3: change signatures of several functions to return crefs instead of pointers
  cql3: remove unused argument at functions::castas_functions::get
  paging_state: switch from shared_ptr to lw_shared_ptr
2020-02-17 17:50:30 +02:00
Piotr Dulikowski
01084a79b8 hh: send orphaned hints on HINT_MUTATION verb
When replaying a hint with a destination node that is no longer in the
cluster, it will be sent with cl=ALL to all its new replicas. Before
this patch, the MUTATION verb was used, which causes such hints to be
handled on the same connection and with the same priority as regular
writes. This can cause problems when a large number of hints is
orphaned and they are scheduled to be sent at once. Such a situation
may happen when replacing a dead node - all nodes that accumulated hints
for the dead node will now send them with cl=ALL to their new replicas.

This patch changes the verb used to send such hints to HINT_MUTATION.
This verb is handled on a separate connection and with streaming
scheduling group, which gives them similar priority to non-orphaned
hints.

Refs: #4712

Tests: unit(dev)
2020-02-17 14:45:22 +01:00
Tomasz Grabiec
76d1dd7ec6 Merge "nodetool scrub: implement validation and the skip-corrupted flag" from Botond

Nodetool scrub rewrites all sstables, validating their data. If corrupt
data is found the scrub is aborted. If the skip-corrupted flag is set,
corrupt data is instead logged (just the keys) and skipped.

The scrubbing algorithm itself is fairly simple, especially since we
already have a mutation stream validator that we can use to validate the
data. However currently scrub is piggy-backed on top of cleanup
compaction. To implement this flag, we have to make scrub a separate
compaction type and propagate down the flag. This required some
massaging of the code:
* Add support for more than two (cleanup or not) compaction types.
* Allow passing custom options for each compaction type.
* Allow stopping a compaction without the manager retrying it later.

Additionally the validator itself needed some changes to allow different
ways to handle errors, as needed by the scrub.
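
The abort-or-skip behaviour described above can be modelled in a few lines.
This is only an illustrative Python sketch, not Scylla's C++ implementation:
the names `scrub` and `CorruptDataError` are hypothetical, and "corrupt" is
assumed here to mean a partition key that is not strictly greater than the
last accepted one, as a stand-in for the real stream validation.

```python
import logging

logger = logging.getLogger("scrub_sketch")


class CorruptDataError(Exception):
    """Raised when corrupt data is found and skipping is not allowed."""


def scrub(partitions, skip_corrupted=False):
    """Rewrite `partitions`, validating them along the way.

    `partitions` is an iterable of (key, rows) pairs. As an assumption of this
    sketch, a partition is treated as corrupt when its key is not strictly
    greater than the last accepted key.
    """
    output = []
    last_key = None
    for key, rows in partitions:
        if last_key is not None and key <= last_key:
            if not skip_corrupted:
                # Default mode: abort the whole scrub on the first bad partition.
                raise CorruptDataError(f"out-of-order partition key: {key!r}")
            # skip-corrupted mode: log only the key, then drop the partition.
            logger.warning("skipping corrupt partition, key=%r", key)
            continue
        output.append((key, rows))
        last_key = key
    return output
```

With `skip_corrupted=False` the first invalid partition aborts the run; with
`skip_corrupted=True` only the valid partitions end up in the output.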

Fixes: #5487

* https://github.com/denesb/nodetool-scrub-skip-corrupted/v7:
  table: cleanup_sstables(): only short-circuit on actual cleanup
  compaction: compaction_type: add Upgrade
  compaction: introduce compaction_options
  compaction: compaction_descriptor: use compaction options instead of
    cleanup flag
  compaction_manager: collect all cleanup related logic in
    perform_cleanup()
  sstables: compaction_stop_exception: add retry flag
  mutation_fragment_stream_validator: split into low-level and
    high-level API
  compaction: introduce scrub_compaction
  compaction_manager: scrub: don't piggy-back on upgrade_sstables()
  test: sstable_datafile_test: add scrub unit test
2020-02-17 15:28:07 +02:00
Piotr Jastrzebski
f0f6e220ea cdc: stop using partitioners
CDC can get all it needs from a config and does not need
partitioner.

For base table specific operations CDC is using partitioner
from that table (obtained with schema::get_partitioner).

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-17 10:59:15 +01:00
Piotr Jastrzebski
c0873f9b10 partitioner_test: stop calling set_global_partitioner
All the places that use partitioner have been switched
to not use global partitioner any more and we can stop
setting it in this test.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-17 10:59:15 +01:00
Piotr Jastrzebski
499e330ff9 storage_service: stop calling global_partitioner()
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-17 10:59:15 +01:00
Piotr Jastrzebski
81cfc63ba6 mutation_writer_test: stop calling global_partitioner()
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-17 10:59:15 +01:00
Piotr Jastrzebski
406f42e012 schema: reduce number of global_partitioner() calls
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-17 10:59:15 +01:00
Piotr Jastrzebski
8a9dc8b394 test_services: stop calling global_partitioner()
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-17 10:59:15 +01:00
Piotr Jastrzebski
510245f3c3 sstable_utils: stop calling global_partitioner()
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-17 10:59:15 +01:00
Piotr Jastrzebski
65f8fc5a06 sstable_resharding_test: stop depending on global partitioner
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-17 10:59:15 +01:00
Piotr Jastrzebski
a65f3d1f7b sstable_mutation_test: stop calling global_partitioner()
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-17 10:59:15 +01:00
Piotr Jastrzebski
aae6240273 sstable_data_file_test: stop calling global_partitioner()
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-17 10:59:15 +01:00
Piotr Jastrzebski
a18c791f6f random_schema: stop taking partitioner in constructor
random_schema already has a _schema field which in turn
has a get_partitioner() function. Storing a partitioner
in random_schema is redundant.

At the moment all uses of random_schema are based on
default partitioner so it is not necessary to set it
explicitly. If in the future we need random_schema to
work with other partitioners we will add the constructor
back and fix the creation of _schema to contain it. It's
not needed now though.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-17 10:59:15 +01:00
Piotr Jastrzebski
aeb9ea87df mutation_reader_test: stop calling global_partitioner()
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-17 10:59:15 +01:00
Piotr Jastrzebski
4df60c7998 multishard_mutation_query_test: stop calling global_partitioner()
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-17 10:59:15 +01:00
Piotr Jastrzebski
ef9acd9ee5 row_level repair: stop calling global_partitioner()
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-17 10:59:15 +01:00
Piotr Jastrzebski
9494da2102 distribute_reader_and_consume_on_shards: don't take partitioner
This function already takes schema so it can get partitioner
using schema::get_partitioner.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-17 10:59:15 +01:00
Piotr Jastrzebski
7c6f415647 thrift: reduce global_partitioner() calls
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-17 10:59:15 +01:00
Piotr Jastrzebski
56e3cb8c3a binary_search: stop calling global_partitioner()
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-17 10:59:15 +01:00
Piotr Jastrzebski
1db437ee91 index_entry: stop calling global_partitioner()
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-17 10:59:15 +01:00
Piotr Jastrzebski
1f866d7001 mc writer: stop calling global_partitioner()
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-17 10:59:15 +01:00
Piotr Jastrzebski
6fe0dcbac4 sstable: stop calling global_partitioner()
parse functions now take const schema& which allows
them to reach a partitioner. It's safe to take schema
by const& because the only caller takes the schema
from an sstable object.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-17 10:59:15 +01:00
Piotr Jastrzebski
0677bafd16 multishard_mutation_query: stop calling global_partitioner()
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-17 10:59:15 +01:00
Piotr Jastrzebski
76d154dbac view: stop calling global_partitioner()
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-17 10:59:15 +01:00
Piotr Jastrzebski
6e424a3645 select_statement: stop calling global_partitioner()
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-17 10:59:15 +01:00
Piotr Jastrzebski
2d7532f87f dht: add dht::get_token
and replace all calls to dht::global_partitioner().get_token

dht::get_token is better because it takes schema and uses it
to obtain partitioner instead of using a global partitioner.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-17 10:59:15 +01:00
Piotr Jastrzebski
ca4a89d239 dht: add dht::decorate_key
and replace all dht::global_partitioner().decorate_key
with dht::decorate_key

It is an improvement because dht::decorate_key takes schema
and uses it to obtain partitioner instead of using global
partitioner as it was before.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-17 10:59:06 +01:00
Piotr Jastrzebski
abd76e566f dht::shard_of: stop calling global_partitioner()
Take const schema& as a parameter of shard_of and
use it to obtain partitioner instead of calling
global_partitioner().

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-17 10:23:16 +01:00
Piotr Jastrzebski
5234350df2 split_range_to_single_shard: stop calling global_partitioner()
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-17 10:19:15 +01:00
Piotr Jastrzebski
24b721c21b ring_position_exponential_sharder: stop calling global_partitioner()
ring_position_exponential_sharder calls global_partitioner
in one constructor. Luckily the constructor is never used so
we can remove that constructor.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-17 10:19:15 +01:00
Piotr Jastrzebski
db19a76b1f selective_token_range_sharder: stop calling global_partitioner()
This requires a change in a repair that uses
selective_token_range_sharder.

Repair performs operation on a set of tables. We will have to
make sure that all of those tables use the same partitioner.

This is achieved by adding a check to a repair_info constructor.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-17 10:19:15 +01:00
Piotr Jastrzebski
75785ef13e i_partitioner: add operator<<
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-17 10:19:15 +01:00
Piotr Jastrzebski
065885300d i_partitioner: add == and != operators
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-17 10:19:15 +01:00
Piotr Jastrzebski
57e4b7f215 ring_position_range_sharder: stop calling global_partitioner
Remove ring_position_range_sharder(nonwrapping_range<ring_position>)
which calls another constructor with partitioner obtained with
dht::global_partitioner().

Fix all the places the removed constructor was used and obtain
partitioner from schema instead.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-17 10:19:15 +01:00
Piotr Jastrzebski
dd1120454b dht: move sharders to a separate header
i_partitioner.hh is widely included while sharders are used
only in 6 places so there's no need to include them in
the whole codebase.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-17 10:19:02 +01:00
Piotr Jastrzebski
a5b6374398 dht: remove unused ring_position_exponential_vector_sharder
The next patch is moving sharders to a separate header.
ring_position_exponential_vector_sharder is not used anywhere
so instead of just silently removing it with the move, this
commit is separated to make it clear the class is removed.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-17 10:04:41 +01:00
Piotr Jastrzebski
9b95153136 schema: add get_partitioner()
The plan is to remove dht::global_partitioner()
and use schema::get_partitioner() instead.

This will allow a usage of per schema/table partitioner
instead of a single global partitioner everywhere.

Initially schema::get_partitioner will call
dht::global_partitioner. After all the calls
to dht::global_partitioner are switched to
schema::get_partitioner, the ability to set per schema
partitioner will be implemented.
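
The migration plan above can be pictured with a small model. This is a rough
Python sketch rather than Scylla's C++ code: `ToyPartitioner`, the `Schema`
class, and the hash-based token are all hypothetical stand-ins. It only shows
the shape of the change, where callers ask the schema for a partitioner and
the schema falls back to the global one until a per-schema partitioner can be
set.

```python
import hashlib


class ToyPartitioner:
    """Hypothetical partitioner: maps a key to an integer token."""

    def __init__(self, name, salt=b""):
        self.name = name
        self._salt = salt

    def get_token(self, key):
        # Toy token function for the sketch; not the real token algorithm.
        digest = hashlib.sha256(self._salt + key).digest()
        return int.from_bytes(digest[:8], "big")


_GLOBAL_PARTITIONER = ToyPartitioner("global")


def global_partitioner():
    """Stand-in for the single global partitioner being phased out."""
    return _GLOBAL_PARTITIONER


class Schema:
    def __init__(self, table, partitioner=None):
        self.table = table
        self._partitioner = partitioner

    def get_partitioner(self):
        # Initial step of the plan: fall back to the global partitioner.
        # A per-schema partitioner, once supported, takes precedence.
        if self._partitioner is not None:
            return self._partitioner
        return global_partitioner()


def get_token(schema, key):
    """Callers pass a schema instead of reaching for the global partitioner."""
    return schema.get_partitioner().get_token(key)
```

Because every call site goes through the schema, switching a table to its own
partitioner later does not require touching those call sites again.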

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-17 10:04:41 +01:00
Takuya ASADA
9a84164c95 dist: drop old distribution code
Since we dropped support of Ubuntu 14.04 and Debian 8, we can remove the code
for these distributions.
2020-02-17 10:18:35 +02:00
Avi Kivity
6728b96df7 clustering_interval_set: split to own header file
clustering_interval_set is a rarely used class, but one that requires
boost/icl, which is quite heavyweight. To speed up compilation, move
it to its own header and sprinkle #includes where needed.

Tests: unit (dev)
Message-Id: <20200214190507.1137532-1-avi@scylladb.com>
2020-02-16 17:40:47 +02:00
Nadav Har'El
51f3e7eaff merge: token_metadata: pimplify
Merged patch series from Avi Kivity:

token_metadata is a heavyweight class with heavyweight includes
(boost/icl), so it is a good candidate for the pimpl pattern, which
this series implements.

Tests: unit (dev)

https://github.com/avikivity/scylla token_metadata-pimplification/v1

Avi Kivity (6):
  locator: token_metadata: use non-deduced return type for ring_range()
  locator: token_metadata: pimplify
  locator: token_metadata: make token_metadata_impl::tokens_iterator a
    non-nested class
  locator: token_metadata: pimplify tokens_iterator
  locator: token_metadata: move implementation classes to .cc
  locator: token_metadata: remove unused include "query-request.hh"

 locator/token_metadata.hh           |  783 +---------------
 locator/token_metadata.cc           | 1338 ++++++++++++++++++++++++++-
 test/boost/sstable_datafile_test.cc |    1 +
 3 files changed, 1332 insertions(+), 790 deletions(-)

Message-Id: <20200214184954.1130194-1-avi@scylladb.com>
2020-02-16 17:15:26 +02:00
Piotr Sarna
70c9889ef7 storage_proxy: remove dead metrics code
This patch removes an implementation of register_split_metrics_for,
which is not used anywhere in the codebase.

Message-Id: <e83f3e9d109113fe0553919032f005d4ab3a3023.1581851904.git.sarna@scylladb.com>
2020-02-16 17:00:45 +02:00
Nadav Har'El
e18a302c54 merge: Implement stopping alternator server
Merged patch series from Piotr Sarna:

This miniseries implements graceful shutdown for alternator
by introducing two mechanisms:
 - refusing to accept new requests during shutdown
   by stopping the HTTP/HTTPS server(s)
 - guarding pending requests with a gate, so that
   when alternator server is stopped, no in-flight
   alternator requests are being processed

Fixes #5781

Tests: manual(stopping Scylla in the middle of alternator-test
              multiple times, used to crash every time
              with local_is_initialized() assertion)

Piotr Sarna (3):
  alternator: implement stopping alternator server
  alternator: guard pending alternator requests with a gate
  alternator: guard alternator-specific handlers with a gate

 alternator/server.cc | 64 +++++++++++++++++++++++++++++++++++---------
 alternator/server.hh |  4 +++
 main.cc              | 11 ++++++--
 3 files changed, 64 insertions(+), 15 deletions(-)
2020-02-16 16:35:14 +02:00
Pavel Solodovnikov
abb3a7e218 cql3: minor sweeps through the cql layer code to reduce shared_ptrs count
Convert some more helper functions to accept const reference to
column_specification and column_identifier instead of shared_ptr.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-02-16 17:24:26 +03:00
Pavel Solodovnikov
5b6e2d7178 cql3: change some function signatures to accept const references
This patch continues the effort of reducing shared_ptr's count
in the different APIs throughout the cql3 code tree.

These functions now pass cref to column_specification instead of
shared_ptr:
 * multiple variants of `validate_assignable_to`
 * sets::value_spec_of
 * lists::value_spec_of
 * lists::index_spec_of
 * lists::uuid_index_spec_of
 * tuples::component_spec_of
 * user_types::field_spec_of

These functions don't pass the shared_ptr around down the call
hierarchy, also obviously assuming that the column_specification
passed is always non-null.

So it's safe to assume that they don't take shared ownership of
the pointer or knowingly prolong the lifetime of the object
it points to.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-02-16 17:24:14 +03:00
Pavel Solodovnikov
49bf936403 cql3: change signatures of several functions to return crefs instead of pointers
The following functions now accept const reference to
column_specification instead of shared_ptr:
 * lists::index_spec_of
 * lists::value_spec_of
 * lists::uuid_index_spec_of
 * sets::value_spec_of

Changed maps::value_spec_of and maps::key_spec_of signatures
to accept const ref instead of non-const ref to
column_specification.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-02-16 17:23:56 +03:00
Pavel Solodovnikov
7c05100c87 cql3: remove unused argument at functions::castas_functions::get
Remove unused `schema_ptr` argument at
`functions::castas_functions::get` function.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-02-16 17:23:46 +03:00
Pavel Solodovnikov
d64fd52ae5 paging_state: switch from shared_ptr to lw_shared_ptr
Change the way `service::pager::paging_state` is passed around
from `shared_ptr` to `lw_shared_ptr`. It's safe since
`paging_state` is final.

Tests: unit(dev, debug)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-02-16 17:23:36 +03:00
Piotr Sarna
626ec730c4 storage_proxy: make register_metrics_for function reentrant
Helper function for registering metrics for an endpoint,
register_metrics_for(ep) depends on an external state to be updated.
It checks if given metrics are added to a map, and if not, the metrics
are registered, but the mentioned map is expected to be updated
by the caller (e.g. get_ep_stat). This behaviour is error-prone,
because calling this function twice will result in an exception,
since registering metrics twice is not allowed.
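
One way to make such a helper safe to call twice is to let it own the
bookkeeping itself. The Python sketch below is a model of that idea only; the
class and method names are hypothetical, and it is not a rendering of the
actual storage_proxy code.

```python
class MetricsRegistry:
    """Toy registry that, like the real one, rejects duplicate registration."""

    def __init__(self):
        self._names = set()

    def register(self, name):
        if name in self._names:
            raise RuntimeError(f"metric already registered: {name}")
        self._names.add(name)


class EndpointStats:
    def __init__(self, registry):
        self._registry = registry
        self._per_endpoint = {}

    def register_metrics_for(self, endpoint):
        """Check the map and update it in the same place.

        The caller no longer has to remember to record the endpoint, so a
        second call for the same endpoint is a harmless no-op.
        """
        stats = self._per_endpoint.get(endpoint)
        if stats is None:
            self._registry.register(f"writes_to_{endpoint}")
            stats = {"writes": 0}
            self._per_endpoint[endpoint] = stats
        return stats
```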

Refs #5697
Message-Id: <5a9ddccf52861749dbda4204b5d098cc77bc51eb.1581855769.git.sarna@scylladb.com>
2020-02-16 15:43:07 +02:00
Piotr Sarna
bd888a2695 alternator: guard alternator-specific handlers with a gate
Alternator is able to serve more requests than its database operations,
e.g. a health check and returning the list of its nodes.
These operations, for safety, are now also guarded by the pending
requests gate.
2020-02-16 14:15:29 +01:00
Piotr Sarna
acfed880cc alternator: guard pending alternator requests with a gate
In order to make sure that pending alternator requests are processed
during shutdown, a gate for each shard is introduced. On shutdown,
each gate will be closed and all in-progress operations will be waited upon.
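
For readers unfamiliar with the gate pattern, here is a minimal asyncio model
of it. It is a sketch in Python, not the seastar gate used by the patch; the
`Gate` class, `GateClosedError`, and `handle_request` are hypothetical names
for illustration. A request enters the gate before doing work and leaves it
afterwards; closing the gate rejects new entries and waits for the in-flight
ones.

```python
import asyncio


class GateClosedError(Exception):
    """Raised when a request tries to enter a gate that is already closed."""


class Gate:
    def __init__(self):
        self._count = 0
        self._closed = False
        self._drained = asyncio.Event()
        self._drained.set()  # nothing is in flight yet

    def enter(self):
        if self._closed:
            raise GateClosedError("gate is closed")
        self._count += 1
        self._drained.clear()

    def leave(self):
        self._count -= 1
        if self._count == 0:
            self._drained.set()

    async def close(self):
        # Refuse new entries, then wait until all in-flight work has left.
        self._closed = True
        await self._drained.wait()


async def handle_request(gate, work):
    """Run `work()` under the gate so shutdown can wait for it."""
    gate.enter()
    try:
        return await work()
    finally:
        gate.leave()
```

In this model, shutdown would call `await gate.close()` once per shard-local
gate before tearing down the state that request handlers depend on.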

Fixes #5781
2020-02-16 13:48:45 +01:00
Piotr Sarna
c8ab9b3ae4 alternator: implement stopping alternator server
Stopping Scylla with alternator enabled is not clean,
because the server does not stop accepting requests
on shutdown, which leads to use-after-free events.
The first step towards a cleaner solution is to implement
alternator_server::stop(), which stops the HTTP/HTTPS servers.

Refs #5781
2020-02-16 13:34:21 +01:00
Nadav Har'El
70d914ad5b alternator: update docker instructions in docs/alternator/getting-started.md
The instructions in docs/alternator/getting-started.md on how to run
Alternator with docker are outdated and confusing, so this patch updates
them.

First, the instructions recommended the scylladb/scylla-nightly:alternator
tag, but we only ever created this tag once, and never updated it. Since
then, Alternator has been constantly improving, and we've caught up on
a lot of features, and people who want to test or evaluate Alternator
will most likely want to run the latest nightly build, with all the latest
Alternator features. So we update the instructions to request the latest
nightly build - and mention the need to explicitly do "docker pull" (without
this step, you can find yourself running an antique nightly build, which
you downloaded months ago!). This instruction can be revisited once
Alternator is GAed and not improving quickly and we can then recommend to
run the latest stable Scylla - but I think we're not there yet.

Second, in recent builds, Alternator requires that the LWT feature is
enabled, and since LWT is still experimental, this means that one needs
to add "--experimental 1" to the "docker run" command. Without it, the
command line in getting-started.md will refuse to boot, complaining that
Alternator was enabled but LWT wasn't. So this patch adds the
"--experimental 1" in the relevant places in the text. Again, this
instruction can and should be revisited once LWT goes out of experimental
mode.

Fixes #5813

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200216113601.9535-1-nyh@scylladb.com>
2020-02-16 12:42:37 +01:00
Nadav Har'El
b01b11c1f3 alternator: implement KeyConditionExpression
This patch adds to Alternator's Query operation full support for the
KeyConditionExpression parameter - a newer syntax for specifying which
partition and which sort-key range are to be queried. The older syntax
for the same thing, "KeyConditions", was already supported by Alternator.

The patch also includes additional test cases for more corner cases
discovered during the development. After this patch, all 47 test cases
in test_key_condition_expression.py pass on Alternator (and, of course,
also on DynamoDB).

One interesting thing to note about this patch is that it does *not*
include a new parser for the KeyConditionExpression syntax. It turns out
that we need - to be fully compatible with DynamoDB - to use the
already existing parser for *ConditionExpression* syntax, and then forbid
certain things not allowed in KeyConditionExpression (you can see a lot
of examples in code comments and in the tests included in this patch).
Most importantly, allowing the full ConditionExpression syntax also
means we allow completely useless parentheses on key conditions, e.g.,
'((p=:p) AND (c=:c))'. While the KeyConditionExpression documentation
doesn't mention allowing these parentheses, DynamoDB does support them -
and it turns out that boto3 uses them when you use its condition builders,
as we do in one test case (test_query_key_condition_expression).

Fixes #5037.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200213192509.32685-4-nyh@scylladb.com>
2020-02-16 11:22:30 +02:00
Nadav Har'El
15515b2cc1 alternator: more useful get_key_from_typed_value() utility function
We had a get_key_from_typed_value() utility function to decode a
JSON-encoded value with a known type (the JSON encoding is a map whose
key is the type, the value always a string because all possible key types -
string, bytes and number, are encoded as strings).

However, the function was less useful than it could have been - it was
missing one check for a malformed object (a check which only appeared in
one of its callers), it unnecessarily received the column's expected type
(all the callers passed it the given key column's type).

The cleaned up function will be more useful for the following patch
to support KeyConditionExpression, which wants to reuse it.

While at it, this patch also uses rjson::to_string_view(it->value)
instead of the less correct it->value.GetString() (the latter relies
on null-termination, which is actually true for JSON strings, but there
is no reason to rely on it).
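
The typed-value encoding described above looks like `{"S": "abc"}` or
`{"N": "42"}`. The Python sketch below shows the kind of checks such a decoder
needs; it is not the Alternator C++ function, and `KeyColumn` and
`decode_typed_key` are hypothetical names. The expected type is taken from the
column rather than passed separately, mirroring the cleanup described above.

```python
from collections import namedtuple

# Hypothetical description of a key column: its name and its type tag,
# one of "S" (string), "B" (bytes) or "N" (number).
KeyColumn = namedtuple("KeyColumn", ["name", "type_tag"])


def decode_typed_key(typed_value, column):
    """Return the string payload of a typed key value such as {"S": "abc"}."""
    # Malformed-object check: exactly one member, whose name is the type.
    if not isinstance(typed_value, dict) or len(typed_value) != 1:
        raise ValueError(f"malformed value object for key column {column.name}")
    (type_tag, payload), = typed_value.items()
    if type_tag != column.type_tag:
        raise ValueError(
            f"type mismatch for key column {column.name}: "
            f"expected {column.type_tag}, got {type_tag}"
        )
    # All key types are encoded as JSON strings, numbers included.
    if not isinstance(payload, str):
        raise ValueError(f"key column {column.name} value must be a string")
    return payload
```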

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200213192509.32685-3-nyh@scylladb.com>
2020-02-16 11:22:30 +02:00
Nadav Har'El
1fd44a0049 alternator: extract useful function to_string_view()
conditions.cc contains a useful utility function for extracting (without
copying) a string_view from a rjson::value which is known to contain a
string. This function will be useful in more Alternator code, so let's
extract it to rjson.hh, with the name rjson::to_string_view()

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200213192509.32685-2-nyh@scylladb.com>
2020-02-16 11:22:30 +02:00
Asias He
5e9925b9f0 streaming: Do not invalidate cache if no sstable is added in flush_streaming_mutations
The table::flush_streaming_mutations was used in the days when streaming
data goes to memtable. After switching to the new streaming, data goes
to sstables directly in streaming, so the sstables generated in
table::flush_streaming_mutations will be empty.

It is unnecessary to invalidate the cache if no sstables are added. To
avoid unnecessary cache invalidating which pokes hole in the cache, skip
calling _cache.invalidate() if no sstables were added.

The steps are:

- STREAM_MUTATION_DONE verb is sent when streaming is done with old or
  new streaming
- table::flush_streaming_mutations is called in the verb handler
- cache is invalidated for the streaming ranges

In summary, this patch will avoid a lot of cache invalidation for
streaming.

Backports: 3.0 3.1 3.2
Fixes: #5769
2020-02-16 11:22:30 +02:00
Avi Kivity
82df5dfb76 Update seastar submodule
* seastar 6d2ed8cdc...c7c249f67 (3):
  > reactor: fix issue with hrtimer completions being lost
  > Merge "refactor network and storage I/O handling in backend code" from Glauber
  > reactor: don't call set_heap_profiling_enable() if not needed
2020-02-16 11:22:30 +02:00
Piotr Sarna
84be1eb6f2 test,cdc: skip across-shard test when run with one shard
Running cdc_test binary fails with a segmentation fault
when run with --smp 1, because test_cdc_across_shards
assumes shard count to be >=2. This patch skips the test case
when run with a single shard and produces a log warning.

Message-Id: <9b00537db9419d8b7c545ce0c3b05b8285351e7d.1581600854.git.sarna@scylladb.com>
2020-02-16 11:22:30 +02:00
Gleb Natapov
ed3e423922 lwt: add counter for a case where timeout is sent prematurely
There is a case in the current PAXOS implementation where a timeout is
returned because the code cannot guarantee whether the value is accepted
or not in case of a contention. The counter will help to correlate this
condition with failed requests.
Message-Id: <20200211160653.30317-2-gleb@scylladb.com>
2020-02-16 11:22:30 +02:00
Gleb Natapov
7694f164c4 lwt: add more tracing to paxos stages
Message-Id: <20200211160653.30317-1-gleb@scylladb.com>
2020-02-16 11:22:30 +02:00
Pavel Solodovnikov
bf95bd0916 cql3: more functions marked as const
The following functions are now "const":
 * `term::collect_marker_specification`
 * `relation::to_term`
 * `multi_item_terminal::get_elements`
 * `raw_update::is_compatible_with`

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200213142445.35312-1-pa.solodovnikov@scylladb.com>
2020-02-16 11:22:30 +02:00
Nadav Har'El
65d0a776c2 merge: alternator: Add keyspace per table
This series implements keyspace-per-table approach for Alternator.
The changes are as follows:
 - when a table is created, its keyspace is created first
 - after table deletion, its keyspace is deleted as well;
   works with views too, since these must be deleted
   before the base table is dropped
 - instead of SimpleStrategy, network topology is used

Keyspaces are created with a prefix not legal from CQL - 'a#'.
I validated that even though not reachable via CQL, keyspaces
created with # character work well and produce correct directories,
restarts work flawlessly too.

Fixes #5611
Refs #5596

Tests: alternator(local, remote)

Piotr Sarna (3):
  alternator: switch to keyspace-per-table approach
  alternator: move to NetworkTopologyStrategy
  alternator-test: add test for recreating a table
2020-02-16 11:22:30 +02:00
Piotr Sarna
e620181832 Merge 'cdc: TTLs on CDC log cells' from Juliusz
Cells in CDC logs used to be created while completely neglecting TTLs
(the TTLs from cdc = {...'ttl':600}). This patch adds TTLs to all cells;
there are no row markers, so wee need not set TTL there.

Fixes #5688

* jul-stas/5688-set-ttl-in-cdc-log-table:
  tests/cdc: added test for TTL on log table cells
  cdc: set TTLs on CDC log cells
2020-02-16 11:22:30 +02:00
Nadav Har'El
cb8315ace8 merge: alternator: Make write isolation config less terse
Merged patch series from Piotr Sarna:

This series addresses and fixes #5758 by providing less terse
configuration for write isolation. Before the patch,
the suggested values for alternator write isolation policies were one of
'f', 'a', 'o', 'u', which are not really descriptive.
The code actually checks only the first character from the tag value,
but now the input is validated to allow only specific, expressive values:
 * 'a', 'always', 'always_use_lwt' - always use LWT
 * 'o', 'only_rmw_uses_lwt' - use LWT only for requests that require
    read-before-write
 * 'f', 'forbid', 'forbid_rmw' - forbid statements that need read-before-
   write. Using such statements
   (e.g. UpdateItem with ConditionExpression) will result in an error
 * 'u', 'unsafe', 'unsafe_rmw' - (unsafe) perform read-modify-write without
   any consistency guarantees

Using other values will result in an error.
This series comes with tests and docs updates.
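
The accepted spellings listed above amount to a small lookup table. The
following Python sketch models that validation; it is not the code from
alternator/rmw_operation.hh, and the enum and function names are hypothetical.
Only the accepted strings are taken from the list above.

```python
from enum import Enum


class WriteIsolation(Enum):
    ALWAYS_USE_LWT = "always_use_lwt"
    ONLY_RMW_USES_LWT = "only_rmw_uses_lwt"
    FORBID_RMW = "forbid_rmw"
    UNSAFE_RMW = "unsafe_rmw"


# Every accepted tag value, mapped to the policy it selects.
_ACCEPTED_VALUES = {
    "a": WriteIsolation.ALWAYS_USE_LWT,
    "always": WriteIsolation.ALWAYS_USE_LWT,
    "always_use_lwt": WriteIsolation.ALWAYS_USE_LWT,
    "o": WriteIsolation.ONLY_RMW_USES_LWT,
    "only_rmw_uses_lwt": WriteIsolation.ONLY_RMW_USES_LWT,
    "f": WriteIsolation.FORBID_RMW,
    "forbid": WriteIsolation.FORBID_RMW,
    "forbid_rmw": WriteIsolation.FORBID_RMW,
    "u": WriteIsolation.UNSAFE_RMW,
    "unsafe": WriteIsolation.UNSAFE_RMW,
    "unsafe_rmw": WriteIsolation.UNSAFE_RMW,
}


def parse_write_isolation(tag_value):
    """Map a tag value to a policy, rejecting anything not on the list."""
    try:
        return _ACCEPTED_VALUES[tag_value]
    except (KeyError, TypeError):
        raise ValueError(f"invalid write isolation value: {tag_value!r}") from None
```

A whole-string lookup rejects inputs such as 'awful', which a check of only
the first character would have accepted as 'a'.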

Fixes #5758
Tests: alternator-test(local,remote)

Piotr Sarna (5):
  alternator: move rmw_operation to a header
  alternator: add validating write_isolation tag
  alternator-test: add test for write isolation tag
  alternator-test: mark write isolation tests scylla_only
  docs: update write isolation documentation

 alternator-test/test_condition_expression.py |  10 +-
 alternator-test/test_tag.py                  |   9 +
 alternator/executor.cc                       | 163 +++++++------------
 alternator/rmw_operation.hh                  |  99 +++++++++++
 docs/alternator/alternator.md                |   8 +-
 5 files changed, 173 insertions(+), 116 deletions(-)
2020-02-16 11:22:30 +02:00
Pavel Solodovnikov
76a0652deb types: fix serialization and validation of empty values
Empty values (zero-sized string in serialized form) were not
handled properly in serialize routines for floating types and
uuids, which led to runtime exceptions and failing tests as
described in https://github.com/scylladb/scylla/issues/5782.

Also fix validation visitor to handle empty values properly.

There already was the code in place that took into
consideration zero-sized values. But it was trying to read
some bytes regardless of that (e.g. for timeuuid values),
even if there is none to read.

Tests: unit(dev, debug)

Fixes: #5782

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200213130021.31598-1-pa.solodovnikov@scylladb.com>
2020-02-16 11:22:30 +02:00
Pavel Emelyanov
b11cf6e950 cql3/query_processor.hh: Debloat from other headers
This gives ~30% fewer recompile jobs (251 -> 181) when touching it

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200212225828.3374-1-xemul@scylladb.com>
2020-02-16 11:22:30 +02:00
Alejo Sanchez
a5516767d5 tests: enforce SERIAL consistency on all prepared statements
Add SERIAL consistency level query option to boost tests.
This is required for LWT testing.

Refs: #5777

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Message-Id: <20200212102921.27139-2-alejo.sanchez@scylladb.com>
2020-02-16 11:22:29 +02:00
Konstantin Osipov
7b7462b49f test.py: fix a bug with an incorrect glob pattern
On start, test.py cleans up testlog directory.
The cleanup file search pattern was shell style, not python
glob style, which led to .log files being left around
between runs.
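
The message does not say which shell-style construct was used. As one example
of the mismatch, Python's glob module does not expand braces the way a shell
does, so a brace pattern matches nothing. The sketch below is an assumption
about the kind of bug, not a copy of test.py; the `remove_old_logs` helper and
the file names are hypothetical.

```python
import glob
import os
import tempfile


def remove_old_logs(directory, pattern="*.log"):
    """Delete files in `directory` matching a Python glob pattern.

    Returns the sorted base names of the files that were removed.
    """
    removed = []
    for path in glob.glob(os.path.join(directory, pattern)):
        os.remove(path)
        removed.append(os.path.basename(path))
    return sorted(removed)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as testlog:
        for name in ("a.log", "b.log", "keep.txt"):
            open(os.path.join(testlog, name), "w").close()
        # Shell-style brace expansion is taken literally by glob.
        print(remove_old_logs(testlog, "*.{log,xml}"))
        # A plain glob pattern matches the .log files as intended.
        print(remove_old_logs(testlog, "*.log"))
```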
Message-Id: <20200212204047.22398-9-kostja@scylladb.com>
2020-02-16 11:22:29 +02:00
Konstantin Osipov
70fcbd8e32 test.py: print test invocation failure to test log
Capture test invocation failure in the test log.
Remove dead code lingering from introduction of log output.
Message-Id: <20200212204047.22398-6-kostja@scylladb.com>
2020-02-15 17:19:28 +02:00
Konstantin Osipov
851b2d652e test.py: start run_test() by opening test log file
Always open the log file first, this will be necessary to append
output to it in case the test timed out or didn't start.
Message-Id: <20200212204047.22398-5-kostja@scylladb.com>
2020-02-15 17:19:28 +02:00
Konstantin Osipov
22a050250e test.py: if a test fails, print it on its own line, even in compact mode
To be able to easily see what tests have failed as they run,
print failed tests on their own line even if --verbose switch is off.
Message-Id: <20200212204047.22398-4-kostja@scylladb.com>
2020-02-15 17:19:28 +02:00
Konstantin Osipov
8eb127279e test.py: convert cookie to TabularConsoleOutput class
test.py used a functional programming cookie pattern to
carry tabular console output state, convert this cookie
to an object.
In order to make console output more pretty we'll need to
add more state to it, and keeping this state in a tuple
would be too messy.
Message-Id: <20200212204047.22398-3-kostja@scylladb.com>
2020-02-15 17:19:28 +02:00
Avi Kivity
91c4409376 locator: token_metadata: remove unused include "query-request.hh"
sstable_datafile_test.cc lost access to interval_map (via
position_in_partition.hh), so it now includes that directly.
2020-02-14 20:46:25 +02:00
Avi Kivity
bee1cc42fe locator: token_metadata: move implementation classes to .cc
With pimplification complete, move the implementation classes to .cc and
remove boost/icl includes.
2020-02-14 20:34:44 +02:00
Avi Kivity
ef41b45142 locator: token_metadata: pimplify tokens_iterator
Because tokens_iterator refers to token_metadata_impl, the latter cannot
be moved out-of-line. So this patch pimplifies tokens_iterator as well.
2020-02-14 20:29:14 +02:00
Avi Kivity
9425e9c13d locator: token_metadata: make token_metadata_impl::tokens_iterator a non-nested class
In order to pimplify token_metadata_impl::tokens_iterator, we must make it
a non-nested class, since eventually token_metadata_impl will be an incomplete
class for users and nested classes cannot be forward declared. So this patch
makes it a non-nested class. Two inline functions that referred to it were
moved out of class scope so they can see the definition.

No functional changes.
2020-02-14 20:29:13 +02:00
Avi Kivity
6d53f240d1 locator: token_metadata: pimplify
token_metadata is a heavyweight class, with heavyweight include
dependencies (icl, which has tens of thousands of lines in headers),
heavyweight methods, but it is rarely used. So it is a classic candidate
for pimplification.

This patch splits off the implementation into token_metadata_impl
and leaves token_metadata as a forwarding class. Actual movement
of the code is left to a later patch to ease review.

Notes:
 - some constructors were made public due to limitations of std::make_unique
 - a few token_metadata methods pass *this along to external functions, so we
   now pass the holder object as "unpimplified_this" to support this.
2020-02-14 20:29:12 +02:00
Avi Kivity
90a3670952 locator: token_metadata: use non-deduced return type for ring_range()
Deduced return types are user hostile as the user has to look at the
implementation in order to understand what the return type is.
2020-02-14 15:44:46 +02:00
Konstantin Osipov
8b2ce03ce4 query_processor: add CQL logging to all major execute call sites.
Add missing CQL query logging to statement prepare, internal execute,
batch execute.
The logging is done under log level "trace".
2020-02-13 21:53:58 +03:00
Botond Dénes
78624b5069 test: sstable_datafile_test: add scrub unit test 2020-02-13 15:02:37 +02:00
Botond Dénes
26d4c8be95 compaction_manager: scrub: don't piggy-back on upgrade_sstables()
Now that we have the necessary infrastructure to do actual scrubbing,
don't rely on `upgrade_sstables()` anymore behind the scenes, instead do
an actual scrub.

Also, use the skip-corrupted flag.
2020-02-13 15:02:37 +02:00
Botond Dénes
33c126e8c0 compaction: introduce scrub_compaction
A specialized compaction subclass for executing a scrub compaction.
`scrub_compaction` supplies a specialized reader which will validate its
input and stop on the first error. If it is configured with
`skip_corrupted`, it will instead skip bad data, logging it.
2020-02-13 15:02:37 +02:00
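The two modes of the scrub reader described above (stop on the first error, or skip bad data and log it) can be sketched as a generator. The names below are hypothetical and the validity check is passed in as a plain callable; the real reader works on mutation fragments inside the compaction code.

```python
import logging

logger = logging.getLogger("scrub_sketch")


class CorruptDataError(Exception):
    """Raised when invalid input is found and skipping is not enabled."""


def scrub(fragments, is_valid, skip_corrupted=False):
    """Yield validated fragments; stop on the first bad one unless skipping."""
    for fragment in fragments:
        if is_valid(fragment):
            yield fragment
        elif skip_corrupted:
            # Bad data is dropped from the output, but leaves a trace in the log.
            logger.warning("skipping corrupted fragment: %r", fragment)
        else:
            raise CorruptDataError("invalid fragment: {!r}".format(fragment))
```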
Botond Dénes
1b7725af4b mutation_fragment_stream_validator: split into low-level and high-level API
The low-level validator allows fine-grained validation of different
aspects of monotonicity of a fragment stream. It doesn't do any error
handling. Since different aspects can be validated with different
functions, this allows callers to understand what exactly is invalid.

The high-level API is the previous fragment filter one. This is now
built on the low-level API.

This division allows for advanced use cases where the user of the
validator wants to do all error handling and wants to decide exactly
what monotonicity to validate. The motivating use-case is scrubbing
compaction, added in the next patches.
2020-02-13 15:02:32 +02:00
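The split described above can be sketched as follows, assuming integers stand in for partition keys and in-partition positions. The low-level class only reports whether each monotonicity aspect holds; the high-level filter is built on top of it and turns a failed check into an exception. All names are hypothetical.

```python
class LowLevelValidator:
    """Reports each monotonicity aspect as a boolean; does no error handling."""

    def __init__(self):
        self._prev_key = None
        self._prev_pos = None

    def on_partition_key(self, key):
        ok = self._prev_key is None or key > self._prev_key
        if ok:
            self._prev_key = key
            self._prev_pos = None   # positions restart in every partition
        return ok

    def on_position(self, pos):
        ok = self._prev_pos is None or pos > self._prev_pos
        if ok:
            self._prev_pos = pos
        return ok


class InvalidStreamError(Exception):
    pass


def validating_filter(fragments):
    """High-level API: pass fragments through, raising on the first violation."""
    low = LowLevelValidator()
    for kind, value in fragments:
        if kind == "partition":
            ok = low.on_partition_key(value)
        else:
            ok = low.on_position(value)
        if not ok:
            raise InvalidStreamError("out-of-order {}: {!r}".format(kind, value))
        yield kind, value
```

A caller such as a scrubbing compaction can use `LowLevelValidator` directly and decide for itself what to do when a check returns `False`.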
Juliusz Stasiewicz
c13e935eae tests/cdc: added test for TTL on log table cells 2020-02-13 14:00:53 +01:00
Piotr Sarna
f4d03d6063 docs: update write isolation documentation
The documentation now mentions all acceptable variants
of write isolation configuration values.
2020-02-13 13:51:31 +01:00
Piotr Sarna
8795323678 alternator-test: mark write isolation tests scylla_only
With scylla_only fixture already available, manual checks
for dynamodb no longer need to be performed.
2020-02-13 13:51:31 +01:00
Piotr Sarna
fba756858e alternator-test: add test for write isolation tag
Write isolation tags now accept only a small set of valid values.
The test case ensures that all valid values are accepted
and that invalid values return an error.
2020-02-13 13:51:31 +01:00
Piotr Sarna
fa4ddd2947 alternator: add validating write_isolation tag
In order to prevent users from using incorrect write isolation
configuration, a set of allowed values is introduced.
When tagging a resource (which is considered rare), a tag
will only be allowed if it belongs to the allowed set.
2020-02-13 13:51:31 +01:00
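A minimal sketch of the allowed-set check described above. The tag key and the value names are illustrative assumptions; the commit message does not list them.

```python
# Hypothetical tag key and allowed values, chosen for illustration only.
WRITE_ISOLATION_TAG = "write_isolation"
ALLOWED_WRITE_ISOLATION = frozenset(
    {"forbid_rmw", "always_use_lwt", "only_rmw_uses_lwt", "unsafe_rmw"}
)


class ValidationError(Exception):
    pass


def validate_tags(tags):
    """Reject a write isolation tag whose value is outside the allowed set."""
    value = tags.get(WRITE_ISOLATION_TAG)
    if value is not None and value not in ALLOWED_WRITE_ISOLATION:
        raise ValidationError(
            "invalid {} value: {!r}".format(WRITE_ISOLATION_TAG, value)
        )
    return tags
```

Since tagging a resource is rare, a set lookup on every tagging request is a negligible cost.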
Piotr Sarna
7e6c9cad9a alternator: move rmw_operation to a header
rmw_operation is a class with a public interface, including
a write_isolation enum and a fixed tag name for its configuration.
For convenience, it's moved to a header file, so that code
from executor.cc can use the definitions regardless of their
position in the source file - it prevents reordering functions
just to make sure that rmw_operation is defined before a function
that uses its attributes.
2020-02-13 13:51:31 +01:00
Konstantin Osipov
ced778ba0b query_processor: move raw_cql_statement to cql_statement
We'd like to log CQL statements inside batches, and they don't
have prepared_statement object created for them.
2020-02-13 13:35:37 +03:00
Piotr Sarna
f4a05e1d23 alternator-test: add test for recreating a table
The first iteration of keyspace-per-table approach for alternator
revealed an issue with recreating a table after deleting it.
This test case was used as a regression check.
2020-02-13 09:54:12 +01:00
Piotr Sarna
dca6c2c81d alternator: move to NetworkTopologyStrategy
Instead of SimpleStrategy, NetworkTopologyStrategy is used
for setting up the replication configuration for alternator tables.
Replication factor 3 is used along with a local datacenter,
unless alternator discovers that it's running on a test cluster with
fewer than 3 nodes - then RF is reduced accordingly and a warning is emitted,
which was also the case for SimpleStrategy.
2020-02-13 09:46:46 +01:00
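The replication factor rule described above can be sketched as below. The function name, the logger and the shape of the returned options are assumptions for illustration; only the target RF of 3 and the reduce-with-warning behaviour come from the commit message.

```python
import logging

logger = logging.getLogger("alternator_sketch")

DESIRED_RF = 3   # replication factor named in the commit message


def replication_options(local_dc, node_count):
    """Pick the RF for the local datacenter, shrinking it on small clusters."""
    rf = DESIRED_RF
    if node_count < DESIRED_RF:
        rf = max(node_count, 1)
        logger.warning(
            "only %d node(s) available, creating keyspace with RF=%d",
            node_count, rf,
        )
    return {"class": "NetworkTopologyStrategy", local_dc: rf}
```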
Piotr Sarna
3eb6da224b alternator: switch to keyspace-per-table approach
Instead of a monolith alternator keyspace, each table creates its own
keyspace, named in the following pattern: `a#TABLE_NAME`.
The `a#` prefix contains an illegal CQL character in order to ensure
that these keyspaces are never created via CQL.
2020-02-13 09:46:19 +01:00
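The naming scheme above is simple enough to sketch directly. The helper names are hypothetical, and the regular expression encodes an assumption that keyspace names creatable through CQL consist only of letters, digits and underscores, which is why a `#` keeps the two namespaces apart.

```python
import re

KEYSPACE_PREFIX = "a#"   # prefix given in the commit message

# Assumption: CQL only accepts keyspace names made of these characters.
_CQL_KEYSPACE_NAME = re.compile(r"[A-Za-z0-9_]+")


def keyspace_for_table(table_name):
    """Return the per-table keyspace name for an alternator table."""
    return KEYSPACE_PREFIX + table_name


def creatable_via_cql(keyspace_name):
    """True if the name would be accepted as a keyspace name in CQL."""
    return _CQL_KEYSPACE_NAME.fullmatch(keyspace_name) is not None
```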
Konstantin Osipov
b531a6fe82 query_processor: set raw_cql_statement consistently
raw_cql_statement is a member of prepared_statement which
is not set in its constructor because prepared_statement
constructor has too many call sites inside cql_statement
hierarchy.

cql_statement and prepared_statement dependency form a
cycle and long term it obviously should be fixed.

As a quick fix to query processor tracing, consistently
assign raw_cql_statement in all prepared_statement
usage sites.
2020-02-13 11:18:32 +03:00
Piotr Sarna
dcf54331ea alternator: allow custom names for keyspaces
The maybe_create_keyspace utility now accepts a parameter - the desired
name for a newly created keyspace.
2020-02-13 09:16:37 +01:00
Piotr Sarna
e93c54e837 db,view: fix generating view updates for partition tombstones
The update generation path must track and apply all tombstones,
both from the existing base row (if read-before-write was needed)
and for the new row. One such path contained an error, because
it assumed that if the existing row is empty, then the update
can be simply generated from the new row. However, lack of the
existing row can also be the result of a partition/range tombstone.
If that's the case, it needs to be applied, because it's entirely
possible that this partition row also hides the new row.
Without taking the partition tombstone into account, creating
a future tombstone and inserting an out-of-order write before it
in the base table can result in ghost rows in the view table.
This patch comes with a test which was proven to fail before the
changes.

Branches 3.1,3.2,3.3
Fixes #5793

Tests: unit(dev)
Message-Id: <8d3b2abad31572668693ab585f37f4af5bb7577a.1581525398.git.sarna@scylladb.com>
2020-02-12 23:16:30 +02:00
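The fixed path can be illustrated with a small sketch covering only the case the commit describes: there is no existing base row, and a partition tombstone may or may not be present. Rows are modelled as dictionaries with a `ts` timestamp, and the rule that a tombstone hides writes that are not newer than itself is an assumption of this sketch.

```python
def is_shadowed(row_ts, tombstone_ts):
    # Assumption: a tombstone hides any write whose timestamp is not newer.
    return tombstone_ts is not None and row_ts <= tombstone_ts


def view_row_when_no_existing_row(new_row, partition_tombstone_ts=None):
    """Return the row to write to the view, or None if nothing may be written."""
    if is_shadowed(new_row["ts"], partition_tombstone_ts):
        # The existing row is missing because of the tombstone, and the same
        # tombstone also hides the out-of-order write: no view row is created.
        return None
    return new_row
```

Skipping the tombstone check here would write the view row for the out-of-order write, which is the ghost row the commit message describes.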
Tomasz Grabiec
3252068588 Merge "Multiple cleanups in cql3" from Kostja
These series were born when working on debugging (missing)
query processor trace-level logging, and trying to identify
all entry points into parsed_statement::prepare().

Unfortunately I was unable to easily merge prepared_statement
and cql_statement objects.

Rationale for individual patches is given in commit comments.
2020-02-12 17:33:39 +01:00
Nadav Har'El
b93204d6bf Alternator: allow CreateTable with streams explicitly turned off
While Alternator doesn't yet support creating a table with streams
(i.e., CDC) turned on, we should only fail the creation if streams
were really turned on. If the StreamSpecification option exists, but
does *not* ask to turn on streams, we should not fail the creation -
and this patch fixes this.

This patch also adds two tests - one where StreamSpecification is
passed but does not ask to turn on streams (so table creation should
succeed), and another test which explicitly requests to turn on
streams. The second test still xfails on Alternator, and should continue
to do so until we implement streams (we do *not* want to silently
ignore a request to turn on streams).

Fixes #5796

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200212100546.16337-1-nyh@scylladb.com>
2020-02-12 17:29:02 +01:00
Avi Kivity
48b694df55 cql3: like_matcher: pimplify to reduce inclusions of boost/regex
boost/regex has huge header dependencies amounting to tens of thousands of
lines. These are now replicated in 167 translation units.

This patch converts like_matcher to use the pointer-to-implementation
idiom, which reduces the number of translation units including boost/regex
to 28.

Since regular expressions are relatively expensive, and like_matcher is
relatively rare, the extra memory usage and run time will be
negligible.
Message-Id: <20200211170152.809554-1-avi@scylladb.com>
2020-02-12 17:04:12 +02:00
Konstantin Osipov
d4866c1a28 cql3: remove prepared alias for prepared_statement
cql3 has cql_statement, parsed_statement and prepared_statement
classes, which, largely, stand for the same thing. prepared was
an alias for prepared_statement which only required an extra
tag jump in IDE and carried no meaning.
2020-02-12 16:44:43 +03:00
Konstantin Osipov
cfdef844d8 cql3: remove unused include from parsed_statement.hh 2020-02-12 16:44:43 +03:00
Konstantin Osipov
bcb094c87a query_processor: move parsed_statement definition to raw/
This is where parsed_statement declaration resides,
put the definition next to declaration as is conventional
for the rest of the classes.
2020-02-12 16:44:43 +03:00
Konstantin Osipov
93db4d748c query_processor: fold one execute_internal() into another.
All internal execution always uses query text as a key in the
cache of internal prepared statements. There is no need
to publish API for executing an internal prepared statement object.

The folded execute_internal() calls an internal prepare() and then
internal execute().
execute_internal(cache=true) does exactly that.
2020-02-12 16:44:12 +03:00
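The folded call can be sketched as a prepare step keyed by query text followed by an execute step. This is an illustration only: the class, the injected `prepare` and `execute` callables and the default value of `cache` are assumptions, not the query_processor interface.

```python
class InternalQueryRunner:
    """Caches internal prepared statements by their query text."""

    def __init__(self, prepare, execute):
        self._prepare = prepare   # callable: query text -> statement
        self._execute = execute   # callable: (statement, values) -> result
        self._cache = {}

    def execute_internal(self, query_text, values=(), cache=True):
        if cache:
            stmt = self._cache.get(query_text)
            if stmt is None:
                stmt = self._prepare(query_text)
                self._cache[query_text] = stmt
        else:
            # One-off statement: prepare it without touching the cache.
            stmt = self._prepare(query_text)
        return self._execute(stmt, values)
```

Because the text is the cache key, callers never need to hold a prepared statement object themselves.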
Konstantin Osipov
2e07c76153 query_processor: rename process_statement_prepared
Rename process_statement_prepared to execute_prepared
for consistency with the rest of query_processor API.
2020-02-12 16:37:08 +03:00
Konstantin Osipov
1a53458239 query_processor: rename one overload of process()
Rename an overloaded function process() to execute_direct().
Execute direct is a common term for executing a statement
that was not previously prepared. See, for example
SQLExecuteDirect in ODBC/SQL CLI specification,
mysql_stmt_execute_direct() in MySQL C API or EXECUTE DIRECT
in Postgres XC.
2020-02-12 16:36:56 +03:00
Konstantin Osipov
170d41acf4 query_processor: fold process_statement_unprepared into process()
process_statement_unprepared() is used in ::process() only and
can be inlined.

This will simplify understanding CQL log output.
2020-02-12 16:22:15 +03:00
Piotr Sarna
f4e51a96ca alternator: replace overloaded with overloaded_functor
Turns out we already have a utility header for a visitor
with overloaded lambdas. This patch purges the explicit
reimplementation of the same trick and uses the existing
class instead.
Message-Id: <60c0b9a978f8208b188ef6ddc0564cb133bed707.1581496049.git.sarna@scylladb.com>
2020-02-12 14:21:42 +02:00
Amnon Heiman
8581617e78 api/storage_service: protect the objects during function call
The list_snapshot API uses an HTTP stream to stream the result to the
caller.

It needs to keep all objects and stream alive until the stream is closed.

This patch adds do_with to hold these objects during the lifetime of the
function.

Fixes #5752
2020-02-12 13:08:34 +02:00
Calle Wilund
5e46079e89 exceptions: Set correct error code in truncate_exception
Refs #4924

truncate_exception should, like its origin counterpart, set
error code to TRUNCATE_ERROR, not PROTOCOL_ERROR.

tests: unit + partial dtest
Message-Id: <20200212100920.14478-1-calle@scylladb.com>
2020-02-12 11:17:16 +01:00
Avi Kivity
da00530464 Update seastar submodule
* seastar 1c7bccc500...6d2ed8cdc6 (11):
  > connect_test: keep socket alive until the end.
  > Merge "Add timeout to smp::submit_to() and friends" from Botond
  > reactor: use reference to addrlen in accept
  > tests: stall_detector_test: use same clock as in test as in the detector
  > reactor: fallback to epoll backend when fs.aio-max-nr is too small
  > util: move read_sys_file_as() from iotune to seastar header, rename read_first_line_as()
  > core/resources: fix cpuset error
  > distributed_tests: increase sleep time further
  > core: thread: Fix compilation error in comment
  > reactor: specialize the pollable_fd_state
  > build: Use with -fstack-clash-protection when using guard pages
2020-02-12 12:07:00 +02:00
Avi Kivity
a8a4e584ec Merge "Move token_metadata from storage_service" from Pavel
"
Lots of code needs storage_service just to get token_metadata from.
This creates unwanted dependency loops and increases the use of
global storage_service instance.

This set keeps the sharded<locator::token_metadata> on main's stack
and carries the references where needed. This removes the dependency
on storage_service from:

- storage_proxy
- gossiper
- redis
- batchlog manager

and makes the database only need it for sstables_format (will fix
in one of the next sets).

Also, this set is the prerequisite for controlling the copying of
token_metadata instances (spotted two occurrences in bootstrap
code).

Tests: unit(dev), manual start-stop
"

* 'br-token-metadata-standalone-2' of https://github.com/xemul/scylla:
  api: Keep and use reference on token_metadata
  redis: Use proxy token_metadata
  gossiper: Keep needed for failure_detection values on board
  database: Use own token_metadata
  batchlog: Use token_metadata from proxy
  proxy: Use own token_metadata
  gossiper: Use own token_metadata
  tokens: Switch into standalone sharded instance
  batchlog: Use in-config ring-delay
  database: Have it in size_estimate_virtual_reader
  storage_proxy: Pass token_metadata in some static helpers
  storage_service: Move get_local_tokens wrapper
  size_estimates_virtual_reader: Make get_local_ranges static
  migration_manager: Refactor validation of new/updating ksm
  storage_service: Tiny cleanup of excessive self-reference
2020-02-11 19:15:22 +02:00
Botond Dénes
7d3bce403d sstables: compaction_stop_exception: add retry flag
Allow the thrower to communicate that it doesn't want the compaction to
be retried later. I know, using exceptions for control flow is *very*
bad, but this is the existing mechanism to stop a compaction and I don't
want to invent a new one for this.

Also massage the error messages a bit to take the value of this flag
into consideration.
2020-02-11 18:38:35 +02:00
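The retry flag described above can be sketched as an exception attribute that the retry loop consults. The class name, message wording and retry loop are hypothetical; only the idea of a thrower-controlled retry flag that also shapes the error message comes from the commit.

```python
class CompactionStopped(Exception):
    """Signals that a compaction was stopped; `retry` says whether to try again."""

    def __init__(self, table, reason, retry=True):
        self.retry = retry
        verdict = "will be retried" if retry else "will not be retried"
        super().__init__(
            "Compaction for {} was stopped due to: {} ({})".format(
                table, reason, verdict
            )
        )


def run_compaction(task, max_attempts=3):
    """Run `task`, retrying only when the thrower allows it."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except CompactionStopped as e:
            if not e.retry or attempt == max_attempts:
                raise
```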
Avi Kivity
ba30a4074d Merge "stop passing tracing state pointer in client_state" from Gleb
"
client_state is used simultaneously by many requests running in parallel
while tracing state pointer is per request. Both those facts do not sit
well together and as a result sometimes tracing state is being overwritten
while still being used by an active request, which may cause an incorrect trace
or even a crash.
"

Fixes #5700.

* 'gleb/tracing_fix_v1' of github.com:scylladb/seastar-dev:
  client_state: drop the pointer to a tracing state from client_state
  transport: pass tracing state explicitly instead of relying on it been in the client_state
  alternator: pass tracing state explicitly instead of relying on it been in the client_state
2020-02-11 17:59:20 +02:00
Botond Dénes
8014c7124d compaction_manager: collect all cleanup related logic in perform_cleanup()
Currently the call chain for a cleanup collection looks like this:
compaction_manager::perform_cleanup()
    compaction_manager::rewrite_sstables()
        table::cleanup_sstables()
            ...

`perform_cleanup()` is essentially empty, immediately deferring to
`rewrite_sstables()`. Cleanup related logic is scattered between the
latter two methods on the call chain. These methods however recently
started serving as generic methods for compactions that want to
rewrite each sstable one-by-one, collecting cleanup related ifs in
various places.
The reason is historic, we first had cleanup, then bolted others on top,
trying to share the underlying code as much as possible.

It is time this is cleaned up (pun intended). Make `perform_cleanup()`
the place where all cleanup related logic is, with the rest of the stack
made truly generic.
2020-02-11 17:47:44 +02:00
Botond Dénes
b2dc5d4895 compaction: compaction_descriptor: use compaction options instead of cleanup flag
Instead of the restrictive `cleanup` boolean flag, which allows for choosing
between only two compaction types, use `compaction_options`, which in
addition to allowing any number of compaction types to be selected,
also allows seamlessly passing specific options to them.
2020-02-11 17:47:44 +02:00
Botond Dénes
8579bef076 compaction: introduce compaction_options
Currently the compaction API is quite restrictive. It offers a generic
`compact_sstables()` and `reshard_sstables()` methods. The former is the
one used by all but resharding, however it only really supports two
modes: regular and cleanup. The latter is supported by a semi-hidden
`cleanup` flag in `compaction_description`. Actually there are two more
compaction types already which are piggy-backed on cleanup: upgrade and
scrub. The upper layers distinguish between actual cleanup and "fake"
cleanup by a `is_actual_cleanup` flag. The latter two "fake" cleanup
compactions cannot be distinguished even by the upper layers.
This is terribly confusing and hard to follow, in addition to being
restrictive.

This worked so far, because upgrade is served quite well by the cleanup
compaction type, turning off certain preparations by the above mentioned
`is_actual_cleanup` flag. Scrub is barely implemented and just an
upgrade behind the scenes.

This situation is however preventing really specializing each
compaction. Enter `compaction_options`. This variant in disguise is
designed to allow passing specific options to each compaction type, and
doubles as an enum allowing more than two low-level compaction types.

This patch only adds the option class itself, propagating and handling
it will be done by the next patches.
2020-02-11 17:47:44 +02:00
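The "variant in disguise" idea can be sketched with one small class per compaction type, where only the types that need options carry fields. The class names follow the compaction types mentioned in these commits, but the field and the dispatch function are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Regular:
    pass


@dataclass(frozen=True)
class Cleanup:
    pass


@dataclass(frozen=True)
class Upgrade:
    pass


@dataclass(frozen=True)
class Scrub:
    skip_corrupted: bool = False   # option specific to this compaction type


def compaction_type_name(options):
    """Dispatch on the alternative held, as a visitor over a variant would."""
    if isinstance(options, Scrub):
        suffix = " (skip corrupted)" if options.skip_corrupted else ""
        return "Scrub" + suffix
    if isinstance(options, (Regular, Cleanup, Upgrade)):
        return type(options).__name__
    raise TypeError("unknown compaction options: {!r}".format(options))
```

Compared with a boolean `cleanup` flag, the type of the value selects the compaction and its fields carry that compaction's own settings.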
Botond Dénes
6bc3b41c20 compaction: compaction_type: add Upgrade
Although we currently do support upgrade compaction, it is piggy-backed
on top of cleanup compaction. This is soon going to change, so in
preparation to that, add an `Upgrade` member to the `compaction_type`
enum.
2020-02-11 17:47:44 +02:00
Botond Dénes
0b53ccaecd table: cleanup_sstables(): only short-circuit on actual cleanup
Currently the cleanup call is short circuited if it is determined that
cleanup is not needed for the sstable to-be-cleaned-up. This is
undesired because cleanup is not the only user of this routine for rewriting
sstables: sstable-upgrade and sstable-scrub also use it, and they want
to go on with rewriting the sstable even if all data in it
belongs to the current node.

Fix: #5699
2020-02-11 17:47:44 +02:00
Nadav Har'El
9fad494572 merge: Reduce #include bloat around cql3 internals from non-cql3 users
Merged pull request https://github.com/scylladb/scylla/pull/5755 from
Avi Kivity:

This series removes some #include dependencies around cql3. It results in
30k line (6.6%) reduction in the preprocessed size of database.i, mainly
due to elimination of boost::regex (which was brought in in turn by
like_matcher). This should result in fewer and faster recompiles.

commits:
    tracing: remove #include of modification_statement.hh from table_helper
    cql3: selection: remove now-unneeded include of statement_restrictions.hh
    cql3: deinline result_set_builder::restrictions_filter constructor
    view_info: remove include of select_statement.hh
    cql3: selection: remove unnecessary include of selector_factories
    cql3: query_processor: reduce #includes
2020-02-11 15:58:29 +02:00
Juliusz Stasiewicz
67b92c584f cdc: set TTLs on CDC log cells
Cells in CDC logs used to be created while completely neglecting
TTLs (the TTLs from `cdc = {...'ttl':600}`). This patch adds TTLs
to all cells; there are no row markers, so we need not set TTL
there.

Fixes #5688
2020-02-11 12:56:41 +01:00
Eliran Sinvani
9eb6ac7162 docker: add rsyslog for syslog support
One of the logging options for Scylla is syslog. Until now, this method
wasn't supported in the docker images that are
created with the Dockerfile in the repo.
This commit adds rsyslog installation, configuration and
setup for Docker.

Tests: built and ran the docker and validated the existence
of the /dev/log socket.

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <20200210112448.210169-1-eliransin@scylladb.com>
2020-02-11 13:30:59 +02:00
Tomasz Grabiec
165913598b Revert "features: Stop on shutdown"
This reverts commit ca55c6c15f.

Triggers the broken promise exception on aborted stop.

If the feature service is stopped without enabling some features,
the latter may end up with "broken promise" exception on futures
attached to the _pr promise.
2020-02-11 11:57:22 +01:00
Botond Dénes
3164456108 row: append(): downgrade assert to on_internal_error()
This assert, added by 060e3f8 is supposed to make sure the invariant of
the append() is respected, in order to prevent building an invalid row.
The assert however proved to be too harsh, as it converts any bug
causing out-of-order clustering rows into cluster unavailability.
Downgrade it to on_internal_error(). This will still prevent corrupt
data from spreading in the cluster, without the unavailability caused by
the assert.

Fixes: #5786
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200211083829.915031-1-bdenes@scylladb.com>
2020-02-11 11:07:42 +02:00
Piotr Sarna
b977aa034b Merge 'cdc: disallow negative TTL values in CDC options' from Juliusz
Setting TTL = -1 in cdc_options prevents any writes to CDC log.
But enabling CDC and having unwritable log table makes no sense.

Notably, normal writes USING TTL -1 are forbidden. This patch does
the same to TTLs in CDC options.

Fixes #5747

* jul-stas/5747-cdc-disallow-negative-ttl:
  tests/cdc: added test for exception when TTL < 0
  cdc: disallow negative TTL values in CDC
2020-02-11 09:23:56 +01:00
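A minimal sketch of the check described above, assuming the CDC options arrive as a dictionary with an optional `ttl` entry. The function name, the exception type and the default value are assumptions, not the actual implementation.

```python
DEFAULT_CDC_TTL = 86400   # assumed default, in seconds


def parse_cdc_ttl(options):
    """Return the TTL from CDC options, rejecting negative values."""
    ttl = int(options.get("ttl", DEFAULT_CDC_TTL))
    if ttl < 0:
        # A negative TTL would make the log table unwritable.
        raise ValueError("negative TTL in CDC options: {}".format(ttl))
    return ttl
```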
Pavel Emelyanov
ac998e9576 repair: Do not explicitly switch sched group
When registering callbacks for row-level repair verbs the
sched group is assigned automatically with the help of
messaging_service::scheduling_group_for_verb. Thus the
lambda will be called in the needed sched group, no
need for manual switch.

This removes the last occurrence of global storage_service
usage from row-level repair.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-10 22:15:44 +03:00
Pavel Emelyanov
ccc102affa repair: Use db from callee
The do_repair_start() emulates db.invoke_on_all and can
re-use the db.local() inside without the need to call for
global storage_service instance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-10 22:13:03 +03:00
Pavel Emelyanov
c6ddd21c50 repair_writer: Use db from repair_meta
The caller of repair_writer.create_writer already
has the needed reference on database, no need to
get it from global storage_service instance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-10 22:10:42 +03:00
Juliusz Stasiewicz
c0edc2bf53 tests/cdc: added test for exception when TTL < 0 2020-02-10 19:13:59 +01:00
Pavel Emelyanov
5434e412e4 api: Keep and use reference on token_metadata 2020-02-10 20:54:32 +03:00
Pavel Emelyanov
4b2307c8b6 redis: Use proxy token_metadata
This removes dependency between redis and storage_service
2020-02-10 20:54:32 +03:00
Pavel Emelyanov
eb827c9f5d gossiper: Keep needed for failure_detection values on board
And drop the gossiper -> storage_service link

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-10 20:54:32 +03:00
Pavel Emelyanov
1a3f78a57d database: Use own token_metadata
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-10 20:54:32 +03:00
Pavel Emelyanov
7cdfd94207 batchlog: Use token_metadata from proxy
This kills the second global reference on storage_service from
batchlog code and breaks the dependency loop between these two.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-10 20:54:32 +03:00
Pavel Emelyanov
fecea1de7e proxy: Use own token_metadata
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-10 20:54:32 +03:00
Pavel Emelyanov
2f3490dc8d gossiper: Use own token_metadata
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-10 20:54:32 +03:00
Pavel Emelyanov
c5997b573c tokens: Switch into standalone sharded instance
Way too many places in code need storage_service just for token_metadata.
These references increase the amount of get(_local)?_storage_service()
calls and create loops in components dependencies. Keep the token_metadata
separately from storage_service and pass instances' references where
needed (for now -- only into the storage_service itself).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-10 20:54:32 +03:00
Pavel Emelyanov
b4e66ddf1d batchlog: Use in-config ring-delay
This kills the first (out of two) global reference on storage_service

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-10 20:54:32 +03:00
Pavel Emelyanov
9257346c18 database: Have it in size_estimate_virtual_reader
This is to remove the last global reference on storage_service

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-10 20:54:32 +03:00
Pavel Emelyanov
bf5be0e971 storage_proxy: Pass token_metadata in some static helpers
Soon there will be token_metadata on storage_proxy, so
prepare for that in advance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-10 20:54:32 +03:00
Pavel Emelyanov
6050c559a3 storage_service: Move get_local_tokens wrapper
This wrapper just makes sure the system_keyspace::get_saved_tokens
reports non empty result. Move them close together.

As a side effect -- get rid of penultimate global storage_service
reference from size_estimates_virtual_reader (the last one will
be removed soon).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-10 20:54:31 +03:00
Piotr Sarna
bfd7d74b0f Merge 'Protect CDC-related tables from being modified by the user' from Piotr
This patch introduces following modifications:

    Disallows enabling cdc for table X when X_scylla_cdc_log already exists,
    Restricts DROP permissions for X_scylla_cdc_log tables,
    Restricts ALTER and DROP permissions for cdc_description and cdc_topology_description,
    Disallows cdc option when creating materialized views.

Refs #4991.
Tests: unit(dev).

* piodul/4991-permissions-for-cdc-tables:
  cdc: disallow CDC options for materialized views
  cdc: restrict permissions on cdc_(topology_)description
  cdc: restrict permissions on _scylla_cdc_log tables
  cdc: refuse to enable cdc when table _scylla_cdc_log exists
2020-02-10 18:02:43 +01:00
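The first rule in the list above (refuse to enable CDC for table X when X_scylla_cdc_log already exists) can be sketched as a name check. The function and exception names are hypothetical; only the log table suffix comes from the commit titles.

```python
CDC_LOG_SUFFIX = "_scylla_cdc_log"   # suffix taken from the commit titles


class InvalidRequest(Exception):
    pass


def log_table_name(base_table):
    return base_table + CDC_LOG_SUFFIX


def check_can_enable_cdc(base_table, existing_tables):
    """Raise if the log table name is already taken by some other table."""
    log_name = log_table_name(base_table)
    if log_name in existing_tables:
        raise InvalidRequest(
            "cannot enable cdc for {}: table {} already exists".format(
                base_table, log_name
            )
        )
```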
Raphael S. Carvalho
140520ff87 sstables/compaction_manager: add metric for pending compaction tasks
we have compaction_manager.compactions metric for the number of active tasks,
but they don't account for tasks blocked waiting for an opportunity to run,
and they're the problematic ones.

Fixes #5254.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200210131929.30981-1-raphaelsc@scylladb.com>
2020-02-10 17:55:02 +01:00
Pavel Emelyanov
17db6df15c size_estimates_virtual_reader: Make get_local_ranges static
There's the call of the same name in storage_service, so
make this one explicitly static for better readability.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-10 18:10:39 +03:00
Pavel Emelyanov
de1dc59548 migration_manager: Refactor validation of new/updating ksm
The goal is to have token_metadata reference inside the
keyspace_metadata.validate method. This can be achieved
by doing the validation through the database reference
which is "at hand" in migration_manager.

While at it, merge the validation with exists/not-exists
checks done in the same places.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-10 18:10:38 +03:00
Pavel Emelyanov
01a28867d6 storage_service: Tiny cleanup of excessive self-reference
Do not use get_local_storage_service inside storage_service method

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-10 18:10:38 +03:00
Piotr Dulikowski
949642b866 cdc: disallow CDC options for materialized views
While it didn't have any effect, it was possible to supply cdc options
for a materialized view. This change disallows it.
2020-02-10 15:51:11 +01:00
Piotr Dulikowski
81fa59e178 cdc: restrict permissions on cdc_(topology_)description
Following permissions are disallowed on cdc_description and
cdc_topology_description: ALTER, DROP.
2020-02-10 15:40:48 +01:00
Piotr Dulikowski
6fe4f9ded8 cdc: restrict permissions on _scylla_cdc_log tables
Disallows DROP permission on CDC log tables.
2020-02-10 15:40:48 +01:00
Piotr Dulikowski
0c18742997 cdc: refuse to enable cdc when table _scylla_cdc_log exists 2020-02-10 15:40:48 +01:00
Gleb Natapov
31cf2434d6 client_state: drop the pointer to a tracing state from client_state
client_state is shared between requests and tracing state is per
request. It is not safe to use the former as a container for the latter
since a state can be overwritten prematurely by subsequent requests.
2020-02-10 14:59:22 +02:00
Takuya ASADA
43097854a5 dist/debian: keep /etc/systemd .conf files on 'remove'
Since dpkg does not re-install conffiles when they are removed by the user,
currently we are missing dependencies.conf and sysconfdir.conf on rollback.
To prevent this, we need to stop running
'rm -rf /etc/systemd/system/scylla-server.service.d/' on 'remove'.

Fixes #5734
2020-02-10 14:54:25 +02:00
Gleb Natapov
9f1f60fc38 transport: pass tracing state explicitly instead of relying on it been in the client_state
Multiple requests can use the same client_state simultaneously, so it is
not safe to use it as a container for a tracing state which is per request.
Currently the next request may overwrite the tracing state of the previous one,
causing, in the best case, a wrong trace to be taken, or a crash if the overwritten
pointer is freed prematurely.

Fixes #5700
2020-02-10 14:54:15 +02:00
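The hazard and the fix described above can be sketched without any real concurrency: two requests that share one connection-wide object overwrite each other's slot, while an explicit argument keeps each request's trace state separate. All names below are hypothetical.

```python
class ClientState:
    """Connection-wide state shared by every request on that connection."""

    def __init__(self):
        self.trace_state = None   # the unsafe slot that the commits remove


def begin_request_unsafe(client_state, trace_state):
    # Unsafe: a second request overwrites the slot while the first is in flight.
    client_state.trace_state = trace_state


def execute_request(client_state, query, trace_state):
    # Safe: the per-request trace state travels as an explicit argument.
    return {"query": query, "trace": trace_state}
```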
Gleb Natapov
38fcab3db4 alternator: pass tracing state explicitly instead of relying on it been in the client_state
Multiple requests can use the same client_state simultaneously, so it is
not safe to use it as a container for a tracing state which is per
request. This is not yet an issue for the alternator since it creates
new client_state object for each request, but first of all it should not
and second trace state will be dropped from the client_state, by later
patch.
2020-02-10 14:50:55 +02:00
Juliusz Stasiewicz
133156ddcf cdc: disallow negative TTL values in CDC 2020-02-10 13:50:00 +01:00
Kamil Braun
6c4f2b9717 storage_service: check for CDC flag in start_gossiping
This is a bug: we tried to retrieve the CDC streams timestamp even if
CDC flag was not enabled in storage_service::start_gossiping.
2020-02-10 14:30:35 +02:00
Takuya ASADA
b6988112b4 scylla_post_install.sh: fix operator precedence issue with multiple statements
In bash, 'A || B && C' is a problem because when A is true, C is still
evaluated, since && and || have the same precedence.
To avoid the issue we need to make 'B && C' a single statement.
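
To make the grouping concrete, here is a small Python model of which commands run in each form. It is a sketch of the evaluation order only; `run_chain` and `run_grouped` are hypothetical helpers, and the booleans stand for "exit status 0".

```python
def run_chain(a, b):
    """Model bash's `A || B && C`, which groups as `(A || B) && C`.

    `a` and `b` are True when that command would exit with status 0.
    Returns the names of the commands that would run.
    """
    ran = ["A"]
    status = a
    if not status:          # `||` runs B only when A failed
        ran.append("B")
        status = b
    if status:              # `&&` applies to the result of `A || B`
        ran.append("C")
    return ran


def run_grouped(a, b):
    """Model `A || { B && C; }`, where `B && C` is a single statement."""
    ran = ["A"]
    if not a:
        ran.append("B")
        if b:
            ran.append("C")
    return ran


when_a_succeeds = run_chain(True, True)          # C runs although A succeeded
when_a_succeeds_fixed = run_grouped(True, True)  # only A runs
```

The two forms differ only when A succeeds; when A fails they run the same commands.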

Fixes #5764
2020-02-10 14:29:40 +02:00
Avi Kivity
bed61b96a2 Merge "Move features from storage- into feature-service" from Pavel
"
There's a lot of code around that needs storage service purely to
get the specific feature value (cluster_supports_<something> calls).
This creates several circular dependencies, e.g. storage_service <->
migration_manager one and database <-> storage_service. Also features
sit on storage_service, but register themselves on the feature_service
and the former subscribes on them back which also looks strange.

I propose to keep all the features on feature_service, this keeps the
latter independent from other components, makes it possible to break
one of the mentioned circular dependencies and heavily relax the other.

Also the set helps us fighting the globals and, after it, the
feature_service can be safely stopped at the very last moment.

Tests: unit(dev), manual debug build start-stop
"

* 'br-features-to-service-5' of https://github.com/xemul/scylla:
  gossiper: Avoid string merge-split for nothing
  features: Stop on shutdown
  storage_service: Remove helpers
  storage_service: Prepare to switch from on-board feature helpers
  cql3: Check feature in .validate
  database: Use feature service
  storage_proxy: Use feature service
  migration_manager: Use feature service
  start: Pass needed feature as argument into migrate_truncation_records
  features: Unfriend storage_service
  features: Simplify feature registration
  features: Introduce known_feature_set
  features: Move disabled features set from storage_service
  features: Move schema_features helper
  features: Move all features from storage_service to feature_service
  storage_service: Use feature_config from _feature_service
  features: Add feature_config
  storage_service: Kill set_disabled_features
  gms: Move features stuff into own .cc file
  migration_manager: Move some fns into class
2020-02-09 19:22:07 +02:00
Calle Wilund
af963e76c7 keyspace/distributed_loader: Add wait for (user) keyspace population to finish
Allows caller to check/wait for a given user keyspace to finish
populating on boot.

Can be called at any time, though if called before population
starts, it will wait until it either starts and we can determine
that the keyspace does not need populating, or population finishes.

tests: unit

Message-Id: <20200203151712.10003-1-calle@scylladb.com>
2020-02-09 18:56:22 +02:00
Pavel Emelyanov
d1775dd701 utils: Move disk-error-handler into it
The disk-error-handler is a purely auxiliary thing that helps
propagating IO errors to the rest of the code. It well
deserves not sitting in the root namespace.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200207112443.18475-1-xemul@scylladb.com>
2020-02-09 17:26:52 +02:00
Pavel Solodovnikov
bcc4647552 lwt: fix handling of nulls in parameter markers for LWT queries
This patch affects the LWT queries with IF conditions of the
following form: `IF col in :value`, i.e. if the parameter
marker is used.

When executing a prepared query with a bound value
of `(None,)` (tuple with null, example for Python driver), it is
serialized not as NULL but as "empty" value (serialization
format differs in each case).

Therefore, Scylla deserializes the parameters in the request as
empty `data_value` instances, which are, in turn, translated
to non-empty `bytes_opt` with empty byte-string value later.

Account for this case too in the CAS condition evaluation code.

Example of a problem this patch aims to fix:

Suppose we have a table `tbl` with a boolean field `test` and
INSERT a row with NULL value for the `test` column.

Then the following update query fails to apply due to the
error in IF condition evaluation code (assume `v=(null)`):
`UPDATE tbl SET test=false WHERE key=0 IF test IN :v`
returns false in `[applied]` column, but is expected to succeed.
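
The idea can be sketched as follows, in Python rather than the actual CAS evaluation code. `normalize` and `in_condition_holds` are hypothetical names, and folding every empty value into NULL is a simplification made for this illustration; the commit message does not say how types with a legitimate empty value are treated.

```python
def normalize(value):
    """Treat an "empty" serialized value like NULL (None).

    Assumption for this sketch: an empty byte string in the bound tuple
    was meant as NULL, as in the `(None,)` example above.
    """
    if value is None or value == b"":
        return None
    return value


def in_condition_holds(cell_value, bound_values):
    """Evaluate `IF col IN :v` for one cell against the bound tuple."""
    cell = normalize(cell_value)
    return any(normalize(v) == cell for v in bound_values)


# A NULL cell compared with a tuple whose element arrived as "empty".
null_matches_empty = in_condition_holds(None, (b"",))
# A NULL cell must still not match a real value.
null_matches_value = in_condition_holds(None, (b"\x01",))
```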

Tests: unit(debug, dev), dtest(prepared stmt LWT tests at https://github.com/scylladb/scylla-dtest/pull/1286)

Fixes: #5710

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200205102039.35851-1-pa.solodovnikov@scylladb.com>
2020-02-09 16:50:42 +02:00
Avi Kivity
b26ded8ec5 tracing: remove #include of modification_statement.hh from table_helper
Replace with a forward declaration to reduce #include bloat and dependencies.
2020-02-09 13:04:13 +02:00
Avi Kivity
f8e85e5c2a cql3: selection: remove now-unneeded include of statement_restrictions.hh
Actual users gain #includes of statement_restrictions and query_options that
they previously got through selection.hh.
2020-02-09 13:01:32 +02:00
Avi Kivity
710e4ec99d cql3: deinline result_set_builder::restrictions_filter constructor
It stands in the way of #include removal, so it must go. It should
have no performance impact as it is too large to be inlined.
2020-02-09 13:00:17 +02:00
Avi Kivity
c6118d96d2 view_info: remove include of select_statement.hh
It is not needed by users of view_info.
2020-02-09 12:43:33 +02:00
Avi Kivity
7474db4075 cql3: selection: remove unnecessary include of selector_factories
It is only mentioned in the header file, so the forward declaration
can be used and the include moved to the real users.
2020-02-09 12:37:36 +02:00
Avi Kivity
dcab666d52 cql3: query_processor: reduce #includes
query_processor is a central class, so reducing its includes
can reduce dependencies tree-wide. This patch removes includes
for parsed_statement, cf_statement, and untyped_result_set and
fixes up the rest of the tree to include what it lacks as a result
of these removals.
2020-02-09 12:24:24 +02:00
Nadav Har'El
576f80be74 alternator-test: add comprehensive tests for KeyConditionExpression
This patch adds comprehensive tests for KeyConditionExpression, the newer
DynamoDB API syntax for specifying the item range which is requested from
a Query (this syntax replaced the older KeyConditions syntax, which
Alternator already supports).

Before this patch, we had only a small test for KeyConditionExpression
in test_query.py. This patch replaces it by a large number of small
tests, testing the many sub-features of KeyConditionExpression -
its different operators, sort-key types, different failure modes, etc.

As usual, because we haven't yet implemented this feature in Alternator
(see issue #5037), all these tests pass on AWS, but xfail on Alternator.

Despite the new test file containing about 40 small tests, it finishes
very quickly because we use pytest's fixture feature to allow small
read-only tests to perform a query to a partition that is only written
once for many tests. So these small tests become extremely fast, and
there is no downside to having many small tests instead of lumping them
into fewer large tests checking many things.

Refs #5037.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200207134159.3283-1-nyh@scylladb.com>
2020-02-08 11:10:09 +02:00
Piotr Dulikowski
534e9ba27d cdc: store information on ttl in "ttl" column, not in tuples
This patch changes the way TTL is stored in the CDC log table. Instead
of including TTL of cell `X` in the third element of the tuple in column
`_X`, TTL is written to the previously unused column `ttl`. This is done
for cosmetic purposes.

This implementation works under the assumption that there will be only one
TTL included in a mutation coming from a CQL write. This might not be
the case when writing a batch that modifies the same row twice, e.g.:

```
BATCH
INSERT INTO ks.t (pk, ck, v1) VALUES (1,2,3) USING TTL 10;
INSERT INTO ks.t (pk, ck, v2) VALUES (1,2,3) USING TTL 20;
END BATCH
```

In this case, this implementation will choose only one TTL value to be
written in the CDC log:

```
... | batch_seq_no | _ck | _pk | _v1    | _v2    | operation | ttl
...-+--------------+-----+-----+--------+--------+-----------+-----
... |            0 |   2 |   1 | (0, 3) | (0, 3) |         1 |  20
```

This behavior might be changed as a part of issue #5719, which considers
splitting a batch write mutation when it contains multiple writes to the
same row.

Refs #5689
Tests: unit(dev)
2020-02-08 11:10:09 +02:00
Pavel Emelyanov
e2ec5eecf6 view_update: Do not need storage_proxy
The view_update_generator accepts (and keeps) database and storage_proxy,
the latter is only needed to initialize the view_updating_consumer which,
in turn, only needs it to get database from (to find column family).

This can be relaxed by providing the database from _generator to _consumer
directly, without using the storage_proxy in between.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200207112427.18419-1-xemul@scylladb.com>
2020-02-07 13:30:01 +02:00
Pavel Emelyanov
00746d6a16 dht: Use const reference for token_metadata arg
Two places in dht code have token_metadata _value_ arguments, but only read
tokens from them. Optimize it a bit by turning values into const references.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200207112408.18352-1-xemul@scylladb.com>
2020-02-07 13:30:00 +02:00
Avi Kivity
5950a9e37f .dockerignore: add testlog
testlog files are not used when preparing the frozen toolchain,
and can be very large, so ignore them in order to speed up the
docker build.
2020-02-07 08:59:39 +01:00
Gleb Natapov
ff88ff880b lwt: use cached truncation record instead of querying the database
Message-Id: <20200206163838.5220-3-gleb@scylladb.com>
2020-02-06 18:15:48 +01:00
Gleb Natapov
20bf3800f3 database: cache truncation time in table objects
Truncation time is used on each LWT request now, so reading it from
the table is too heavy an operation to be on a fast path. It also requires
jumping to a shard that contains corresponding data. This patch caches
the data on the table object of each shard for easy access. The cache is
initialized during boot from system.truncated table and updated on each
truncation operation.
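
A rough Python model of the caching scheme, under stated assumptions: `TableOnShard`, `init_cache_at_boot` and `truncate` are hypothetical names, and a plain dict stands in for the system.truncated table.

```python
class TableOnShard:
    """One shard's table object; truncation_time is the cached value."""
    def __init__(self, name):
        self.name = name
        self.truncation_time = None


def init_cache_at_boot(shards, truncated):
    # `truncated` stands in for the rows of the system.truncated table.
    for table in shards:
        table.truncation_time = truncated.get(table.name)


def truncate(shards, truncated, name, now):
    # Persist the new truncation time, then refresh every shard's cached copy.
    truncated[name] = now
    for table in shards:
        if table.name == name:
            table.truncation_time = now


shards = [TableOnShard("ks.t") for _ in range(4)]   # four shards, one table
truncated = {"ks.t": 100}
init_cache_at_boot(shards, truncated)
at_boot = [t.truncation_time for t in shards]
truncate(shards, truncated, "ks.t", 250)
after_truncate = [t.truncation_time for t in shards]
```

Reading `truncation_time` on the local shard then needs neither a table read nor a jump to another shard.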
Message-Id: <20200206163838.5220-2-gleb@scylladb.com>
2020-02-06 18:15:48 +01:00
Takuya ASADA
5d82fcf944 dist/ami: use prebuilt rpms on --localrpm
We made the --localrpm option automatically build rpms from source code,
but we actually use the option to produce an AMI from prebuilt rpms on our
CI.
To simplify the script, and to prevent accidentally starting an rpm build
in the script, drop the rpm build part.
2020-02-06 18:41:52 +02:00
Amnon Heiman
687e554737 api/storage_service: use stream in get_snapshots
get_snapshots should use an http stream to reduce memory allocation and
stalls.

This patch changes the implementation so that it streams each
snapshot object instead of creating a single response and returning it.
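
The difference between the two approaches can be illustrated with a Python generator. This is a sketch of the streaming idea only, not the Seastar HTTP API; `stream_json_array` and the sample snapshot objects are hypothetical.

```python
import json


def stream_json_array(items):
    """Yield a JSON array piece by piece instead of building one big string."""
    yield "["
    first = True
    for item in items:
        if not first:
            yield ","
        first = False
        yield json.dumps(item)   # only one element is serialized at a time
    yield "]"


snapshots = [{"key": "snap1"}, {"key": "snap2"}]
chunks = list(stream_json_array(snapshots))
reassembled = json.loads("".join(chunks))
```

Each chunk can be written to the connection as soon as it is produced, so memory use is bounded by the largest single element rather than by the whole response.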

Fixes #5468

Depends on scylladb/seastar#723
2020-02-06 18:40:37 +02:00
Takuya ASADA
c44f347886 SCYLLA-VERSION-GEN: skip updating version files when git hash unchanged
Our build system builds the relocatable package multiple times on the
same revision of the repository, and executes ./SCYLLA-VERSION-GEN each time.
When the build job is invoked around midnight and has not finished by 12:00AM,
the first and last builds get different SCYLLA-RELEASE-FILE contents, since
the file contains the current date.
To prevent this, skip updating SCYLLA-*-FILE when the git hash is unchanged.
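
The check can be sketched in Python as below. The real script is a shell script; the `<date>.<hash>` file format, the function name and the sample hash are assumptions made for this illustration.

```python
import tempfile
from pathlib import Path


def update_release_file(path, git_hash, today):
    """Write "<date>.<hash>" unless the file already records this hash.

    Returns True when the file was (re)written.
    """
    path = Path(path)
    if path.exists() and path.read_text().strip().endswith("." + git_hash):
        return False                      # same revision: keep the old date
    path.write_text(f"{today}.{git_hash}\n")
    return True


with tempfile.TemporaryDirectory() as tmp:
    release = Path(tmp) / "SCYLLA-RELEASE-FILE"
    first = update_release_file(release, "abc1234", "20200205")
    second = update_release_file(release, "abc1234", "20200206")  # past midnight
    content = release.read_text().strip()
```

The second call is a no-op, so every build of the same revision sees the date recorded by the first one.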

Fixes scylladb/scylla-pkg#826
2020-02-06 18:36:46 +02:00
Botond Dénes
05116ba963 reader_concurrency_semaphore: make signal() noexcept
Currently reader_concurrency_semaphore::signal() can fail. This is
dangerous in two ways:
* It is called from constructors, so the exception can bring down the
  node. This will convert an `std::bad_alloc` to a crash.
* Reads in the queue will be blocked until they either time-out, or
  another `signal()` succeeds.

To solve this, wrap the `reader_permit` constructor, the only code that
can throw, with try-catch and forward the exception to the reader
admission promise. In practice this will result in the flushing of the
reader queue, when we fail to admit a read.
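
A minimal Python analogue of the pattern, assuming a future plays the role of the reader admission promise. `Permit` and `admit` are hypothetical names, and `MemoryError` stands in for `std::bad_alloc`.

```python
from concurrent.futures import Future


class Permit:
    """Stand-in for reader_permit; construction may fail on allocation."""
    def __init__(self, fail=False):
        if fail:
            raise MemoryError("could not allocate permit")


def admit(waiter, fail=False):
    """Never raises: a construction failure is delivered through the future."""
    try:
        permit = Permit(fail)
    except MemoryError as exc:
        waiter.set_exception(exc)   # the waiting read fails, the caller does not
    else:
        waiter.set_result(permit)


ok, bad = Future(), Future()
admit(ok)
admit(bad, fail=True)
```

The caller of `admit` sees no exception in either case; only the read waiting on `bad` observes the failure.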

Fixes #5741
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200206154238.707031-1-bdenes@scylladb.com>
2020-02-06 17:51:03 +02:00
Botond Dénes
434d32befe reader_permit: tidy up reader_permit::memory_units
This patch is a bag of fixes/cleanups that were omitted from the reader
memory tracking series due to contributor error. It contains the
following changes:
* Get rid of unused `increase()` and `decrease()` methods.
* Make all constructors and assignment operators `noexcept`.
* Make move assignment operator safe w.r.t. self assignment.
* `reset()`: consume the new amount before releasing the old amount,
  to prevent a transient window where new readers might be admitted.
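
The ordering in the last item can be shown with a toy semaphore. This is a Python sketch with hypothetical classes, not the reader_permit code; the model lets the available count go negative for simplicity.

```python
class Semaphore:
    def __init__(self, available):
        self.available = available
        self.history = []           # value of `available` after every change

    def consume(self, n):
        self.available -= n
        self.history.append(self.available)

    def signal(self, n):
        self.available += n
        self.history.append(self.available)


class MemoryUnits:
    def __init__(self, sem, amount):
        self._sem = sem
        self._amount = amount
        sem.consume(amount)

    def reset(self, new_amount):
        self._sem.consume(new_amount)   # take the new amount first ...
        self._sem.signal(self._amount)  # ... and only then return the old one
        self._amount = new_amount


sem = Semaphore(100)
units = MemoryUnits(sem, 60)   # available: 40
units.reset(80)                # available: -40, then 20
```

Releasing first would briefly show 100 available units, a window in which a waiting reader could be admitted; consuming first never shows more than the final value.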

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200206143007.633069-1-bdenes@scylladb.com>
2020-02-06 16:35:07 +02:00
Piotr Sarna
757c1cf91e Merge ' Remove unnecessary schema copies' from Piotr
Most of the time schema does not have to be copied and sometimes it's not even used.

tests: unit(dev)
Closes #5739

* hawk/remove_schema_copies:
  multishard_mutation_query_test: stop capturing unused schema
  index_reader: avoid copying schema to lambda
2020-02-06 15:20:24 +01:00
Piotr Jastrzebski
d1fe75edbc multishard_mutation_query_test: stop capturing unused schema
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-06 14:18:50 +01:00
Piotr Jastrzebski
8813a6ca2a index_reader: avoid copying schema to lambda
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-06 14:10:58 +01:00
Nadav Har'El
abdbb70ad9 Allow configuring alternator write isolation
Merged patch series from Piotr Sarna:

This series adds a way to configure alternator write isolation policy
per-table with the use of tags.
Instead of hardcoded LWT_ALWAYS policy, it can now be set by tagging
a table with a tag of the following form:
{
  'Key': 'system:write_isolation',
  'Value': X
},
where X is one of the following implemented levels:
 * 'f' - forbid RMW
 * 'a' - always enforce RMW
 * 'o' - only RMW writes will go through LWT
 * 'u' - unsafe RMW (to be deprecated/eradicated)

By default, if no tag is found, alternator falls back to always applying
LWT to writes.
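
How a policy might be derived from a table's tags, sketched in Python. The tag key, the four letters and the fallback come from the text above; the function name, the data layout and the choice to reject unknown values are assumptions of this sketch.

```python
WRITE_ISOLATION_TAG = "system:write_isolation"

# Letters and descriptions as listed in the series description.
LEVELS = {
    "f": "forbid RMW",
    "a": "always enforce RMW",
    "o": "only RMW writes will go through LWT",
    "u": "unsafe RMW",
}
DEFAULT_LEVEL = "a"   # no tag: fall back to always applying LWT


def write_isolation(tags):
    """Return the isolation letter for a list of {'Key', 'Value'} tags."""
    for tag in tags:
        if tag["Key"] == WRITE_ISOLATION_TAG:
            value = tag["Value"]
            if value not in LEVELS:
                # Assumption: unknown values are rejected, not ignored.
                raise ValueError(f"unknown write isolation level: {value!r}")
            return value
    return DEFAULT_LEVEL


tagged = write_isolation([{"Key": "system:write_isolation", "Value": "o"}])
untagged = write_isolation([{"Key": "owner", "Value": "someone"}])
```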

This series also contains fixes to the tagging interface - some minor
issues came up while implementing write isolation config on top of tags.

test: alternator-test(local,remote)

Piotr Sarna (6):
  alternator: return tags for a table via const reference
  alternator: fix overwriting tags
  alternator: make _write_isolation a protected attribute
  alternator: add configuring write isolation policy via tags
  alternator-test: add testing different write isolation policies
  docs: update alternator on write isolation

 alternator-test/test_condition_expression.py | 63 ++++++++++++++
 alternator-test/test_tag.py                  | 25 ++++++
 alternator/executor.cc                       | 89 +++++++++++++-------
 docs/alternator/alternator.md                | 21 +++--
 4 files changed, 162 insertions(+), 36 deletions(-)
2020-02-06 12:37:19 +02:00
Nadav Har'El
8b6925790f Reduce usage of global_partitioner()
Merged pull request https://github.com/scylladb/scylla/pull/5733 from
Piotr Jastrzębski:

In many places we use global_partitioner() to obtain parameters that are
available in config. This PR replaces a number of global_partitioner() calls
with equivalent non-global ways.

tests: unit(dev)

* 'reduce_global_usage' of github.com:haaawk/scylla:
  storage_service: reduce number of global_partitioner calls
  cdc: remove partitioner from db_context
  gossiper: stop calling global_partitioner()
  system_keyspace: stop calling global_partitioner()
  transport/server: stop calling global_partitioner()
  thrift: stop calling global_partitioner()
  partitioner: move cpu_sharding_algorithm_name to token-sharding.hh
2020-02-06 12:10:38 +02:00
Piotr Sarna
9ac35b9367 docs: update alternator on write isolation
Docs are appended with information on write isolation - which levels
are implemented in alternator and how to configure them properly.
2020-02-06 10:26:26 +01:00
Piotr Sarna
4d3b8e3b5a alternator-test: add testing different write isolation policies
Additional testing is done via:
1. Checking that permissive isolation levels ('a', 'o', 'u') allow
   conditional writes
2. Checking that 'f' isolation level (forbid rmw) works as expected:
   - read-modify-write requests are forbidden
   - non-rmw writes are allowed
2020-02-06 10:26:26 +01:00
Piotr Sarna
4a9536b7c1 alternator: add configuring write isolation policy via tags
Until now, write isolation policy was hardcoded to always enforcing LWT.
From now on, setting a tag via UpdateTags request or during table
creation will associate a policy with given table.
The tag key is 'system:write_isolation' and its value can be one of:
 * 'f' - forbid RMW
 * 'a' - always enforce RMW
 * 'o' - only RMW writes will go through LWT
 * 'u' - unsafe RMW (to be deprecated/eradicated)
2020-02-06 10:26:26 +01:00
Piotr Sarna
0479a1bf67 alternator: make _write_isolation a protected attribute
No useful semantic changes yet, but it will help produce better
diffs for future patches.
2020-02-06 10:04:34 +01:00
Piotr Sarna
51c14cb1ce alternator: fix overwriting tags
Tagging a resource with a tag key that already exists should result
in overwriting the old value. It wasn't the case, so it's now fixed
and an appropriate test is added.
2020-02-06 10:04:34 +01:00
Piotr Sarna
ed940f000d alternator: return tags for a table via const reference
The signature of the helper function is changed, so that it's possible
to acquire a const reference of the tags, instead of being forced
to get a copy of the whole map (potentially large).
2020-02-06 10:04:34 +01:00
Piotr Sarna
f4b6f0956b Merge "Pending Alternator patches" from Nadav
Here is a rebase of some of my already-reviewed Alternator patches -
the final piece of the fix to LWT timestamps (in BatchWriteItems),
the "/localnodes" request, and a couple of patches reducing the number
of times that the global storage_proxy is needed.

Also available in a github branch, git@github.com:nyh/scylla.git series1

* nyh/series1:
  redis: remove redundant code
  storage_proxy: make it into a peering sharded service
  alternator: use simpler API for registering Alternator's HTTP URLs
  alternator-test: test "/localnodes" feature
  alternator: add public API for list of nodes in current DC
  alternator: use LWT timestamp - in BatchWriteItems too
2020-02-06 09:48:10 +01:00
Juliusz Stasiewicz
20f7b1b0ad tests: add test for CDC schema extension
Test for functionality added in #5720.
Refs #5589
2020-02-06 09:26:13 +01:00
Piotr Jastrzebski
9bfd3dc311 storage_service: reduce number of global_partitioner calls
Replace global_partitioner().sharding_ignore_msb() call
with config::murmur3_partitioner_ignore_msb_bits()

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-06 08:00:34 +01:00
Piotr Jastrzebski
97262bec82 cdc: remove partitioner from db_context
partitioner from cdc::db_context is no longer used
so it can be removed.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-06 08:00:01 +01:00
Piotr Jastrzebski
61d8308848 gossiper: stop calling global_partitioner()
Obtain name of the default partitioner from config
instead of a global.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-06 07:59:07 +01:00
Piotr Jastrzebski
8b4ec5b1d2 system_keyspace: stop calling global_partitioner()
Obtain name of default partitioner from config
instead of a global.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-06 07:58:07 +01:00
Piotr Jastrzebski
d3d6547889 transport/server: stop calling global_partitioner()
Obtain SCYLLA_SHARDING_IGNORE_MSB and SCYLLA_PARTITIONER
from config instead of a global.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-06 07:57:06 +01:00
Piotr Jastrzebski
dde8c7df00 thrift: stop calling global_partitioner()
Replace global_partitioner().name() call with
config::partitioner().

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-06 07:55:54 +01:00
Piotr Jastrzebski
8817a62499 partitioner: move cpu_sharding_algorithm_name to token-sharding.hh
Sharding logic has been moved to token-sharding.hh some time ago.
This logic does not depend on partitioner any more so cpu_sharding_algorithm_name
can be safely moved to the header where rest of sharding logic lives.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-06 07:53:45 +01:00
Nadav Har'El
3f27b070e7 redis: remove redundant code
In one place, we already had a "proxy" object, but still asked for it
again. Remove the redundant line.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2020-02-05 21:14:18 +02:00
Nadav Har'El
9fd9ec14c2 storage_proxy: make it into a peering sharded service
We consider globals like service::get_storage_proxy() a bad idea,
and would like to reduce their use as much as possible - and eventually,
eliminate it completely.

One easy case to fix is when we already have a shard-local proxy,
but now we need the sharded object, to invoke_on() something on it.

In this patch, we turn storage_proxy into a peering_sharded_service.
This means that if you already have a storage_proxy, you can call
its container() function to get the sharded<storage_proxy>, without
needing to call the global service::get_storage_proxy().

We found a few such cases in storage_proxy itself, and in Alternator,
and fixed them to use container() instead of the global function.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2020-02-05 21:14:18 +02:00
Nadav Har'El
b262eb5031 alternator: use simpler API for registering Alternator's HTTP URLs
We used the Seastar HTTP server's add() method to register URLs to
serve (so-called "routes"), but as suggested by Amnon, when we have
fixed URLs without parameters being path components, it's simpler
to use the put() method to do the same thing - and also results in
slightly less work at run-time to look up these routes.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2020-02-05 21:14:18 +02:00
Nadav Har'El
9de26b73a4 alternator-test: test "/localnodes" feature
This is a partial test for the "/localnodes" request, which is supposed to
return the list of live nodes in this DC. Because of the limitations of our
current alternator-test framework (which should work on any pre-existing
cluster), we don't know what to expect as a reply, but we just verify the
minimum: The request is understood, returns a JSON list, which contains
at least one item.

As "/localnodes" is a Scylla-only feature, this test is skipped when
running with "--aws".

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2020-02-05 21:14:18 +02:00
Nadav Har'El
3fecf6f641 alternator: add public API for list of nodes in current DC
If we want to balance the Alternator request load among the different nodes
(Refs #5030), the load balancer - whether it uses HTTP load balancing or
DNS - needs to be able to get an up-to-date list of live nodes to which it
can direct Alternator traffic. This list should include only the live nodes
in the same data center (geographical region) - it is expected that a
separate load balancer will be installed in each data center, and clients
from within this data center will reach this data center's load balancer.

There are multiple APIs in current Scylla to do something similar to what
we need, but as far as I know, none of them is exactly what we need or
convenient for Alternator installations: We don't want the load balancer
to use CQL, and the REST API http://localhost:10000/gossiper/endpoint/live/
doesn't do what we need (it doesn't restrict the list to one data center)
plus it's not open to connections outside the machine.

So in this patch, we implement a new HTTP request on the Alternator port -
"/localnodes", returning a JSON-formatted list of all live nodes in the
contacted node's data center:

   $ curl http://localhost:8000/localnodes
   ["127.0.0.2","127.0.0.1","127.0.0.3"]

Like the existing health check HTTP request, this operation is public and
unauthenticated. We consider the security risk low - it allows an attacker
to enquire the list of Scylla nodes in this DC, but an attacker can achieve
the same thing by just scanning the addresses in this subnet using the health
check request (or even with ordinary DynamoDB API requests).
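
A load balancer or client could consume this request roughly as follows. This is a sketch using only the Python standard library; the function names are hypothetical, and the base URL is the one from the curl example above.

```python
import json
from urllib.request import urlopen


def parse_localnodes(body):
    """Parse the "/localnodes" reply: a non-empty JSON list of addresses."""
    nodes = json.loads(body)
    if not isinstance(nodes, list) or not nodes:
        raise ValueError("expected a non-empty JSON list of nodes")
    return nodes


def fetch_localnodes(base_url="http://localhost:8000"):
    """Ask one Alternator node for the live nodes in its data center."""
    with urlopen(base_url + "/localnodes") as resp:
        return parse_localnodes(resp.read().decode("utf-8"))


# Parsing the sample reply shown above (no server needed for this part).
sample = parse_localnodes('["127.0.0.2","127.0.0.1","127.0.0.3"]')
```

Calling `fetch_localnodes()` periodically would keep the balancer's node list current; it needs a running Alternator node, so it is not invoked here.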

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2020-02-05 21:14:18 +02:00
Nadav Har'El
95351016fd alternator: use LWT timestamp - in BatchWriteItems too
A previous patch fixed Alternator's writes to use the timestamp provided
by LWT instead of the current timestamp. That patch fixed the PutItem,
DeleteItem and UpdateItem operations - and this patch fixes the remaining
write operation: BatchWriteItems. So,

Fixes #5653.

Unfortunately, the requirements of both BatchWriteItems and LWT make the
resulting code - and this patch - somewhat inelegant. BatchWriteItems
requires that we prepare all the operations first - failing if any of them
has an error. Before this patch, the result of this preparation was an
array of mutations, which in a second step we wrote to the database.
But we can no longer use mutations for the result of the first step,
because creating a mutation requires knowing the timestamp, which we don't
know during the preparation phase - we will only know it during the later
LWT operation. So now we need to invent a new intermediate format between
the request and the mutation. This intermediate format is further
complicated by the need to send it between shards (for LWT's shard
forwarding) so it cannot, for example, contain a reference to a schema.
The fact that different sub-operations need to be sent to different shards,
and that different sub-operations may write to different tables, further
complicates the book-keeping and gives us a bunch of funky-typed maps.
But eventually it all fits together.
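
The two-phase shape described above can be sketched in Python. This is a loose illustration, not the Alternator implementation: `PutOrDelete`, `prepare`, `to_mutation` and the request layout are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PutOrDelete:
    """Intermediate format: plain values only, no timestamp, no schema."""
    table: str
    key: str
    item: Optional[dict]   # None means DeleteItem


def prepare(requests):
    """Phase 1: validate everything up front; any error fails the batch."""
    ops = []
    for req in requests:
        if "table" not in req or "key" not in req:
            raise ValueError(f"malformed sub-operation: {req!r}")
        ops.append(PutOrDelete(req["table"], req["key"], req.get("item")))
    return ops


def to_mutation(op, timestamp):
    """Phase 2: the timestamp is known only now, so the mutation is built here."""
    return (op.table, op.key, op.item, timestamp)


ops = prepare([
    {"table": "t1", "key": "k1", "item": {"a": 1}},   # PutItem
    {"table": "t2", "key": "k2"},                     # DeleteItem
])
mutations = [to_mutation(op, timestamp=1581000000) for op in ops]
```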

After this patch, as before this patch, the same code (now called
put_or_delete_item), is used to implement both the PutItem and DeleteItem
stand-alone operations, and the BatchWriteItems operation which includes a
whole list of these PutItem and DeleteItem operations.

This patch also includes two more tests in test_batch.py, which test two
more corner cases we haven't tested before: One tests the capability of
BatchWriteItems to write to more than one table. The other tests that
BatchWriteItems can write an empty item (it is not surprising that it does,
but we do have special code for this case, so we should test it).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2020-02-05 21:14:18 +02:00
Avi Kivity
27b36beb4a Update seastar submodule
* seastar 30185fd901...1c7bccc500 (8):
  > reactor: rename kernel_completion::set_value to complete_with
  > net: Remove unused member variables
  > net: Fix global buffer overflow around rss_key_type
  > reactor: remove kernel_completion::set_promise
  > Merge "generalize the io_desc (now kernel_completion)" from Glauber
  > everywhere: Disable -Wmisleading-indentation around ragel generated code
  > core: Make when_all_state_component final
  > io_tester: Remove unused lambda capture
2020-02-05 20:20:43 +02:00
Gleb Natapov
ff696682ed add missing include to timestamp.hh
The file uses std::string but does not include the <string> header. My compiler
complains.

Message-Id: <20200205085739.GN26048@scylladb.com>
2020-02-05 19:42:18 +02:00
Avi Kivity
e719ea1bba Merge "Fix assert on initialization error" (in large_data_handler) from Rafael
"
This series fixes an assertion when initialization fails after
creating a database. I don't know of a case where that currently
happens, but it is easy to cause that when writing a patch and the
produced assert is just confusing.
"

* 'espindola/dont-assert-on-init-error' of https://github.com/espindola/scylla:
  db: Replace large_data_handler::_stopped with _running
  db: Move nop_large_data_handler constructor out-of-line
  db: Move large_data_handler::stop out-of-line
2020-02-05 18:49:11 +02:00
Juliusz Stasiewicz
5127568cc4 cdc: cdc per-table options put into schema extensions
With this patch, client tools (in particular cqlsh) get the access
to cdc options and will be able to print them with `DESC TABLE`.

Fixes #5589
2020-02-05 13:44:39 +01:00
Piotr Sarna
ee244a6d22 Merge 'Make it clear that memory_footprint_test has to be run with -c1' from Piotr
This test fails when run on more than 1 core.

Tests: unit(dev)

* hawk/fix_memory_footprint:
  memory_footprint_test: Make it clear it has to run with -c1
  tests: move memory_footprint_test to perf/
2020-02-05 12:09:50 +01:00
Avi Kivity
31593e1451 Merge "Change token representation to int64_t" from Piotr
"
After deprecating partitioners other than Murmur3 we can change the representation of tokens to int64_t. This will allow setting a custom partitioner on each table. With this change partitioners become just converters from partition keys to tokens (int64_t). The following operations are no longer dependent on the partitioner implementation:

 - Tokens comparison
 - Tokens serialization/deserialization to strings
 - Tokens serialization/deserialization to bytes
 - Sharding logic
 - Random token generation

This change will be followed by a PR that enables per table partitioner and then another PR that introduces a special partitioner for CDC tables.
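
Why these operations stop depending on the partitioner can be seen in a small Python sketch: once a token is just a signed 64-bit integer, they are plain integer operations. The helper names are hypothetical, and the big-endian 8-byte encoding is an assumption of this sketch, not something stated in the series.

```python
import struct


def token_to_bytes(token):
    # Assumed encoding: signed 64-bit, big-endian.
    return struct.pack(">q", token)


def token_from_bytes(data):
    return struct.unpack(">q", data)[0]


def token_to_string(token):
    return str(token)


def token_from_string(text):
    return int(text)


def tri_compare(a, b):
    """Return -1, 0 or 1, using nothing but integer comparison."""
    return (a > b) - (a < b)


t = -(2 ** 63) + 1
round_trip_bytes = token_from_bytes(token_to_bytes(t))
round_trip_string = token_from_string(token_to_string(t))
```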

Tests: unit(dev)

Results of memory footprint test:

Differences:

in cache: 992 vs 984
in memtable: 750 vs 742
sizeof(cache_entry) = 112 vs 104
-- sizeof(decorated_key) = 36 vs 32
MASTER:
mutation footprint:

in cache: 992
in memtable: 750
in sstable: 351
frozen: 540
canonical: 827
query result: 342
sizeof(cache_entry) = 112
-- sizeof(decorated_key) = 36
-- sizeof(cache_link_type) = 32
-- sizeof(mutation_partition) = 96
-- -- sizeof(_static_row) = 8
-- -- sizeof(_rows) = 24
-- -- sizeof(_row_tombstones) = 40

sizeof(rows_entry) = 232
sizeof(lru_link_type) = 16
sizeof(deletable_row) = 168
sizeof(row) = 112
sizeof(atomic_cell_or_collection) = 8

THIS PATCHSET:
mutation footprint:

in cache: 984
in memtable: 742
in sstable: 351
frozen: 540
canonical: 827
query result: 342
sizeof(cache_entry) = 104
-- sizeof(decorated_key) = 32
-- sizeof(cache_link_type) = 32
-- sizeof(mutation_partition) = 96
-- -- sizeof(_static_row) = 8
-- -- sizeof(_rows) = 24
-- -- sizeof(_row_tombstones) = 40

sizeof(rows_entry) = 232
sizeof(lru_link_type) = 16
sizeof(deletable_row) = 168
sizeof(row) = 112
sizeof(atomic_cell_or_collection) = 8
"

* 'fixed_token_representation' of https://github.com/haaawk/scylla: (21 commits)
  token: cast to int64_t not long in long_token
  murmur3: move sharding logic to token and i_partitioner
  partitioner: move shard_of_minimum_token to token
  partitioner: remove token_to_bytes
  partitioner: move get_token_validator to token
  partitioner: merge tri_compare into dht::tri_compare
  partitioner: move describe_ownership to token
  partitioner: move from_bytes to token
  partitioner: move from_string to token
  partitioner: move to_sstring to token
  partitioner: move get_random_token to token
  partitioner: move midpoint function to token
  token: remove token_view
  sstables: use copy constructor for tokens
  token: change _data to int64_t
  partitioner: remove hash_large_token
  token: change data to array<uint8_t, 8>
  partitioner: Extract token to separate .hh and .cc files
  partitioner: remove unused functions
  Revert "dht/murmur3_partitioner: take private methods out of the class"
  ...
2020-02-05 12:21:02 +02:00
Piotr Jastrzebski
edd7398a0c memory_footprint_test: Make it clear it has to run with -c1
The test fails when run on a number of cores other than 1.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-05 10:22:32 +01:00
Piotr Jastrzebski
1a8fe4befd tests: move memory_footprint_test to perf/
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-05 10:18:28 +01:00
Piotr Jastrzebski
6d24f26ff7 token: cast to int64_t not long in long_token
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-05 09:31:32 +01:00
Piotr Jastrzebski
50cfe81331 murmur3: move sharding logic to token and i_partitioner
Since the token representation is fixed now, all the partitioners
will share the sharding logic. It makes sense now to keep
the logic in a common superclass and a separate header that's
included only in i_partitioner.cc.

shard_of and token_for_next_shard are now implemented in
i_partitioner. They would be non-virtual but we have to
keep them virtual because one test is overriding them
to enforce some specific sharding.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-05 09:31:32 +01:00
Piotr Jastrzebski
7eab3024bd partitioner: move shard_of_minimum_token to token
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-05 09:31:32 +01:00
Piotr Jastrzebski
9c55e5be13 partitioner: remove token_to_bytes
i_partitioner::token_to_bytes is just a call to
token::data and does not depend on partitioner
at all. It is possible to convert token to bytes
without having access to partitioner.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-05 09:31:32 +01:00
Piotr Jastrzebski
d4d55160f0 partitioner: move get_token_validator to token
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-05 09:31:32 +01:00
Piotr Jastrzebski
2c630c5820 partitioner: merge tri_compare into dht::tri_compare
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-05 09:31:32 +01:00
Piotr Jastrzebski
d0d8bfaf8c partitioner: move describe_ownership to token
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-05 09:31:32 +01:00
Piotr Jastrzebski
f845220445 partitioner: move from_bytes to token
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-05 09:31:32 +01:00
Piotr Jastrzebski
8107d99e3d partitioner: move from_string to token
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-05 09:31:32 +01:00
Piotr Jastrzebski
03bdce2d68 partitioner: move to_sstring to token
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-05 09:31:32 +01:00
Piotr Jastrzebski
9c202b52da partitioner: move get_random_token to token
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-05 09:31:32 +01:00
Piotr Jastrzebski
f42b1ee819 partitioner: move midpoint function to token
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-05 09:31:32 +01:00
Piotr Jastrzebski
1d1ac476c3 token: remove token_view
Now that both token and token_view contain int64_t
it makes no sense to keep the view.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-05 09:31:32 +01:00
Piotr Jastrzebski
06dfd16aad sstables: use copy constructor for tokens
instead of manually creating new token from another
token internals.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-05 09:31:32 +01:00
Piotr Jastrzebski
05e0451b27 token: change _data to int64_t
Previously _data was stored as array of 8 bytes in
network byte order.
After this change it stores the same value in int64_t
in host byte order.
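
For illustration only, a minimal sketch of the two representations. The
helper names token_from_network_bytes and token_to_network_bytes are
hypothetical and are not taken from the patch; the sketch only shows that
8 bytes in network byte order and an int64_t in host byte order carry the
same value.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper: decode 8 bytes stored in network (big-endian)
// byte order into an int64_t. Shifting makes this independent of the
// host's endianness.
inline int64_t token_from_network_bytes(const std::array<uint8_t, 8>& b) {
    uint64_t v = 0;
    for (uint8_t byte : b) {
        v = (v << 8) | byte;
    }
    // Unsigned-to-signed conversion is modular on two's-complement targets
    // (guaranteed since C++20).
    return static_cast<int64_t>(v);
}

// Hypothetical helper: encode an int64_t back into 8 big-endian bytes.
inline std::array<uint8_t, 8> token_to_network_bytes(int64_t t) {
    auto v = static_cast<uint64_t>(t);
    std::array<uint8_t, 8> b{};
    for (int i = 7; i >= 0; --i) {
        b[static_cast<std::size_t>(i)] = static_cast<uint8_t>(v & 0xff);
        v >>= 8;
    }
    return b;
}
```

Round-tripping a value through both helpers should return the original
value, which is the property the change relies on.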

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-05 09:31:32 +01:00
Piotr Jastrzebski
fea0187f55 partitioner: remove hash_large_token
Now that token representation is always array<uint8_t, 8>,
hash<dht::token> will always pick
read_le<size_t>(reinterpret_cast<const char*>(b.data()))
and never call hash_large_token because the check
is always true b.size() == sizeof(size_t).

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-05 09:31:32 +01:00
Piotr Jastrzebski
b569d127a0 token: change data to array<uint8_t, 8>
It is safe to make this change because we support only
Murmur3Partitioner which uses only tokens that are
8 bytes long.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-05 09:30:46 +01:00
Piotr Jastrzebski
0da21c28ab partitioner: Extract token to separate .hh and .cc files
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-05 09:18:24 +01:00
Piotr Jastrzebski
8bd9d3a69e partitioner: remove unused functions
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-05 09:18:24 +01:00
Piotr Jastrzebski
d86548c06e Revert "dht/murmur3_partitioner: take private methods out of the class"
This patch conflicts with the following patches.
The final effect is equivalent and it's easier to revert this patch
and cleanly apply already reviewed patches.

This reverts commit f4f8593bac.
2020-02-05 09:18:24 +01:00
Piotr Jastrzebski
08036fc511 murmur3_partitioner: get rid of static shard_of
This will enable revert of a commit that creates conflicts
with following patches.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-05 09:18:24 +01:00
Rafael Ávila de Espíndola
5d4671526c db: Replace large_data_handler::_stopped with _running
This is not just a direct flip to a variable with the negated Boolean
value. When created, a large_data_handler is not considered to be
running, the user has to call start() before it can be used.

The advantage of doing this is that if initialization fails and a
database is destructed before the large_data_handler is started, the
assert

database::stop() {
    assert(!_large_data_handler->running());

is not triggered.
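
A minimal sketch of the difference, under stated assumptions: both
classes below are hypothetical stand-ins, not the real
large_data_handler, and the old code is assumed to have checked a
"stopped" flag that starts out false.

```cpp
#include <cassert>

// Hypothetical old style: a freshly constructed handler is "not stopped",
// so a shutdown path asserting stopped() fails if start-up never finished.
class stopped_flag_handler {
    bool _stopped = false;
public:
    void stop() { _stopped = true; }
    bool stopped() const { return _stopped; }
};

// Hypothetical new style: a freshly constructed handler is "not running",
// so a shutdown path asserting !running() holds even if start() was
// never called.
class running_flag_handler {
    bool _running = false;
public:
    void start() { _running = true; }
    void stop() { _running = false; }
    bool running() const { return _running; }
};
```

With the second class, destroying the owner right after construction
satisfies the !running() check without an extra stop() call.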

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-02-04 21:15:44 -08:00
Rafael Ávila de Espíndola
33dfe34f78 db: Move nop_large_data_handler constructor out-of-line
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-02-04 21:12:01 -08:00
Rafael Ávila de Espíndola
e99a225f25 db: Move large_data_handler::stop out-of-line
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-02-04 21:11:49 -08:00
Rafael Ávila de Espíndola
9eae0b57a3 test: Enable all experimental features in the cql_repl
The cql repl will hopefully be used to write most new tests, so it
should have all experimental features enabled.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200204173448.95892-1-espindola@scylladb.com>
2020-02-04 19:36:37 +02:00
Avi Kivity
7d70bfe20c Merge "Lua: Fix handling of list<varint> and list<decimal>" from Rafael
"
This patch series fixes #5711, enables UDF support in CQL tests, and
includes a few extra cleanups.
"

* 'espindola/lua-fixes' of https://github.com/espindola/scylla:
  lua: Use a negative index for consistency
  lua: Fix returning list<decimal>
  lua: Fix returning list<varint>
  lua: Use a lua_slice_state instead of a from_lua_visitor
  test: Enable UDF in the cql repl
2020-02-04 18:51:54 +02:00
Nadav Har'El
acafcbfdf4 alternator: use LWT timestamp, not current timestamp
The DynamoDB API doesn't have the notion of client-supplied timestamps,
so the server is supposed to use its own current timestamp for write
operations.

However, for LWT writes, we should not use this node's current time:
Different nodes may slightly differ in their clocks, and LWT needs
a monotonically-increasing notion of time for the consistent operations.
LWT provides to the operation's apply() method the specific timestamp that
it should use in its returned mutation - and we should use this timestamp,
not the current timestamp.

In the optional write modes where LWT is not used, we continue to use
the current timestamp (api::new_timestamp()) as before.

This patch fixes the PutItem, UpdateItem and DeleteItem operations.
The BatchWriteItem operation is not yet fixed by this patch - fixing
it will require more elaborate code changes so will be done in a
separate patch.

Refs #5653.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200130122853.7658-1-nyh@scylladb.com>
2020-02-04 10:18:49 +01:00
Nadav Har'El
0a23471eae alternator: switch BatchWriteItems to use LWT too
Today, we use LWT for all PutItem, UpdateItem and GetItem operations.
We do this even for pure writes (writes which do not involve a read
before the write).

But BatchWriteItem also does pure writes - and it doesn't use LWT yet.
So this patch changes it so it does. As before we keep in the code -
not yet configurable by a user - also the option to do these unconditional
writes without LWT.

A BatchWriteItem may change multiple partitions (but a fairly low number -
DynamoDB allows each BatchWriteItem to only do 25 updates) and we start the
different LWT operations in parallel.

This patch collects multiple mutations to the same partition together to
be done with a single LWT operation, so we also add a test for this case,
where we have a batch of writes involving several items in each of several
partitions.
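
The grouping step can be sketched as follows. This is an illustration
only: write_sketch and group_by_partition are hypothetical names, and the
real code operates on mutations rather than strings.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical simplified write: a partition key plus an item label.
struct write_sketch {
    std::string partition_key;
    std::string item;
};

// Collect all writes that target the same partition, so that each
// partition can be handled by a single (hypothetical) LWT operation.
inline std::map<std::string, std::vector<std::string>>
group_by_partition(const std::vector<write_sketch>& batch) {
    std::map<std::string, std::vector<std::string>> groups;
    for (const auto& w : batch) {
        groups[w.partition_key].push_back(w.item);
    }
    return groups;
}
```

Each entry of the returned map would then correspond to one operation
started in parallel with the others.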

Fixes #5637

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200128160538.11775-1-nyh@scylladb.com>
2020-02-04 10:08:18 +01:00
Rafael Ávila de Espíndola
6764316576 cql3: Simplify maybe_quote
This produces code that is just as fast as the previous implementation
and is quite a bit easier to read IMHO.

I benchmarked it by temporarily adding:

BOOST_AUTO_TEST_CASE(bench_maybe_quote) {
    std::string val(1 << 20, 'x');
    using clk = std::chrono::steady_clock;
    cql3::util::maybe_quote(val);
    auto start = clk::now();
    for (int i = 0; i < 1000; ++i) {
        cql3::util::maybe_quote(val);
    }
    auto end = clk::now();
    std::chrono::duration<double> duration = end - start;
    std::cout << "delta = " << duration.count() << '\n';
}

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>

Message-Id: <20200203225140.180262-1-espindola@scylladb.com>
2020-02-04 10:52:04 +02:00
Avi Kivity
cdecb21b78 Update seastar submodule
* seastar 65980a9b30...30185fd901 (12):
  > sstring: resize: NulTerminate when downsizing
  > reactor: make open_flags::dsync respect --unsafe-bypass-fsync
  > json/json_elements: Use double quotes around element name
  > Revert "reactor: make open_flags::dsync respect --unsafe-bypass-fsync"
  > Merge "smp: reduce allocations in work_item::process" from Avi
  > task: optimize destruction by making destructor non-virtual
  > reactor: make open_flags::dsync respect --unsafe-bypass-fsync
  > Revert "sstring: resize: NulTerminate when downsizing"
  > sstring: resize: NulTerminate when downsizing
  > tests: Rename unix domain socket test for consistency
  > resource: downgrade cgroupsv2 message.
  > Merge "Simplify the stream/subscription implementation" from Rafael
2020-02-04 10:20:29 +02:00
Nadav Har'El
3de09042bb CDC topology change support
Merged pull request https://github.com/scylladb/scylla/pull/5485
by Kamil Braun:

This series introduces the notion of CDC generations: sets of CDC streams
used by the cluster to choose partition keys for CDC log writes.
Each CDC generation begins operating at a specific time point, called the
generation's timestamp (cdc_streams_timestamp in the code).
It continues being used by all nodes in the cluster to generate log writes
until superseded by a new generation.

Generations are chosen so that CDC log writes are colocated with their
corresponding base table writes, i.e. their partition keys (which are CDC
stream identifiers picked from the generation operating at time of making
the write) fall into the same vnode and shard as the corresponding base
table write partition keys. Currently this is probabilistic and not 100%
of log writes will be colocated - this will change in future commits,
after per-table partitioners are implemented.

CDC generations are a global property of the cluster -- they don't depend
on any particular table's configuration. Therefore the old "CDC stream
description tables", which were specific to each CDC-enabled table,
were removed and replaced by a new, global description table inside the
system_distributed keyspace.

A new generation is introduced and supersedes the previous one whenever
we insert new tokens into the token ring, which breaks the colocation
property of the previous generation. The new generation is chosen to
account for the new tokens and restore colocation. This happens when a
new node joins the cluster.

The joining node is responsible for creating and informing other nodes
about the new CDC generation. It does that by serializing it and inserting
into an internal distributed table ("CDC topology description table").
If it fails the insert, it fails the joining process. It then announces
the generation to other nodes through gossip using the generation's
timestamp, which is the partition key of the inserted distributed table
entry.

Nodes that learn about the new generation through gossip attempt to
retrieve it from the distributed table. This might fail - for example,
if the node is partitioned away from all replicas that hold this
generation's table entry. In that case the node might stop accepting
writes, since it knows that it should send log entries to a new generation
of streams, but it doesn't know what the generation is. The node will keep
trying to retrieve the data in the background until it succeeds or sees
that it is no longer necessary (e.g., because yet another generation
superseded this one). So we give up some availability to achieve safety.
However, this solution is not completely safe (might break consistency
properties): if a node learns about a new generation too late (if gossip
doesn't reach this node in time), the node might send writes to the wrong
(old) generation. In the future we will introduce a transaction-based
approach where we will always make sure that all nodes receive the new
generation before any of them starts using it (and if it's impossible
e.g. due to a network partition, we will fail the bootstrap attempt).
In practice, if the admin makes sure that the cluster works correctly
before bootstrapping a new node, and a network partition doesn't start
in the few seconds window where a new generation is announced, everything
will work as it should.

After the learning node retrieves the generation, it inserts it into an
in-memory data structure called "CDC metadata". This structure is then
used when performing writes to the CDC log -- given the timestamp of the
written mutation, the data structure will return the CDC generation
operating at this time point. CDC metadata might reject the query for
two reasons: if the timestamp belongs to an earlier generation, which
most probably doesn't have the colocation property anymore, or if it is
picked too far away into the future, where we don't know if the current
generation won't be superseded by a different one (so we don't yet know
the set of streams that this log write should be sent to). If the client
uses server-generated timestamps, the query will never be rejected.
Clients can also use client-generated timestamps, but they must make sure
that their clocks are not too desynchronized with the database --
otherwise some or all of their writes to CDC-enabled tables will be
rejected.
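
The lookup rules described above can be sketched roughly as follows.
Everything here is an assumption made for illustration: the class name
cdc_metadata_sketch, the string labels standing in for generations, and
the max_future tolerance parameter are not taken from the series.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <utility>

// Hypothetical in-memory structure: generation start timestamp -> label.
class cdc_metadata_sketch {
    std::map<int64_t, std::string> _gens;
public:
    void insert(int64_t start_ts, std::string label) {
        _gens.emplace(start_ts, std::move(label));
    }

    // Returns the generation operating at write_ts, or nullopt if the
    // write is rejected.
    std::optional<std::string> lookup(int64_t write_ts, int64_t now,
                                      int64_t max_future) const {
        // Reason 2: timestamp too far into the future.
        if (write_ts > now + max_future) {
            return std::nullopt;
        }
        // Generation with the greatest start timestamp <= write_ts.
        auto for_write = _gens.upper_bound(write_ts);
        if (for_write == _gens.begin()) {
            return std::nullopt;  // no generation known for that time
        }
        --for_write;
        // Reason 1: timestamp belongs to a generation earlier than the
        // one operating now.
        auto current = _gens.upper_bound(now);
        if (current != _gens.begin()) {
            --current;
            if (for_write->first < current->first) {
                return std::nullopt;
            }
        }
        return for_write->second;
    }
};
```

A write stamped with the server's own current time falls into the
generation operating now, which matches the statement that
server-generated timestamps are never rejected.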

In the case of rolling upgrade, where we restart nodes that were
previously running without CDC, we act a bit differently - there is no
naturally selected joining node which must propose a new generation.
We have to select such a node using other means. For this we use a bully
approach: every node compares its host id with host ids of other nodes
and if it finds that it has the greatest host id, it becomes responsible
for creating the first generation.
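
As a rough sketch of that selection rule (host ids are modelled as plain
strings and the function name should_create_first_generation is
hypothetical; the real comparison is on host id values):

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// A node is responsible for creating the first generation only if its
// own id is strictly greater than every other node's id.
inline bool should_create_first_generation(
        const std::string& self, const std::vector<std::string>& others) {
    return std::all_of(others.begin(), others.end(),
                       [&](const std::string& other) { return other < self; });
}
```

Because every node evaluates the same rule over the same set of ids, at
most one node concludes that it is responsible.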

This change also fixes the way of choosing values of the "time" column
of CDC log writes: the timeuuid is chosen in a way which preserves
ordering of corresponding base table mutations (the timestamp of this
timeuuid is equal to the base table mutation timestamp).

Warning: if you were running a previous CDC version (without topology
change support), make sure to disable CDC on all tables before performing
the upgrade. This will drop the log data -- back it up if needed.

TODO in future patchset: expire CDC generations. Currently, each inserted
CDC generation will stay in the distributed tables forever (until
manually removed by the administrator). When a generation is superseded,
it should become "expired", and 24 hours after expiration, it should be
removed. The distributed tables (cdc_topology_description and
cdc_description) both have an "expired" column which can be used for
this purpose.

Unit tests: dev, debug, release
dtests (dev): https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/907/
2020-02-04 10:20:29 +02:00
Gleb Natapov
2876482373 lwt: account for cases where LWT request were moved to another shard in statistics
Now that we bounce lwt requests to the correct shard before calling into
storage_proxy the cross shard op accounting does not account for bounced
lwt statement. Fix that by increasing corresponding counter when
returning a "bounce" reply.

Message-Id: <20200203122011.GH26048@scylladb.com>
2020-02-04 10:20:28 +02:00
Nadav Har'El
37f2f6112e cql3::util::maybe_quote: avoid stack overflow and fix quote doubling
Merged patch series from Benny Halevy:

The function was reimplemented to solve the following issues.
The custom implementation also improved its performance by
close to 19%.

Using regex_match("[a-z][a-z0-9_]*") may cause stack overflow on long input strings
as found with the limits_test.py:TestLimits.max_key_length_test dtest.

std::regex_replace does not replace in-place so no doubling of
quotes was actually done.

Add unit test that reproduces the crash without this fix
and tests various string patterns for correctness.

Note that defining the regex with std::regex::optimize
still ended up with stack overflow.
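
A hand-rolled version in the spirit of the fix might look like the sketch
below. It is not the merged implementation: maybe_quote_sketch is a
hypothetical name, and the only behavior assumed is what the text above
states (identifiers matching [a-z][a-z0-9_]* stay unquoted, everything
else is wrapped in double quotes with embedded quotes doubled).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

inline std::string maybe_quote_sketch(const std::string& s) {
    // Match [a-z][a-z0-9_]* with a plain loop; stack usage is constant
    // no matter how long the input is.
    bool plain = !s.empty() && s[0] >= 'a' && s[0] <= 'z';
    for (std::size_t i = 1; plain && i < s.size(); ++i) {
        const char c = s[i];
        plain = (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9') || c == '_';
    }
    if (plain) {
        return s;
    }
    // Build the quoted result explicitly, doubling each embedded quote.
    std::string out;
    out.reserve(s.size() + 2);
    out.push_back('"');
    for (const char c : s) {
        if (c == '"') {
            out.push_back('"');
        }
        out.push_back(c);
    }
    out.push_back('"');
    return out;
}
```

Building the output in a single pass also sidesteps the earlier bug,
since std::regex_replace returns a new string rather than modifying its
argument.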

Fixes #5671

* cql3::util::maybe_quote: avoid stack overflow and fix quote doubling
* cql3::util::maybe_quote: further optimize quote doubling
2020-02-04 10:20:28 +02:00
Nadav Har'El
6e91f159fe LWT: handle bounce_to_shard result for batch statements
Merged patch series from Gleb Natapov:
Batch statement can also execute LWT and hence needs to handle
bounce_to_shard result.

* transport: handle bounce_to_shard for batch statement
* transport: consolidate bounce_to_shard handling between all three verbs that handle it
2020-02-04 10:20:28 +02:00
Takuya ASADA
1446fe930b dist/redhat: install specified version of scylla-conf on meta package (#5599)
To install specified version of scylla-conf package, we need to add it on Requires.

Fixes #5639
2020-02-04 10:20:28 +02:00
Benny Halevy
f45fabab73 gossiper: do_stop_gossiping: copy live endpoints vector
It can be resized asynchronously by mark_dead.

Fixes #5701

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200203091344.229518-1-bhalevy@scylladb.com>
2020-02-04 10:20:28 +02:00
Avi Kivity
501b24cad3 test.py: use command line option in preference to environment variable when calling a test
Command line options are printed out, so if a user cuts-and-pastes a
command line they will get a run that is more similar to the one that
the test executed.
Message-Id: <20200202133209.209608-1-avi@scylladb.com>
2020-02-04 10:20:28 +02:00
Rafael Ávila de Espíndola
1294770970 lua: Use a negative index for consistency
In this case we know the size of the stack and both indexes refer to
the same position. Using a negative index is just more consistent with
the rest of the file and hopefully a bit less brittle to future
changes.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-02-03 18:23:09 -08:00
Rafael Ávila de Espíndola
a4d668e8ed lua: Fix returning list<decimal>
We were accessing the wrong stack location if a decimal was not at top
of the stack.

Fixes: #5711

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-02-03 18:10:04 -08:00
Rafael Ávila de Espíndola
39e637f6bf lua: Fix returning list<varint>
We were accessing the wrong stack location if a varint was not at the
top of the stack.

Refs: #5711

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-02-03 18:09:59 -08:00
Rafael Ávila de Espíndola
530779efb6 lua: Use a lua_slice_state instead of a from_lua_visitor
A few places were using a from_lua_visitor only to access the
lua_slice_state member variable.

This is just a simplification. No functionality changed.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-02-03 18:04:36 -08:00
Rafael Ávila de Espíndola
35023c831c test: Enable UDF in the cql repl
A followup commit will use this to write cql tests for UDF.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-02-03 17:58:27 -08:00
Gleb Natapov
9c75a25e9f transport: consolidate bounce_to_shard handling between all three verbs that handle it
All three verbs that need to handle bounce_to_shard have almost
identical process_*() and process_*_on_shard() functions. Consolidate
them into one to reuse the code.
2020-02-03 14:27:50 +02:00
Gleb Natapov
dd793098fa transport: handle bounce_to_shard for batch statement
Batch statement can also execute LWT and hence needs to handle
bounce_to_shard result.

Fixes: #5644
2020-02-03 14:27:30 +02:00
Pavel Emelyanov
8a7f13420f gossiper: Avoid string merge-split for nothing
The caller of check_knows_remote_features merges a set of
features into a string, but the method in question ... splits
them back into the set. Avoid this unneeded step and clean
the respective storage service helpers.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-03 15:16:23 +03:00
Pavel Emelyanov
ca55c6c15f features: Stop on shutdown
The service in question doesn't depend on anything, so it's
started first and stopped last.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-03 15:16:23 +03:00
Pavel Emelyanov
f6f76ef8c1 storage_service: Remove helpers
The storage_service no longer works as a features provider.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-03 15:16:23 +03:00
Pavel Emelyanov
0e62d615ae storage_service: Prepare to switch from on-board feature helpers
There are some places that get global storage_service instance
for individual features. In the next patch all these helpers
will be removed, so here's the preparation for it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-03 15:16:23 +03:00
Pavel Emelyanov
0abddc4557 cql3: Check feature in .validate
There's no local variable to get features from in the
create_view_statement constructor, but since the .validate
is always called after it, it looks safe to check for
needed feature in it (we have storage_proxy there).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-03 15:16:23 +03:00
Pavel Emelyanov
abe588888d database: Use feature service
Keep local feature_service reference on database. This relaxes the
circular storage_service <-> database reference, but does not remove it
completely.

This needs some args tossing in apply_to_builder, but it's
rather straightforward, so comes in the same patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-03 15:16:23 +03:00
Pavel Emelyanov
12c1378be0 storage_proxy: Use feature service
Keep reference on local feature service from storage_proxy
and use it in places that have (local) storage_proxy at hands.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-03 15:16:23 +03:00
Pavel Emelyanov
4f5b70dcb1 migration_manager: Use feature service
This unties migration_manager from storage_service thus breaking
the circular dependency between these two.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-03 15:16:23 +03:00
Pavel Emelyanov
74fd3466b5 start: Pass needed feature as argument into migrate_truncation_records
As a nice side-effect this stops using global storage service
instance by this function.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-03 15:16:23 +03:00
Pavel Emelyanov
aa6b1efc35 features: Unfriend storage_service
The storage service no longer needs to mess with feature
config. It only needs two features to register itself in,
but this can be solved by respective cluster_supports_foo
helpers.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-03 15:16:23 +03:00
Pavel Emelyanov
9b67226715 features: Simplify feature registration
Now features are registered into a map of vectors, but
it looks like the vector is always 1-item long and is
used to keep pointer on feature, instead of the feature
itself.

Switch it into map of reference_wrapper-s.

Before this patch we could register more than one
feature under the same name, now we can't. But this
seems to be OK, as we don't actually do this. To catch
violations of this restriction there's an assert() in the
feature_service::register_feature.
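
A minimal sketch of the resulting shape, with hypothetical names
(feature_sketch, feature_service_sketch) standing in for the real
classes:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>

// Hypothetical stand-in for a feature object owned elsewhere.
struct feature_sketch {
    std::string name;
    bool enabled = false;
};

class feature_service_sketch {
    // One reference per name instead of a vector of pointers.
    std::map<std::string, std::reference_wrapper<feature_sketch>> _registered;
public:
    void register_feature(feature_sketch& f) {
        const bool inserted = _registered.emplace(f.name, std::ref(f)).second;
        // Registering two features under the same name is a bug.
        assert(inserted);
        (void)inserted;
    }

    void enable(const std::string& name) {
        auto it = _registered.find(name);
        if (it != _registered.end()) {
            it->second.get().enabled = true;
        }
    }

    std::size_t size() const { return _registered.size(); }
};
```

Storing std::reference_wrapper keeps the map copyable while still
referring to the original feature objects, which must outlive the
service in this sketch.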

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-03 15:16:23 +03:00
Pavel Emelyanov
da6af8bde7 features: Introduce known_feature_set
There are two masks -- supported and known. They differ in
unbounded_range_tombstones one which is set depending on the
sstables format in use.

Since the feature_service doesn't know anything about sstables
format, the logic is reverted -- the feature service reports
back the known mask (all features) and storage_service clears
the unbounded_range_tombstones if the sst format is low -- but
is (hopefully) left intact.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-03 15:16:23 +03:00
Pavel Emelyanov
4a01f468dd features: Move disabled features set from storage_service
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-03 15:16:23 +03:00
Pavel Emelyanov
a5b1998247 features: Move schema_features helper
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-03 15:16:23 +03:00
Pavel Emelyanov
b0638606e5 features: Move all features from storage_service to feature_service
And leave some temporary storage_service->feature links. The plan
is to make every subsystem that needs storage_service for features
stop doing so and switch on the feature_service.

The feature_service is the service w/o any dependencies, it will be
freed last, thus making the service dependency tree be a tree,
not a graph with loops.

While at it -- make all const-s not have _FEATURE suffix (now there
are both options) and const-qualify cluster_supports_lwt().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-03 15:16:23 +03:00
Pavel Emelyanov
49de3b4ad8 storage_service: Use feature_config from _feature_service
This makes the testing/prod config logic much simpler.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-03 15:16:23 +03:00
Pavel Emelyanov
052259f8ef features: Add feature_config
Some features take db::config to find out whether to be enabled
or disabled. This creates unwanted dependency between database and
features, so split the features configuration explicitly. Also
this will make the "this is for testing env only" logic cleaner
and simpler to understand.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-03 15:16:23 +03:00
Pavel Emelyanov
d38f8ca52a storage_service: Kill set_disabled_features
The _disabled_features is configured by tests via storage_service
constructor, so the helper in question is effectively useless.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-03 15:16:23 +03:00
Pavel Emelyanov
76a7fd4186 gms: Move features stuff into own .cc file
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-03 15:16:21 +03:00
Kamil Braun
4b3754ff94 docs: add documentation about CDC generations 2020-02-03 10:57:31 +01:00
Kamil Braun
b130b76274 test: disable CDC flag by default
When CDC flag is on, the node startup procedure takes a few seconds
longer (we have to generate CDC streams). This is not necessary in
non-CDC tests.
2020-02-03 10:57:31 +01:00
Kamil Braun
0d41e2c1fe test: add cdc::generate_timeuuid tests 2020-02-03 10:57:31 +01:00
Kamil Braun
5fb5925fb4 test: add cdc::find_timestamp tests 2020-02-03 10:57:31 +01:00
Kamil Braun
7cb6ac33f5 storage_service: check if we know other nodes' tokens when joining ring
If we are a seed node (but not the only one) or we set
auto_bootstrap=off, it might happen due to misconfiguration or a network
partition that we don't know other nodes' tokens at the end of the
join_token_ring function, when we go into the NORMAL status, finishing
the joining process.

CDC however requires that we know other nodes' tokens at this point:
we need them to correctly create a new CDC generation.

This commit adds a check which prevents the node from starting if that's
not the case. If the check fails, the node first tries waiting a bit until
it learns about the tokens or times out.
2020-02-03 10:57:28 +01:00
Pavel Emelyanov
7a2123c8dc migration_manager: Move some fns into class
These methods will need to have this-> in one of the
next patches.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-02-03 12:29:54 +03:00
Avi Kivity
2816404f57 test.py: documented exit code value
Document our chosen exit failure code value and its relationship
to git bisect.
Message-Id: <20200202134223.210578-1-avi@scylladb.com>
2020-02-03 00:58:58 +02:00
Avi Kivity
541893e69a Merge "Fix conversion of lua nil to cql null" from Rafael
"
The fix itself is fairly simple, but looking at the code I found that
our code base was not cleanly distinguishing null and empty values and
was treating null and missing values differently, but that distinction
was dead since a null is represented as a dead cell.
"

* 'espindola/lua-fix-null-v6' of https://github.com/espindola/scylla:
  lua: Handle nil returns correctly
  types: Return bytes_opt from data_value::serialize
  query-result-set: Assert that we don't have null values
  types: Fix comparison of empty and null data_values
  Revert "tests: Handle null and not present values differently"
  query-result-set: Avoid a copy during construction
  types: Move operator== for data_value out-of-line
2020-02-02 15:43:24 +02:00
Avi Kivity
c8890eb124 Merge "Simplify usage of stream subscriptions" from Rafael
"
In a few places, the only use we had for a subscription was calling
done(). With this series we now call done() early and store the
future<> instead.
"

* 'espindola/stream-cleanup' of https://github.com/espindola/scylla:
  sstable_test: Store a future<> instead of a subscription
  commitlog: Store a future instead of a subscription in db::commitlog::segment_manager::list_descriptors::helper
  lister: Store a future<> instead of a subscription
2020-02-02 14:49:00 +02:00
Rafael Ávila de Espíndola
5dfb658e77 build: Add two missing dependencies
With this change we always rebuild seastar/libseastar_testing.a for
the same reason we always rebuild seastar/libseastar.a: We have no
idea what its dependencies are, we have to recurse to seastar to find
out.

The other missing dependency is that we have to rebuild build.ninja
when seastar/CMakeLists.txt changes. A change in
seastar/CMakeLists.txt can cause seastar.pc to change which can change
the command lines used.

That is incomplete as changes to other seastar files can have the same
impact, but it is better than nothing.

It is not sufficient to put a dependency in the seastar.pc file as
that file will be modified when cmake is run and the scylla ninja
process doesn't see the CMakeLists.txt to seastar.pc edge.

Fixes: #5687

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200201001126.458992-1-espindola@scylladb.com>
2020-02-01 21:08:26 +02:00
Pavel Emelyanov
4839ca8491 storage_service: Unregister from gossiper notifications ... at all
This unregistration doesn't happen currently, but doesn't seem to
cause any problems in general, as on stop gossiper is stopped and
nothing from it hits the storage_service.

However (!) if an exception pops up between the storage_service
is subscribed on gossiper and the drain_on_shutdown defer action
is set up, then we _may_ get into the following situation:

- main's stuff gets unrolled back
- gossiper is not stopped (drain_on_shutdown defer is not set up)
- migration manager is stopped (with deferred action in main)
- a notification comes from gossiper
    -> storage_service::on_change might want to pull schema with
       the help of local migration manager
    -> assert(local_is_initialized) strikes

Fix this by registering storage_service to gossiper a bit earlier
(both are already initialized by that time) and setting up unregister
defer right afterwards.

Test: unit(dev), manual start-stop
Bug: #5628

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200130190343.25656-1-xemul@scylladb.com>
2020-01-31 14:02:18 +01:00
Avi Kivity
ec5b721db7 test: make eventually() more patient
We use eventually() in tests to wait for eventually consistent data
to become consistent. However, we see spurious failures indicating
that we wait too little.

Increasing the timeout has a negative side effect in that tests that
fail will now take longer to do so. However, this negative side effect
is negligible compared to false-positive failures, since they throw away large
test efforts and sometimes require a person to investigate the problem,
only to conclude it is a false positive.

This patch therefore makes eventually() more patient, by a factor of
32.
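
As a rough illustration of how a factor of 32 can come out of a retry loop,
here is a sketch in Python that assumes exponential backoff. The attempt
counts (12 and 17) and the 1 ms base delay are assumptions made for the
example, not values taken from this patch.

```python
import time


def eventually(check, max_attempts=17, sleep=time.sleep):
    """Retry `check` until it stops raising, doubling the pause each time."""
    for attempt in range(max_attempts):
        try:
            return check()
        except AssertionError:
            if attempt == max_attempts - 1:
                raise
            sleep(0.001 * (1 << attempt))


def total_wait(max_attempts):
    """Worst-case total sleep in seconds for the schedule above."""
    return sum(0.001 * (1 << a) for a in range(max_attempts - 1))
```

With doubling delays, five extra attempts multiply the worst-case wait by
roughly 2**5: total_wait(17) / total_wait(12) is a little over 32.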

Fixes #4707.
Message-Id: <20200130162745.45569-1-avi@scylladb.com>
2020-01-31 14:02:18 +01:00
Dejan Mircevski
6661ed7de4 cql3: Drop restrictions::values() method
No-one seems to invoke this method.  Instead, clients invoke
restriction::values (note singular "restriction").  Most subclasses of
restrictions also inherit from restriction, so values() still exists
in their public interface.

Tests: unit (dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-01-31 13:05:51 +01:00
Avi Kivity
985e00efa6 Merge "Fix the serialization of negative varint values" from Rafael
"
Benny pointed out that we could avoid a branch inside a loop in the
old serialization code. That got me looking at the logic and I found
that it would also produce an unnecessary 0xff prefix for some
negative numbers.

This patch series fixes the serialization and optimizes it. It now
does no extra copies for positive numbers and only one extra copy for
negative numbers, which I think is optimal since cpp_int uses sign
magnitude and we want the two's complement representation.
"

* 'espindola/serialize_varint-improvements-v2' of https://github.com/espindola/scylla:
  types: Use a fancy iterator to avoid a temporary buffer
  types: Use export_bits to serialize cpp_int
  types: Avoid a branch in a loop
  types: Fix encoding of negative varint
  types: Replace "num.sign() < 0" with "num < 0"
2020-01-30 20:35:54 +02:00
Rafael Ávila de Espíndola
cc81ba3432 types: Use a fancy iterator to avoid a temporary buffer
By using a fancy iterator we can avoid calling export_bits with a
temporary buffer before copying the result to the output.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-01-30 10:26:39 -08:00
Rafael Ávila de Espíndola
7e67ce0bdb types: Use export_bits to serialize cpp_int
This avoids a copy when serializing positive numbers.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-01-30 10:26:39 -08:00
Rafael Ávila de Espíndola
27a67f1a2c types: Avoid a branch in a loop
Thanks to Benny for the suggestion.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-01-30 10:26:39 -08:00
Rafael Ávila de Espíndola
c89c90d07f types: Fix encoding of negative varint
We would sometimes produce an unnecessary extra 0xff prefix byte.

The new encoding matches what cassandra does.

This was both an efficiency and correctness issue, as using varint in a
key could produce different tokens.
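
To make the extra prefix byte concrete, here is a small Python sketch of a
minimal big-endian two's complement encoding. It only illustrates the
encoding rule; encode_varint and decode_varint are names invented for the
example and say nothing about how the C++ code is structured.

```python
def encode_varint(n: int) -> bytes:
    """Shortest big-endian two's complement encoding of n."""
    # A negative n needs as many bits as ~n, plus one sign bit.
    magnitude_bits = (~n if n < 0 else n).bit_length()
    length = (magnitude_bits + 1 + 7) // 8
    return n.to_bytes(length, "big", signed=True)


def decode_varint(data: bytes) -> int:
    """Decode a big-endian two's complement byte string."""
    return int.from_bytes(data, "big", signed=True)
```

Both b"\x80" and b"\xff\x80" decode to -128, but only the first is minimal.
Since the two byte strings differ, anything computed over the serialized
bytes, such as a hash, differs too.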

Fixes #5656

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-01-30 10:25:09 -08:00
Rafael Ávila de Espíndola
ed747122aa types: Replace "num.sign() < 0" with "num < 0"
Surprisingly, this produces better code with cpp_int.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-01-30 10:24:03 -08:00
Rafael Ávila de Espíndola
cc9495d4d3 sstable_test: Store a future<> instead of a subscription
The only use we had for the subscription was calling done, may as well
call it early and store the future<>.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-01-30 08:31:28 -08:00
Rafael Ávila de Espíndola
da984f1f33 commitlog: Store a future instead of a subscription in db::commitlog::segment_manager::list_descriptors::helper
The only use we had for the subscription was calling done, may as well
call it early and store the future<>.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-01-30 08:31:28 -08:00
Rafael Ávila de Espíndola
b88f6edee0 lister: Store a future<> instead of a subscription
The only use we had for the subscription was calling done, may as well
call it early and store the future<>.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-01-30 08:31:28 -08:00
Gleb Natapov
b08679e1d3 db/system_keyspace: use user memory limits for local.paxos table
Treat writes to local.paxos as user memory, as the number of writes is
dependent on the amount of user data written with LWT.

Fixes #5682

Message-Id: <20200130150048.GW26048@scylladb.com>
2020-01-30 17:07:27 +02:00
Piotr Sarna
b783d40aaf Merge 'Add per scheduling groups statistics' from Eliran
This set implements support for per scheduling group statistics in
storage proxy and tables view statistics (although tables view per
scheduling group stats are not actively applied in this series).
Having those statistics per scheduling group can help in finding operations
that are performed outside their context. Another advantage is that
it lays the groundwork for supporting per service level statistics for the
workload prioritization enterprise feature.
At some point there was a thought to add those stats per role, but
for now it is not feasible:
1. The number of roles/users is unbounded so it is dangerous to
hold stats (in memory) for all of them.
2. We will need a proper design of how to deal with the hierarchical
nature of roles in the stats.

Regardless of these reasons, it is beneficial to look at
resource related stats per scheduling group. Looking at resources
per user or role will not necessarily give insights since resources
are divided per sg and not role, so it can lead to false conclusions
if more than one role is attached to the same service level.

Tests:
unit tests (Dev, Debug)
validating the stats with monitor

* es/per_sg_stats/v6:
  storage proxy: migrate to per scheduling group statistics
  internalize storage proxy statistics metric registration
2020-01-30 15:02:33 +01:00
Eliran Sinvani
971711a546 storage proxy: migrate to per scheduling group statistics
This commit builds on top of the introduced per scheduling group
statistics template and employs it for achieving a per scheduling
group statistics in storage_proxy.

Some of the statistics also had meaning as global, per-shard
ones. Those are the ones for determining whether to
throttle the write request. This was handled by creating a
global stats struct that will hold those stats and by changing
the stat update to also include the global one.

One point that complicated it is an already existing aggregation
over the per shard stats that now became a per scheduling group
per shard stats, converting the aggregation to a two-dimensional
aggregation.

One thing this commit doesn't handle is validating that an individual
statistic didn't "cross a scheduling group boundary". Such validation
is possible and can easily be added in the future. There is a
subtlety to doing so, since if the operation did cross to another
scheduling group, two connected statistics can lose balance,
for example written bytes and completed write transactions.

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
2020-01-30 15:01:44 +01:00
Eliran Sinvani
8cfc2aad57 internalize storage proxy statistics metric registration
The storage proxy statistics structure did not contain
a method for registering the statistics for metric
groups, instead, each user had to register some
of the metrics by itself. There is no real reason
for separating the metrics registration from
the statistics data. There is even less justification
for doing this only for part of the stats as is
the case for those statistics.
This commit internalizes the metrics registration
in the storage_proxy stats structures.

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
2020-01-30 15:01:40 +01:00
Gleb Natapov
c138dfd33e lwt: introduce LWT gossiper feature
Do not allow LWT operations if LWT is not enabled by the entire cluster.

Message-Id: <20200130120912.GV26048@scylladb.com>
2020-01-30 15:12:56 +02:00
Benny Halevy
606db0d412 cql3::util::maybe_quote: further optimize quote doubling
Avoid string copies when doubling quotes in the string
by counting them when scanning the input string and
reserving the required space when making the result std::string.

This showed a performance improvement of ~1.8% when
running the maybe_quote unit test in a tight loop
(w/ the shorter strings only)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-01-30 14:55:51 +02:00
Rafael Ávila de Espíndola
a16cb00719 configure: Don't use -Wno-error when building seastar
This depends on the recent patches to avoid warnings in seastar.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200127210833.200410-1-espindola@scylladb.com>
2020-01-30 14:10:18 +02:00
Avi Kivity
09e2556541 Update seastar submodule
* seastar 44cf127ee9...65980a9b30 (2):
  > io_tester: fix the fix for lack of file closing
  > cmake: Disable broken gcc warning -Warray-bounds
2020-01-30 14:10:18 +02:00
Avi Kivity
b01f0cab60 utils: add missing include for ssize_t
gcc 10 tightened its C++ includes to no longer provide ssize_t,
so we must get it from a C header instead.
Message-Id: <20200129205912.21139-1-avi@scylladb.com>
2020-01-30 14:10:18 +02:00
Avi Kivity
adb64dc72f treewide: tighten concepts syntax
gcc 10 requires a semicolon after every compound requirement,
as per the standard. Add missing semicolons where necessary.
Message-Id: <20200129205805.20928-1-avi@scylladb.com>
2020-01-30 14:10:18 +02:00
Rafael Ávila de Espíndola
4b4efcf302 types: Remove collection_type_impl::serialize
The rest of the serialize api has been devirtualized some time ago,
but this auxiliary function stayed virtual.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200129203916.20460-1-espindola@scylladb.com>
2020-01-30 14:10:18 +02:00
Kamil Braun
bd42b10df1 cdc: rename cdc/cdc.{hh,cc} to cdc/log.{hh,cc}
To increase modularity, making it easier to find what is where and
maintain.

The 'log' module (cdc/log.{hh,cc}) is responsible for updating CDC log
tables when base table writes are performed.

The 'generation' module (cdc/generation.{hh,cc}) handles stream
generation changes in response to topology change events.

cdc/metadata.{hh,cc} contains a helper class which holds the currently
used generation of streams. It is used by both aforementioned modules:
'log' queries it, while 'generation' updates it.
2020-01-30 11:10:39 +01:00
Kamil Braun
1a56310687 locator: remove get_shard_count and get_ignore_msb_bits from snitch
Snitch forms a class hierarchy which get_shard_count and
get_ignore_msb_bits ignore (their returned values only depend on the
gossiper's state).

Besides, these functions just don't belong there.
Snitch has nothing to do with shard_count or ignore_msb_bits.
2020-01-30 11:10:08 +01:00
Kamil Braun
e91af78cf5 cdc: update streams description table
Inform CDC users about newly generated streams.
2020-01-30 11:10:08 +01:00
Kamil Braun
cbe510d1b8 cdc: use stream generations
Change the CDC code to use the global CDC stream generations.

The per-base-table CDC description table was removed. The code instead
uses cdc::metadata which is updated on gossip events.

The per-table description tables were replaced by a global description
table to be used by clients when searching for streams.
2020-01-30 11:10:08 +01:00
Kamil Braun
8f4a2ba0b9 storage_service: learn about CDC stream generations.
When a node learns that another node joins the cluster (or begins
the joining process, i.e. bootstrap), it will read the CDC generation
timestamp proposed by that node, use it to retrieve the generation from the
distributed generations table, and save it in its local generation queue
to be used for writing to the CDC log when its local clock crosses
the generation's timestamp.

The CDC generation is saved in the queue before tokens are saved in
token_metadata. This is important so that when the node becomes
a coordinator of a write, it will already have all the necessary
information required to generate a corresponding CDC log mutation.

After joining, nodes should keep gossiping their proposed stream
generation timestamps forever, until they learn about a newer timestamp,
in which case they'll start gossiping the new timestamp.

There is one case where a node won't gossip any generation timestamp:
if it's upgrading from a non-CDC version.
In this situation we make one of the nodes begin the first generation.
2020-01-30 11:10:08 +01:00
Kamil Braun
834c2ca997 cdc: add cdc::metadata class
The class stores a queue of CDC generations to be used for choosing
streams when writing to the CDC log.

This data structure will be updated on some gossip events (when a new node
joins the cluster and proposes a new generation of CDC streams).
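
A queue like this could be sketched as below. This is a hypothetical Python
model, not the cdc::metadata interface: GenerationQueue, insert and
generation_for are invented names, and it assumes that a write uses the
newest generation whose timestamp is not in the future.

```python
import bisect


class GenerationQueue:
    """Hypothetical model: generations keyed by their start timestamp."""

    def __init__(self):
        self._timestamps = []   # kept sorted
        self._generations = {}  # timestamp -> generation

    def insert(self, timestamp, generation):
        if timestamp not in self._generations:
            bisect.insort(self._timestamps, timestamp)
        self._generations[timestamp] = generation

    def generation_for(self, now):
        # Newest generation whose timestamp the clock has already crossed.
        i = bisect.bisect_right(self._timestamps, now)
        if i == 0:
            return None
        return self._generations[self._timestamps[i - 1]]
```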
2020-01-30 11:10:08 +01:00
Kamil Braun
86af2a63ec clocks: add printing functions
For debugging and logging.
2020-01-30 11:10:08 +01:00
Kamil Braun
34e4ce275d storage_service: restore CDC streams timestamp when replacing a node
When a node is replacing another node it will keep gossiping its CDC
streams generation timestamp.
2020-01-30 11:10:08 +01:00
Kamil Braun
a6e62dba95 cdc: add get_streams_timestamp_for(endpoint) method
In future commits this will be used by nodes learning about other nodes
entering NORMAL status. The joining node proposes a new generation of streams,
whose timestamp is gossiped by the node.
2020-01-30 11:10:08 +01:00
Kamil Braun
37ae37db38 storage_service: move get_application_state_value method to gossiper 2020-01-30 11:10:08 +01:00
Kamil Braun
b44c63a127 storage_service: small refactors in prepare_replacement_info 2020-01-30 11:10:08 +01:00
Kamil Braun
32f4489a18 storage_service: generate CDC streams generation and gossip its timestamp.
Generate a new generation of streams during bootstrap,
insert it into an internal distributed table for other nodes to read
and save its timestamp in the system.local table.

When restarting, read the generation timestamp from the system.local table.

Gossip the generation timestamp.
2020-01-30 11:10:08 +01:00
Kamil Braun
19f23c6de1 cdc: add cdc-related node startup functions 2020-01-30 11:10:08 +01:00
Kamil Braun
96e5d6c924 token_metadata: add count_normal_token_owners method 2020-01-30 11:10:08 +01:00
Kamil Braun
52d71832f8 gossiper: make some methods const 2020-01-30 11:10:08 +01:00
Kamil Braun
3ae7b6cbc4 versioned_value: add cdc_streams_timestamp
This will be used to inform other nodes that a new CDC streams
generation has been created.
2020-01-30 11:10:08 +01:00
Kamil Braun
7fa30f6f34 db: add a system.cdc_local table with CDC generation timestamp
This will be used to persist CDC streams generation timestamp
proposed by a joining node in case the node crashes or restarts,
similarly to the way tokens are persisted.

The get_saved_cdc_streams_timestamp method retrieves the generation
timestamp from the system table. It will be used by a restarting
node.

The update_cdc_streams_timestamp method saves CDC stream
generation timestamp of the calling node in the system table.
A joining node will persist the timestamp before it proposes it to other
nodes.
2020-01-30 11:10:08 +01:00
Piotr Jastrzebski
04fe18de0f system_distributed_keyspace: add cdc-related tables
The cdc_topology_description table will be used internally
by nodes to send new CDC stream generations to other nodes.

The cdc_description table is a user-facing table,
used to inform users about new sets of CDC streams.

Regenerate sstables and digests for schema_change_test.
We don't need to protect this change by a schema feature:
when a node creates these tables, it announces them
to all other nodes. If schema agreement happens before
this migration, all nodes will use a digest calculated
without these tables. If it happens after, then all nodes
will eventually know about these tables and use a digest
calculated with these tables.
2020-01-30 11:10:08 +01:00
Piotr Jastrzebski
9fa18c03c1 cdc: add generate_topology_description
cdc::topology_description describes a mapping of tokens to CDC streams.

The cdc::generate_topology_description function is given:
1. a set of tokens which split the token ring into token ranges (vnodes),
2. information on how each token range is distributed among its owning
   node's shards
and tries to generate a set of CDC stream identifiers such that for each
shard and vnode pair there exists a stream whose token falls into this
vnode and is owned by this shard.

It then builds a cdc::topology_description which maps tokens to these
found stream identifiers, such that if token T is owned by shard S in
vnode V, it gets mapped to the stream identifier generated for (S, V).
2020-01-30 11:10:07 +01:00
Piotr Jastrzebski
a3748f942e cdc: add topology_description class
This is a class that will be used for storing information
required to perform CDC operations, i.e. assignment of token ranges to
CDC streams.

It is serializable to bytes and will be stored
in such a form in a distributed table accessible
by all nodes.
2020-01-30 11:10:07 +01:00
Kamil Braun
36ee36618a dht: add i_partitioner::shard_of(token, shard_count, ignore_msb) method
Allows calculating the shard of the given token using custom values of
shard_count and sharding_ignore_msb (instead of the ones used by the
particular partitioner instance).
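
For illustration, one way such a function could map a token to a shard is
sketched below in Python. The arithmetic (biasing the signed token to an
unsigned 64-bit value, dropping the most significant bits, then scaling by
the shard count) is an assumption made for the example and may not match the
partitioner's actual code.

```python
MASK64 = (1 << 64) - 1


def shard_of(token: int, shard_count: int, ignore_msb: int) -> int:
    """Hypothetical token-to-shard mapping for a signed 64-bit token."""
    biased = (token + (1 << 63)) & MASK64   # map [-2**63, 2**63) to [0, 2**64)
    shifted = (biased << ignore_msb) & MASK64  # discard the top ignore_msb bits
    return (shifted * shard_count) >> 64    # scale into [0, shard_count)
```

Because shard_count and ignore_msb are arguments, the same token can be
evaluated against another node's sharding parameters rather than the local
ones.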
2020-01-30 11:10:07 +01:00
Kamil Braun
f4f8593bac dht/murmur3_partitioner: take private methods out of the class
The methods were made static functions of the murmur3_partitioner
module.
2020-01-30 11:09:48 +01:00
Benny Halevy
0329fe1fd1 cql3::util::maybe_quote: avoid stack overflow and fix quote doubling
The function was reimplemented to solve the following issues.
The custom implementation also improved its performance by
close to 19%.

Using regex_match("[a-z][a-z0-9_]*") may cause stack overflow on long input strings
as found with the limits_test.py:TestLimits.max_key_length_test dtest.

std::regex_replace does not replace in-place so no doubling of
quotes was actually done.

Add unit test that reproduces the crash without this fix
and tests various string patterns for correctness.

Note that defining the regex with std::regex::optimize
still ended up with stack overflow.
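
The shape of a regex-free implementation can be sketched in Python as
follows. It is an illustration of the approach only: it uses a linear scan
for the [a-z][a-z0-9_]* check and doubles embedded quotes, and it ignores
anything else the real function may handle (reserved keywords, for example).

```python
def maybe_quote(identifier: str) -> str:
    """Return identifier unchanged if it is plain, else double-quoted."""
    # Linear scan instead of a regex: a lowercase letter first, then
    # lowercase letters, digits, or underscores.
    plain = (
        bool(identifier)
        and "a" <= identifier[0] <= "z"
        and all("a" <= c <= "z" or "0" <= c <= "9" or c == "_" for c in identifier)
    )
    if plain:
        return identifier
    # Quote the identifier, doubling every embedded double quote.
    return '"' + identifier.replace('"', '""') + '"'
```

The scan uses constant stack space regardless of input length, so very long
identifiers are handled the same way as short ones.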

Fixes #5671

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-01-30 12:00:30 +02:00
Rafael Ávila de Espíndola
e4b8f52237 commitlog: Simplify the return of read_log_file
This function really just wants to signal it is done, so return a
future<>.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200128172847.31513-1-espindola@scylladb.com>
2020-01-30 12:00:29 +02:00
Gleb Natapov
67deab0661 test: fix cql_repl to be able to run lwt tests on smp
Handle bounce_to_shard result properly in cql_repl.

Message-Id: <20200129122547.GO26048@scylladb.com>
2020-01-30 11:37:27 +02:00
Konstantin Osipov
4d3423b983 test.py: add a help file
Message-Id: <20200128210426.24509-2-kostja@scylladb.com>
2020-01-30 11:05:02 +02:00
Avi Kivity
5842833d62 test.py: change test failure exit code to be more friendly to git bisect
test.py returns -1 on failure; exit() translates that to 255, which git
bisect interprets as a special exit code requiring manual intervention.

Change to return the more traditional 1 on failure, which git bisect
can interpret as a normal failure condition.
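
The effect can be reproduced with a short Python sketch. The helper names
are invented for the example; the classification follows the exit status
rules documented for git bisect run, and the 255 result assumes a POSIX
system where exit statuses are truncated to 8 bits.

```python
import subprocess
import sys


def exit_status(code: int) -> int:
    """Run a child Python that calls sys.exit(code); return what the parent sees."""
    child = subprocess.run([sys.executable, "-c", f"import sys; sys.exit({code})"])
    return child.returncode


def bisect_verdict(status: int) -> str:
    """How `git bisect run` treats a script's exit status."""
    if status == 0:
        return "good"
    if status == 125:
        return "skip"
    if 1 <= status <= 127:
        return "bad"
    return "abort"  # anything else stops the bisection
```

On Linux, exit_status(-1) is 255, which bisect_verdict classifies as
"abort", while an exit status of 1 is classified as "bad".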
Message-Id: <20200130084950.4186598-1-avi@scylladb.com>
2020-01-30 11:02:22 +02:00
Rafael Ávila de Espíndola
090164791c logalloc: Store unused ids in a std::vector
There doesn't seem to be any requirement for how unused ids are
reused, so we may as well use the simpler type.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200129211154.47907-1-espindola@scylladb.com>
2020-01-30 10:31:16 +02:00
Rafael Ávila de Espíndola
bd7593eab3 lua: Handle nil returns correctly
With this patch lua nil values are mapped to CQL null values instead
of producing an error.

Fixes #5667

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-01-29 14:05:01 -08:00
Rafael Ávila de Espíndola
bd93a0af52 types: Return bytes_opt from data_value::serialize
Since a data_value can contain a null value, returning bytes from
serialize() was losing information as it was mapping null to empty.

This also introduces a serialize_nonnull that still returns bytes, but
results in an internal error if called with a null value.
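
The distinction can be modelled in a few lines of Python. DataValue here is
a hypothetical stand-in wrapping an optional string, not the C++ class; None
plays the role of a null value and of an empty bytes_opt.

```python
from typing import Optional


class DataValue:
    """Hypothetical stand-in: wraps a string, or None for null."""

    def __init__(self, value: Optional[str]):
        self.value = value

    def serialize(self) -> Optional[bytes]:
        # Null stays distinguishable from a value that serializes to b"".
        if self.value is None:
            return None
        return self.value.encode("utf-8")

    def serialize_nonnull(self) -> bytes:
        data = self.serialize()
        if data is None:
            raise RuntimeError("internal error: serialize_nonnull on null value")
        return data
```

With this shape, DataValue(None).serialize() is None while
DataValue("").serialize() is b"", so a caller can no longer confuse the two.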

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-01-29 14:04:59 -08:00
Avi Kivity
5137b596f8 build_id: add missing include for assert()
build_id.cc uses assert() but doesn't include the header.

Reviewed-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200129205515.20406-1-avi@scylladb.com>
2020-01-29 23:44:50 +02:00
Rafael Ávila de Espíndola
2b45edd97e query-result-set: Assert that we don't have null values
Null values are represented with dead cells and never included in a
result_set. To enforce that, this adds a non_null_data_value that
wraps a data_value and whose constructor calls on_internal_error if a
null data_value is passed.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-01-29 13:24:10 -08:00
Rafael Ávila de Espíndola
3abac35d9f types: Fix comparison of empty and null data_values
Before this patch a null data_value would compare equal to any
data_value that serialized to an empty byte sequence.

With this patch null only compares equal to null.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-01-29 13:24:10 -08:00
Rafael Ávila de Espíndola
9031294ea9 Revert "tests: Handle null and not present values differently"
This reverts commit 2ebd1463b2.

The test introduced by that commit was wrong, and in fact depended on
a bug in operator== for data_value. A followup patch fixes operator==,
so this reverts the broken commit first.

The reason it was broken was that it created a live cell with a null
data_value. In reality, null values are represented with dead cells.

For example, the sstable produced by

CREATE TABLE my_table (key int PRIMARY KEY, v1 int, v2 int) with compression = {'sstable_compression': ''};
INSERT INTO my_table (key, v1, v2) VALUES (1, 42, null);

Is

    00 04                   key_length
    00 00 00 01             key
    7f ff ff ff             local_deletion_time
    80 00 00 00 00 00 00 00 marked_for_delete_at
    24                      HAS_ALL_COLUMNS | HAS_TIMESTAMP

    09                      row_body_size
    12                      prev_unfiltered_size
    00                      delta_timestamp

    08                      USE_ROW_TIMESTAMP_MASK
    00 00 00 2a             value
    0d                      USE_ROW_TIMESTAMP_MASK | HAS_EMPTY_VALUE_MASK | IS_DELETED_MASK
    00                      deletion time
    01                      END_OF_PARTITION

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-01-29 13:24:10 -08:00
Rafael Ávila de Espíndola
66290c3bb9 query-result-set: Avoid a copy during construction
No functionality change.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-01-29 13:24:10 -08:00
Rafael Ávila de Espíndola
02e8e8d6b3 types: Move operator== for data_value out-of-line
Most of the work is done by decompose and compare which are
out-of-line anyway.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-01-29 13:24:10 -08:00
Piotr Sarna
d13492485f alternator: restore Python2 compatibility for test_tag
... by explicitly declaring utf-8 encoding.
Message-Id: <e99789876176cf722ccfc297621338dc93843588.1580301449.git.sarna@scylladb.com>
2020-01-29 18:11:47 +02:00
Nadav Har'El
ce0c9c1044 merge: add tagging to alternator
Merged patch series from Piotr Sarna:

This series adds the following to alternator:
 - TagResource request
 - UntagResource request
 - ListTagsOfResource request
 - Honoring "Tags" parameter in CreateTable

It also provides more tests for above features and extended docs.
Tagging is backed by a schema extension, which is in turn backed
by entries in system_schema.tables.extensions map.

Tags are considered part of the schema, and in particular
they are updated via an equivalent of:
ALTER TABLE table WITH scylla_tags = {'key1':'v1', 'key2':'v2'}
Each tag change is therefore a schema change, which also means
that editing tags for the same table on different nodes may be
subject to races, until the schema agreement issues are resolved
in Scylla.

Fixes #5066
Tests: alternator-test(local, remote)

Piotr Sarna (6):
  alternator,main: add tags schema extension
  alternator: add creating values from string views
  alternator: implement tagging
  alternator: allow tagging on table creation
  docs: add entries for alternator tags and arn
  alternator-test: make test tables case sensitive

 alternator-test/test_tag.py   |  63 ++++++++++-
 alternator-test/util.py       |   2 +-
 alternator/executor.cc        | 191 ++++++++++++++++++++++++++++++++--
 alternator/executor.hh        |   3 +
 alternator/rjson.cc           |   4 +
 alternator/rjson.hh           |   1 +
 alternator/server.cc          |   3 +
 alternator/tags_extension.hh  |  52 +++++++++
 docs/alternator/alternator.md |  14 ++-
 main.cc                       |   5 +
 10 files changed, 325 insertions(+), 13 deletions(-)
 create mode 100644 alternator/tags_extension.hh
2020-01-29 18:11:47 +02:00
Botond Dénes
69f606baa0 database: check timeout before applying writes
Attempting to apply timed-out writes is a wasted effort. The coordinator
has already given up on the write and reported it as failed to the
client. Any cycles spent on this write are wasted at this point.
We currently only check the timeout if the write is blocked on memory,
otherwise, if the system is not under pressure, we will happily apply
timed out writes. If the system is under pressure we will make it worse
by wasting cycles on processing a timed out write.

Prevent this by checking the timeout as early as possible in
`database::apply()` and `database::apply_counter_update()`.
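
The idea of checking first and working second can be sketched as below. This
is a Python illustration with invented names (apply_write, TimedOut) and an
injectable clock; it does not reflect the signatures of the functions named
above.

```python
import time


class TimedOut(Exception):
    """Raised when a write's deadline has already passed."""


def apply_write(write, deadline, now=time.monotonic):
    # Reject first: a write whose deadline has passed was already reported
    # as failed to the client, so doing the work would be wasted effort.
    if now() >= deadline:
        raise TimedOut()
    return write()
```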

This patch doesn't solve all our problems related to timed out writes.
They can still sit and accumulate in various queues without expiring, a
prominent example being the smp queues. It is however a good first step
towards reducing wasted effort spent on them.

Refs: #5055
Ref #5251

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200129093007.550250-1-bdenes@scylladb.com>
2020-01-29 13:08:43 +02:00
Gleb Natapov
c654ffe34b commitlog: fix flushing an entry marked as "sync" in periodic mode
After 546556b71b we can have mixed writes into commitlog:
some flush immediately, some do not. If a non-flushing write races with
a flushing one and becomes responsible for writing back its buffer into a
file, the flush will be skipped, which will cause the assert in batch_cycle()
to trigger since the flush position will not be advanced. Fix that by checking
whether the flush was skipped and, in that case, explicitly flushing up to our
file position.

Fixes #5670

Message-Id: <20200128145103.GI26048@scylladb.com>
2020-01-29 12:58:25 +02:00
Piotr Sarna
93d8612a49 alternator-test: make test tables case sensitive
In order to test case sensitivity, test table names
now contain a capital letter.
2020-01-29 10:21:35 +01:00
Piotr Sarna
f8c1c82149 docs: add entries for alternator tags and arn
Support for tagging and arn was added already, so the documentation
is properly extended.
2020-01-29 10:20:05 +01:00
Piotr Sarna
668e15643d alternator: allow tagging on table creation
During table creation, it's now possible to provide a 'Tags' parameter,
which will add tags to a newly created table. Note that creating a table
and tagging it is not atomic, so in case of failure it's possible to end
up with a created table, but without appropriate tags.
This commit comes with a test.
Message-Id: <00c2e202e9075d2c61e4ee5ba322ff4d5dbe718c.1579618972.git.sarna@scylladb.com>
2020-01-29 10:20:05 +01:00
Piotr Sarna
4c9f2f3c0a alternator: implement tagging
The following requests are implemented:
 - TagResource
 - UntagResource
 - ListTagsOfResource

Also, more tests are added for validating inputs:
arns, tag values and tag keys.

Message-Id: <a7ce9534ca580736fea445813fafef75a6139e29.1579618972.git.sarna@scylladb.com>
2020-01-29 10:20:05 +01:00
Piotr Sarna
ea04b7fb04 alternator: add creating values from string views
An additional overload for rjson::from_string() is added for
a std::string_view type.
Message-Id: <3552ac3347b6a79dd22ca1215c831808450b1ef8.1579618972.git.sarna@scylladb.com>
2020-01-29 10:20:05 +01:00
Piotr Sarna
16688efad7 alternator,main: add tags schema extension
A schema extension is introduced for alternator - tags.
This schema extension can be used to store arbitrary tags for a table,
in the form of a map<text, text>.
Updating tags for a table is equivalent to the following CQL query:
ALTER TABLE table WITH scylla_tags = {'key1':'v1', 'key2':'v2'}

The extension, as all other extensions, is backed by the entry
in the system_schema.tables table.
2020-01-29 10:20:05 +01:00
Pavel Solodovnikov
f2feeb4b10 cql3: Propagate "const" to some virtual methods in cql hierarchy
Add "const" attributes to `assignment_testable::test_assignment`
and `term::raw::prepare` methods. These should have been marked as
"const" even before the change but for some reason were missing
these qualifiers.

Mark other supplementary methods with "const" attributes as
necessary.

Tests: unit(dev, debug)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200127213215.494000-1-pa.solodovnikov@scylladb.com>
2020-01-29 00:23:40 +02:00
Avi Kivity
3343baf159 Merge "cql3: time_uuid_fcts: validate time UUID" from Benny
"
Throw an error in case we hit an invalid time UUID
rather than hitting an assert.

Fixes #5552

(Ref #5588 that was dequeued and fixed here)

Test: UUID_test, cql_query_test(debug)
"

* 'validate-time-uuid' of https://github.com/bhalevy/scylla:
  cql3: abstract_function_selector: provide assignment_testable_source_context
  test: cql_query_test: add time uuid validation tests
  cql3: time_uuid_fcts: validate timestamp arg
  cql3: make_max_timeuuid_fct: delete outdated FIXME comment
  cql3: time_uuid_fcts: validate time UUID
  test: UUID_test: add tests for time uuid
  utils: UUID: create_time assert nanos_since validity
  utils/UUID_gen: make_nanos_since
  utils: UUID: assert UUID.is_timestamp
2020-01-29 00:11:17 +02:00
Avi Kivity
ec1687e4fe Merge "Remove deprecated partitioners #5636" from Piotr
"
This PR makes named_value respect allowed_values and then use it to transition away from old deprecated RandomPartitioner and ByteOrderedPartitioner. Then it removes the code that's no longer used.

We want to remove deprecated partitioners because, on one hand, they lead to performance problems and hot nodes. Moreover, we're planning to unify the token representation which would allow per table partitioner support. That, in turn, is a feature helpful in multiple efforts like CDC, materialized views, secondary indexes and multi-tenancy.

tests: unit(dev)
"

* 'remove_deprecated_partitioners' of https://github.com/haaawk/scylla:
  partitioners: remove random_partitioner
  partitioners: Make it impossible to use RandomPartitioner
  partitioners: remove byte_ordered_partitioner
  partitioners: Make it impossible to use ByteOrderedPartitioner
  partitioners: Remove leftovers of OrderPreservingPartitioner
  i_partitioner.cc: stop including byte_ordered_partitioner.hh
  i_partitioner.cc: stop including random_partitioner.hh
  config: use allowed_values to verify named_value input
  config: add operator<< for seed_provider_type
2020-01-29 00:11:17 +02:00
Avi Kivity
652d8a9b84 install-dependencies.sh: add lld
Since we now default to lld if present, and since lld is a faster
linker than either ld or gold, it makes sense to install it
as a dependency and to make it available as part of the frozen
toolchain.
2020-01-29 00:11:17 +02:00
Avi Kivity
17eaf552f0 Merge "Improve the accuracy of reader memory tracking" from Botond
"
Grab the lowest hanging fruits.

This patch-set makes three important changes:
* Consume the memory for I/O operations on tracked files, *before* they
  are forwarded to the underlying file.
* Track memory consumed by buffers created for parsing in
  `continuous_data_consumer`. As this is the basis for the data, index
  and promoted index parsers, all three are covered now in this regard.
* Track the index file.

The remaining, not-so-low hanging fruits in order of
gain/cost(performance) ratio:
* Track in-memory index lists.
* Track in-memory promoted index blocks.
* Track reader buffer memory.

Note that this ordering might change based on the workload and other
environmental factors.

Also included in this series is an infrastructure refactoring to make
tracking memory easier and involve including lighter headers, as well as
a manual test designed to allow testing and experimenting with the
effects of changes to the accuracy of the tracking of reader memory
consumption.

Refs: #4176
Refs: #2778

Tests: unit(dev), manual(sstable_scan_footprint_test)

The latter was run as:
build/dev/test/manual/sstable_scan_footprint_test -c1 -m2G --reads=4000
--read-concurrency=1 --logger-log-level test=trace --collect-stats
--stats-period-ms=20

This will trickle reads until the semaphore blocks, then wait until the
wait queue drains before sending new reads. This way we are not testing
the effectiveness of the pre-admission estimation (which is terribly
optimistic) and instead check that with slowly ramping up read load the
semaphore will block on memory preventing OOM.
This now runs to completion without a single `std::bad_alloc`. The read
concurrency semaphore allows between 15-30 reads, and is always blocked
on memory.
"

* 'more-accurate-reader-resource-tracking/v1' of ssh://github.com/denesb/scylla:
  test/manual/sstable_scan_footprint_test: improve memory consumption diagnostics
  tests/manual/sstable_scan_footprint_test: use the semaphore to determine read rate
  tests/manual: Add test measuring memory demand of concurrent sstable reads
  index_reader: make the index file tracked
  sstables/continuous_data_consumer: track buffers used for parsing
  reader_concurrency_semaphore: tracking_file_impl: consume memory speculatively
  reader_concurrency_semaphore: bye reader_resource_tracker
  treewide: replace reader_resource_tracer with reader_permit
  reader_permit: expose make_tracked_temporary_buffer()
  reader_permit: introduce make_tracked_file()
  reader_permit: introduce memory_units
  reader_concurrency_semaphore: mv reader_resources and reader_permit to reader_permit.hh
  reader_concurrency_semaphore: reader_permit: make it a value type
  reader_concurrency_semaphore: s/resources/reader_resources/
  reader_concurrency_semaphore::reader_permit: move methods out-of-line
2020-01-29 00:11:17 +02:00
Gleb Natapov
8dc37277df commitlog: remove unused variable
Message-Id: <20200128132118.GH26048@scylladb.com>
2020-01-29 00:11:17 +02:00
Eliran Sinvani
57f90e34ea alternator: run alternator processing loop in the statement scheduling group
In Scylla all query processing activity should run under the
"statement" scheduling group. The scheduling group is
important for maintaining the balance between background and
foreground tasks in Scylla.

Testing: To test the correctness of the patch, the following assert
was first inserted before every call to one of the executor functions
in the http route:
assert(current_scheduling_group().name() == "statement");
Then all alternator tests ran and passed.
The second stage was to change the name so that the assert fails:
assert(current_scheduling_group().name() == "no-statement");
The tests were then run again, validating that Scylla coredumps.
The asserts were then removed.

Fixes #5008

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <20200127154341.10020-1-eliransin@scylladb.com>
2020-01-29 00:11:17 +02:00
Avi Kivity
e09ed81c23 Merge "Fix two corner cases in snapshots API" from Pavel
"
There seem to be two problems with handling snapshot API -- one
on start and the other one on stop. Here's the set that addresses
both.

The fix moves the snapshot API registration later in time, which
required Amnon's ACK. Now we have it :) so -- the rebase and resend.

Tests: unit(dev), start-stop
"

* 'br-snapshot-bugs-2' of https://github.com/xemul/scylla:
  snapshot: Pass requests through gate
  api: Register snapshot API later
  api: Unwrap wrap_ks_cf
2020-01-29 00:11:17 +02:00
Avi Kivity
c0f412617e Merge "Make the scylla build deterministic" from Rafael
"
With these changes and a binutils compiled with
--enable-deterministic-archives, the only differences I get in the
build directory if I build scylla twice from scratch are:

* The various CMakeError.log because they have temporary file names.
* The various CMakeOutput.log for the same reason.
* .ninja_log and .ninja_deps. I am not sure what the contents are.
"

* 'espindola/fix-determinism' of https://github.com/espindola/scylla:
  build: remove timestamps from then antlr output
  build: Make the output of idl-compiler deterministic
2020-01-28 18:16:06 +02:00
Rafael Ávila de Espíndola
0e8bee0774 configure: Use lld if available
This depends on the patch

mk: avoid combining -r and -export-dynamic linker options

being added to dpdk.

I benchmarked this on top of my patches to get a reproducible build. I
first compiled with ccache, deleted the build directory and recompiled
so that all the "gcc -c" invocations were served by ccache. The times
of the second "ninja release" invocations were:

lld:
ninja release  155.68s user 71.89s system 2077% cpu 10.953 total

gold:
ninja release  953.79s user 254.71s system 2533% cpu 47.699 total

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200127171516.26268-1-espindola@scylladb.com>
2020-01-28 18:15:50 +02:00
Avi Kivity
7440125cb1 Update seastar submodule
> memory: add scoped_heap_profiling
  > build: add switch to enable heap profiling support
  > io_tester: do not abort on end of test
  > resource: clean up cgroups version determination.
  > prometheus: Silence a bogus gcc warning in http server
  > Update dpdk submodule
  > resource: Support cgroups v2
  > net: Don't use variable length arrays
  > core/memory.hh: document set_heap_profiling_enabled()
  > Revert "net: Don't use variable length arrays"
  > cmake: fix pkgconfig boost deps
  > thread: Avoid confusing comment by switching value
  > net: posix-stack: fix allocator in ap listening sockets
  > net: posix-stack: fix passing allocator to new sockets
  > stall_detector: Add a counter for stall detector report
  > Merge "Don't use variable length arrays" from Rafael
  > treewide: fix minor issues reported by clang
  > thread: Call mprotect in make_stack
  > thread: Always allocate stack with aligned_alloc
  > build: Make SEASTAR_THREAD_STACK_GUARDS private
  > thread: Move code out of a header
2020-01-28 18:15:18 +02:00
Nadav Har'El
b06b34478e merge: lwt: add lightweight transaction unit tests
Merged patch series from Konstantin Osipov:

This series sets cql_repl core count to 1 and adds LWT
unit tests.

  test.py: invoke cql_repl with smp=1
  lwt: add lightweight transactions unit tests
2020-01-28 12:39:23 +02:00
Nadav Har'El
30283f2544 merge: Alternator: return api_error instead of throwing
Merged patch series from Piotr Sarna:

In order to minimize the usage of throws and catches in code paths
that are potentially hot, these paths instead return appropriate errors
directly.

The server layer is still able to catch and translate errors,
but the preferred way is to return api_error directly in places
that may be performance-sensitive.

Tests: alternator-test(local)
Fixes #5472

Piotr Sarna (3):
  alternator: change request return type to variant<value, error>
  alternator: elide throwing in condition checks
  alternator: replace top-level throws with returns in executor

 alternator/executor.hh |  28 ++++----
 alternator/server.hh   |   4 +-
 alternator/executor.cc | 141 +++++++++++++++++++++--------------------
 alternator/server.cc   |  44 ++++++++-----
 4 files changed, 117 insertions(+), 100 deletions(-)
2020-01-28 12:39:23 +02:00
Konstantin Osipov
98c34ae750 test.py: always build cql_repl, do not strip
Exclude cql_repl from the list of tests, since it's not a test.
Build it as a separate app. Do not strip, so that any CQL test
failure is easy to debug without a rebuild.

All test-related targets are converted from lists to sets to avoid
quadratic lookup cost in the check inside the loop which creates the
ninja file.
2020-01-28 12:39:23 +02:00
Piotr Sarna
a81640d402 alternator: replace top-level throws with returns in executor
In order to elide unnecessary throwing, all errors previously thrown
from top-level executor methods (the ones that handle user requests)
are now returned directly.
Message-Id: <73e05d1057ee842576fae11be9d77265ffb2e96f.1579515640.git.sarna@scylladb.com>
2020-01-28 12:39:23 +02:00
Takuya ASADA
f21123b3ae scylla_io_setup: Improve error message for unsupported EC2 instance types (#5561)
Currently --ami does not check the instance type, so it creates an
invalid io_properties.yaml on unsupported instance types.

This does not actually occur on AMI startup, since scylla_ami_setup only
invokes scylla_io_setup --ami when the instance is supported, but it
still happens when scylla_io_setup is run manually.

It's better to check instance type on scylla_io_setup, too.

Refs #5438
2020-01-28 12:39:23 +02:00
Piotr Sarna
854adf5b70 alternator: elide throwing in condition checks
Conditional updates inform the user that the condition is not met
by returning an error. An initial implementation was based on rethrowing
these errors, but returning them directly is considered better
for performance.
2020-01-28 12:39:23 +02:00
Gleb Natapov
0d0c05a569 lwt: allow only one paxos instance to run for each key simultaneously
This will prevent contention in case of parallel updates of the same row
by the same coordinator. The patch does it by introducing a new per-key
lock map and taking the lock before running the Paxos protocol (either
for write or for read).
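
The exclusion described above can be sketched in a few lines. This is only
an illustration under stated assumptions: `key_lock_map`, `guard` and
`try_lock` are hypothetical names, keys are plain strings, and a contender
is simply refused, whereas the real patch presumably makes it wait for the
lock asynchronously.

```cpp
#include <optional>
#include <string>
#include <unordered_set>
#include <utility>

// Hypothetical sketch: remembers which keys have a Paxos round in flight
// on this coordinator.
class key_lock_map {
    std::unordered_set<std::string> _locked;
public:
    // RAII guard: the key stays locked for as long as the guard is alive.
    class guard {
        key_lock_map* _map;
        std::string _key;
    public:
        guard(key_lock_map& map, std::string key)
            : _map(&map), _key(std::move(key)) {}
        guard(guard&& o) noexcept : _map(o._map), _key(std::move(o._key)) {
            o._map = nullptr;  // the moved-from guard no longer owns the lock
        }
        guard(const guard&) = delete;
        guard& operator=(const guard&) = delete;
        guard& operator=(guard&&) = delete;
        ~guard() {
            if (_map) {
                _map->_locked.erase(_key);
            }
        }
    };

    // Returns a guard if the key was free, std::nullopt if a round for the
    // same key is already running.
    std::optional<guard> try_lock(const std::string& key) {
        if (!_locked.insert(key).second) {
            return std::nullopt;
        }
        return guard(*this, key);
    }

    bool is_locked(const std::string& key) const {
        return _locked.count(key) != 0;
    }
};
```

Tying the release to the guard's destructor means an early return or an
exception in the protected code cannot leave a key locked forever.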

Message-Id: <20200117101228.GA14816@scylladb.com>
2020-01-28 12:39:23 +02:00
Piotr Sarna
a6a65abc3c alternator: change request return type to variant<value, error>
In order to minimize the use of exceptions during normal operations,
each request handler is now able to return either a proper JSON value,
or an instance of api_error, which indicates that something went wrong,
but without having to throw, catch and rethrow C++ exceptions.
This is especially important for conditional updates, since it's
expected to be common to return ConditionalCheckFailedException.
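
A minimal sketch of the idea, assuming hypothetical stand-ins for the real
types (`json_value`, the fields of `api_error`, `request_return_type`,
`conditional_put` and `render_reply` are all illustrative names, not the
actual Alternator code):

```cpp
#include <string>
#include <variant>

// Hypothetical stand-ins for the real Alternator types.
struct json_value {
    std::string body;
};

struct api_error {
    std::string type;
    std::string message;
};

// A handler returns either a result or an error, without throwing.
using request_return_type = std::variant<json_value, api_error>;

// Illustrative conditional update: a failed condition is an expected
// outcome, so it is reported as a value rather than as an exception.
request_return_type conditional_put(int current, int expected, int new_value) {
    if (current != expected) {
        return api_error{"ConditionalCheckFailedException", "condition not met"};
    }
    return json_value{"{\"value\": " + std::to_string(new_value) + "}"};
}

// The server layer inspects the variant and picks the reply.
std::string render_reply(const request_return_type& r) {
    if (const api_error* err = std::get_if<api_error>(&r)) {
        return "400 " + err->type + ": " + err->message;
    }
    return "200 " + std::get<json_value>(r).body;
}
```

The error path here costs a normal return and a branch, with no stack
unwinding involved.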
Message-Id: <d8996a0a270eb0d9db8fdcfb7046930b96781e69.1579515640.git.sarna@scylladb.com>
2020-01-28 12:39:23 +02:00
Avi Kivity
897320f6ab tools: toolchain: dbuild: relax process limit in container
Docker restricts the number of processes in a container to some
limit it calculates. This limit turns out to be too low on large
machines, since we run multiple links in parallel, and each link
runs many threads.

Remove the limit by specifying --pids-limit -1. Since dbuild is
meant to provide a build environment, not a security barrier,
this is okay (the container is still restricted by host limits).

I checked that --pids-limit is supported by old versions of
docker and by podman.

Fixes #5651.
Message-Id: <20200127090807.3528561-1-avi@scylladb.com>
2020-01-28 12:39:23 +02:00
Avi Kivity
c7e0be75a5 Merge "Metrics for full scan" from Alejo
"
Final set of changes for full scan metrics.

    - allow filtering
    - full scan (Note: non-system tables only)
    - full scan without BYPASS CACHE option
    - tests for all metrics (bypass cache, allow filtering, full scan)
    - works with prepared statements (tested, too)
"

* 'as_full_scan_metrics' of https://github.com/alecco/scylla:
  Range scan query counter
  Counter of queries doing full scan.
  ALLOW FILTERING query counter
2020-01-28 12:39:23 +02:00
Botond Dénes
e4616f92fe test/manual/sstable_scan_footprint_test: improve memory consumption diagnostics
This test is all about tracking measured memory consumption vs. real
memory consumption. To make this easier add additional diagnostics:
* enable seastar heap profiler for the duration of the reads (seastar
  has to be compiled with `-DSEASTAR_HEAPPROF`).
* Add a stats collector, which periodically collects stats such as
  non-LSA free/used memory, LSA free/used memory and memory tracked by
  the reader concurrency semaphore. These stats are written to a `.csv`
  file, allowing importing them into a spreadsheet and processing them.
2020-01-28 10:15:55 +02:00
Botond Dénes
9e9c59d125 tests/manual/sstable_scan_footprint_test: use the semaphore to determine read rate
Currently the test fires the configured amount of reads at once. This is
somewhat restricting in the number of testable scenarios. For example,
it doesn't allow one to see if the semaphore correctly tracks the memory
consumption of existing reads, by firing new reads after a while.

Replace this algorithm by one which fires reads with a configured
concurrency, then waits for the semaphore's queue (if any) to drain,
before firing new reads. The test can now be configured with the total
amount of reads to fire, and with the read-concurrency, i.e. the number
of reads to fire at once in each iteration.

This allows for much greater flexibility in the different test
scenarios. The previous behaviour can still be achieved by configuring
a concurrency of 100.

This patch also adds better error handling. Reads are aborted on the
first error and errors are caught and not allowed to bubble up past the
test's main function and are logged instead.

Extensive logging is also added to be able to monitor the system while
the test is running.
2020-01-28 10:15:53 +02:00
Tomasz Grabiec
2eb88024c0 tests/manual: Add test measuring memory demand of concurrent sstable reads
Allow manual experimentation with how accurately the resource
consumption of readers is tracked, and hence with the system's ability
to prevent overload and the dreaded `std::bad_alloc`.

This patch was originally developed by
Tomasz Grabiec <tgrabiec@scylladb.com>; I only adapted it to compile and
link on current master.
2020-01-28 08:13:16 +02:00
Botond Dénes
dfc66194c8 index_reader: make the index file tracked
Track I/O going to the index file, similarly to how we already track I/O
going to the data file.
2020-01-28 08:13:16 +02:00
Botond Dénes
936619a8d3 sstables/continuous_data_consumer: track buffers used for parsing
Based on heap profiling, buffers used for storing half-parsed fields are
a major contributor to the overall memory consumption of reads. This
memory was completely "under the radar" before. Track it by using
tracked `temporary_buffer` instances everywhere in
`continuous_data_consumer`. As `continuous_data_consumer` is the basis
for parsing all index and data files, adding the tracking here
automatically covers all data, index and promoted index parsing.

I'm almost convinced that there is a better place to store the `permit`
than the three places now, but so far I was unable to completely
decipher our data/index file parsing class hierarchy.
2020-01-28 08:13:16 +02:00
Botond Dénes
92fffe51d5 reader_concurrency_semaphore: tracking_file_impl: consume memory speculatively
Consume the memory before even submitting the I/O to the underlying
`file` object. This is in line with the underlying `file` object
allocating the buffer before it forwards the I/O request to the kernel.
This extends the "visibility" over the memory consumed by I/O greatly,
as it turns out buffers spend most time alive waiting for the I/O to
complete and are parsed shortly afterwards.
2020-01-28 08:13:16 +02:00
Botond Dénes
4bb3c7b1f0 reader_concurrency_semaphore: bye reader_resource_tracker
Replaced by `reader_permit`, of which it was a mere wrapper in the
first place.
2020-01-28 08:13:16 +02:00
Botond Dénes
dfc8b2fc45 treewide: replace reader_resource_tracer with reader_permit
The former was never really more than a reader_permit with one
additional method. Currently using it doesn't even save one from any
includes. Now that readers will be using reader_permit we would have to
pass down both to mutation_source. Instead get rid of
reader_resource_tracker and just use reader_permit. Instead of making it
a last and optional parameter that is easy to ignore, make it a
first class parameter, right after schema, to signify that permits are
now a prominent part of the reader API.

This -- mostly mechanical -- patch essentially refactors mutation_source
to ask for the reader_permit instead of reader_resource_tracker and
updates all usage sites.
2020-01-28 08:13:16 +02:00
Botond Dénes
dea24ca859 reader_permit: expose make_tracked_temporary_buffer()
Previously `tracking_file_impl::make_tracked_buf()`. In the next patches
we plan on using this outside `tracking_file_impl`, so make it public
and templatize on the char type.
2020-01-28 08:13:16 +02:00
Botond Dénes
16cea36a94 reader_permit: introduce make_tracked_file()
Free function equivalent of `reader_resource_tracker::track_file()`,
using a `reader_permit` directly.
2020-01-28 08:13:16 +02:00
Botond Dénes
1859a03629 reader_permit: introduce memory_units
Similar to `seastar::semaphore_units`, this allows consuming and
releasing memory via an RAII object. In addition to that, it also allows
tracking changing values. This feature was designed to be used for
tracking the ever changing memory consumption of the buffers of
`flat_mutation_reader`:s.
This is now the only supported way of consuming memory from a permit.
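
As a rough illustration of such an RAII object, here is a sketch that is
not the actual Scylla class: `memory_budget` is a hypothetical stand-in
for the semaphore's memory accounting, and `reset()` is an assumed name
for the operation that follows a changing value.

```cpp
#include <cstddef>
#include <utility>

// Hypothetical stand-in for the semaphore's memory accounting. The balance
// is signed because consumption is allowed to overshoot the budget.
class memory_budget {
    std::ptrdiff_t _available;
public:
    explicit memory_budget(std::ptrdiff_t total) : _available(total) {}
    void consume(std::ptrdiff_t n) { _available -= n; }
    void release(std::ptrdiff_t n) { _available += n; }
    std::ptrdiff_t available() const { return _available; }
};

// Consumes memory on construction and releases it on destruction.
class memory_units {
    memory_budget* _budget;
    std::ptrdiff_t _held;
public:
    memory_units(memory_budget& budget, std::ptrdiff_t n)
        : _budget(&budget), _held(n) {
        _budget->consume(_held);
    }
    memory_units(memory_units&& o) noexcept
        : _budget(o._budget), _held(std::exchange(o._held, 0)) {}
    memory_units(const memory_units&) = delete;
    memory_units& operator=(const memory_units&) = delete;
    memory_units& operator=(memory_units&&) = delete;
    ~memory_units() { reset(0); }

    // Track a value that changes over time, e.g. the current size of a
    // reader's buffer: take the new amount, then give back the old one.
    void reset(std::ptrdiff_t n) {
        _budget->consume(n);
        _budget->release(_held);
        _held = n;
    }

    std::ptrdiff_t held() const { return _held; }
};
```

A buffer owner would keep one `memory_units` next to the buffer and call
`reset()` with the new size whenever the buffer grows or shrinks.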
2020-01-28 08:13:16 +02:00
Botond Dénes
c0f96db2d9 reader_concurrency_semaphore: mv reader_resources and reader_permit to reader_permit.hh
In the next patches we will replace `reader_resource_tracker` and have
code use the `reader_permit` directly. In subsequent patches, the
`reader_permit` will get even more usages as we attempt to make the
tracking of reader resource more accurate by tracking more parts of it.
So the grand plan is that the current `reader_concurrency_semaphore.hh`
is split into two headers:
* `reader_concurrency_semaphore.hh` - containing the semaphore proper.
* `reader_permit.hh` - a very lightweight header, to be used by
  components which only want to track various parts of the resource
  consumption of reads.
2020-01-28 08:13:16 +02:00
Botond Dénes
2005495857 reader_concurrency_semaphore: reader_permit: make it a value type
Currently `reader_permit` is passed around as
`lw_shared_ptr<reader_permit>`, which is clunky to write and use and is
also an unnecessary leak of details on how permit ownership is managed.
Make `reader_permit` a simple value type, making it a little bit easier
and safer to use.
In the next patches we will get rid of `reader_resource_tracker` and
instead have code use the permit instance directly, so this small
improvement in usability will go a long way towards preventing eye sore.
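
The usability difference can be sketched as follows. `permit_handle` and
`read_block` are hypothetical names, and `std::shared_ptr` stands in for
whatever ownership mechanism the real class hides:

```cpp
#include <cstddef>
#include <memory>

// Hypothetical value-type handle: cheap to copy, with the shared ownership
// of the underlying state kept as an implementation detail.
class permit_handle {
    struct state {
        std::size_t consumed_memory = 0;
    };
    std::shared_ptr<state> _state = std::make_shared<state>();
public:
    void consume(std::size_t n) { _state->consumed_memory += n; }
    void release(std::size_t n) { _state->consumed_memory -= n; }
    std::size_t consumed() const { return _state->consumed_memory; }
};

// Callers take the permit by value; no smart pointer appears in the
// signature and there is no null state to check for.
void read_block(permit_handle permit, std::size_t bytes) {
    permit.consume(bytes);
}
```

Copies of the handle refer to the same accounting state, so passing it by
value down a call chain does not split the bookkeeping.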
2020-01-28 08:13:16 +02:00
Botond Dénes
932bc02730 reader_concurrency_semaphore: s/resources/reader_resources/
In preparation of making it a top-level class and moving it to another
file.
2020-01-28 08:13:16 +02:00
Botond Dénes
89c5fd0c25 reader_concurrency_semaphore::reader_permit: move methods out-of-line
In preparation for making the reader_permit a top-level class, and
moving it to another file. It is also good practice to define
non-performance critical methods out-of-line to reduce header bloat.
2020-01-28 08:13:16 +02:00
Konstantin Osipov
511ae023f0 lwt: add lightweight transactions unit tests
These unit tests cover all CQL aspects of lightweight transactions,
such as grammar, null semantics, batch semantics, result set
format, and so on.

For now, comment out unicode tests: test output depends
on libjsoncpp version in use.
2020-01-27 23:09:57 +03:00
Konstantin Osipov
fef50b66a2 test.py: invoke cql_repl with smp=1
Since bounce_to_shard is not handled by cql_repl, invoke it with
smp=1 until it is fixed.
2020-01-27 22:57:10 +03:00
Pavel Emelyanov
976463f620 snapshot: Pass requests through gate
When the scylla process is stopped no code waits for
current snapshot operations to finish. Also, the API
server is not stopped either, so new snapshot requests
can creep in.

In seastar there's a useful abstraction to address both.
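
Judging by the patch title, the abstraction is a gate. The sketch below is
a synchronous, simplified model of the concept only; `simple_gate` and
`run_through_gate` are hypothetical names, and seastar's own gate is
future-based, so this is not its API.

```cpp
#include <cstddef>
#include <stdexcept>
#include <utility>

// Hypothetical, synchronous model of a gate: requests enter while it is
// open, and once it is closed no new request gets in.
class simple_gate {
    std::size_t _in_flight = 0;
    bool _closed = false;
public:
    void enter() {
        if (_closed) {
            throw std::runtime_error("gate closed");
        }
        ++_in_flight;
    }
    void leave() { --_in_flight; }
    void close() { _closed = true; }
    std::size_t in_flight() const { return _in_flight; }
    // Shutdown may proceed once the gate is closed and empty.
    bool drained() const { return _closed && _in_flight == 0; }
};

// Runs func inside the gate and leaves the gate even if func throws.
template <typename Func>
auto run_through_gate(simple_gate& gate, Func&& func) {
    gate.enter();
    struct leaver {
        simple_gate& g;
        ~leaver() { g.leave(); }
    } on_exit{gate};
    return std::forward<Func>(func)();
}
```

Closing the gate covers both problems at once: requests already running
are counted and can be waited for, and late requests are rejected.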

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-01-27 17:42:04 +03:00
Pavel Emelyanov
fd6b5efe75 api: Register snapshot API later
In storage_service's snapshot code there are checks for
_operation_mode being _not_ JOINING to proceed. The intention
is apparently to allow for snapshots only after the cluster
join. However, here's what the start-up code looks like:

1. _operation_mode = STARTING in storage_service::constructor
2. snapshot API registered in api::set_server_storage_service
3. _operation_mode = JOINING in storage_service::join_token_ring

So in between steps 2 and 3 snapshots can be taken.

Although there's a quick and simple fix for that (check for the
_operation_mode to be not STARTING either) I think it's better
to register the snapshot API later instead. This will help
greatly to de-bloat the storage_service, in particular -- to
encapsulate the _operation_mode properly.

Note that although the check for _operation_mode is made only for
taking a snapshot, I move all snapshot ops registration to the
later phase.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-01-27 17:42:04 +03:00
Pavel Emelyanov
4886c1db74 api: Unwrap wrap_ks_cf
This is preparation for the next patch -- the lambda in
question (and the used type) will be needed in two
functions, so make the lambda a "real" function.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-01-27 17:42:04 +03:00
Benny Halevy
10c912d3db cql3: abstract_function_selector: provide assignment_testable_source_context
Return function name.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-01-27 11:09:01 +02:00
Benny Halevy
35e9538d49 test: cql_query_test: add time uuid validation tests
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-01-27 11:09:01 +02:00
Benny Halevy
1078c86af9 cql3: time_uuid_fcts: validate timestamp arg
Make sure that the timestamp argument does not overflow
60 bits when converted to units of 100 nanos since
epoch. This can happen, for example, with writetime(), which
returns microseconds since epoch, in contrast to other time
functions such as unixtimestampof, which return millis since epoch.
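
The range check can be sketched as below. The function and constant names
are hypothetical; the sketch assumes the argument is in milliseconds since
the Unix epoch and that the 60-bit field counts 100 ns units since the
UUID epoch of 1582-10-15, as in version 1 UUIDs.

```cpp
#include <cstdint>
#include <optional>

// The timestamp field of a time UUID is 60 bits wide.
constexpr std::uint64_t max_timestamp_units = (std::uint64_t(1) << 60) - 1;
// 100 ns units per millisecond.
constexpr std::int64_t units_per_milli = 10000;
// Milliseconds from the UUID epoch (1582-10-15) to the Unix epoch.
constexpr std::int64_t uuid_epoch_offset_millis = 12219292800000LL;

// Converts milliseconds since the Unix epoch to 100 ns units since the
// UUID epoch, or returns std::nullopt if the result would not fit in 60
// bits. The range is checked before multiplying, so the arithmetic itself
// cannot overflow.
std::optional<std::uint64_t> timeuuid_units_from_millis(std::int64_t millis) {
    constexpr std::int64_t max_millis =
        static_cast<std::int64_t>(max_timestamp_units / units_per_milli)
        - uuid_epoch_offset_millis;
    if (millis < -uuid_epoch_offset_millis || millis > max_millis) {
        return std::nullopt;
    }
    return static_cast<std::uint64_t>(millis + uuid_epoch_offset_millis)
        * static_cast<std::uint64_t>(units_per_milli);
}
```

A microsecond value passed where milliseconds are expected is about a
thousand times too large; for present-day timestamps that lands outside
the accepted range and can be rejected with an error.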

Fixes #5552

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-01-27 11:09:01 +02:00
Benny Halevy
fa0fa53bd3 cql3: make_max_timeuuid_fct: delete outdated FIXME comment
Done in 86c09046fd

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-01-27 11:09:01 +02:00
Benny Halevy
72e2ea47c1 cql3: time_uuid_fcts: validate time UUID
Throw an error in case we hit an invalid time UUID
rather than hitting an assert.

Ref #5552

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-01-27 11:09:01 +02:00
Benny Halevy
00bd1d32d3 test: UUID_test: add tests for time uuid
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-01-27 11:09:01 +02:00
Benny Halevy
f8b079b599 utils: UUID: create_time assert nanos_since validity
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-01-27 11:09:01 +02:00
Benny Halevy
cd3460cc88 utils/UUID_gen: make_nanos_since
Safely convert millis to "nanos_since" (number of 100
nanoseconds since START_EPOCH) while type casting to uint64_t
to avoid possible int overflow.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-01-27 11:08:16 +02:00
Benny Halevy
22bac26023 utils: UUID: assert UUID.is_timestamp
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-01-26 18:54:36 +02:00
Avi Kivity
cc0222ec2d Merge "Futurize get_changed_ranges_for_leaving" from Asias
"
Futurize get_changed_ranges_for_leaving to fix stalls like:

   2019-12-17T15:18:33+00:00 ip-10-0-116-62 !INFO | scylla: Reactor stalled
   for 4609 ms on shard 0.

   0x0000000002accbd2
   0x0000000002a4579b
   0x0000000002a45cc2
   0x0000000002a45ff7
   0x00007ff0a609be7f
   0x0000000001b0b500
   0x0000000001b03185
   0x0000000001af0d41
   0x0000000001af027a
   0x0000000001f7e89a
   0x0000000001f9f55a
   0x0000000001fc9c09
   0x0000000001fcac08
   0x00000000007dfee3

   /jenkins/slave/workspace/scylla-3.2/build/scylla/seastar/src/core/reactor.cc:1041
   (inlined by) seastar::reactor::block_notifier(int) at
   /jenkins/slave/workspace/scylla-3.2/build/scylla/seastar/src/core/reactor.cc:1164
   ?? ??:0 __gnu_cxx::__normal_iterator<dht::token const*,
   std::vector<dht::token, std::allocator<dht::token> > >
   std::__lower_bound<__gnu_cxx::__normal_iterator<dht::token const*,
   std::vector<dht::token, std::allocator<dht::token> > >, dht::token,
   __gnu_cxx::__ops::_Iter_less_val>(__gnu_cxx::__normal_iterator<dht::token
   const*, std::vector<dht::token, std::allocator<dht::token> > >,
   __gnu_cxx::__normal_iterator<dht::token const*, std::vector<dht::token,
   std::allocator<dht::token> > >, dht::token const&,
   __gnu_cxx::__ops::_Iter_less_val) at crtstuff.c:?
   locator::token_metadata::first_token_index(dht::token const&) const at
   crtstuff.c:? locator::token_metadata::ring_range(dht::token const&, bool) const at crtstuff.c:?
   locator::simple_strategy::calculate_natural_endpoints(dht::token const&,
   locator::token_metadata&) const at crtstuff.c:?
   service::storage_service::get_changed_ranges_for_leaving(seastar::basic_sstring<char,
   unsigned int, 15u, true>, gms::inet_address) at crtstuff.c:?
   service::storage_service::unbootstrap() at crtstuff.c:?
   service::storage_service::decommission()::{lambda(service::storage_service&)#1}::operator()(service::storage_service&)
   const::{lambda()#1}::operator()() const [clone .isra.0] at
   storage_service.cc:?

Refs: #5495
"

* 'futurize_get_changed_ranges_for_leaving' of https://github.com/asias/scylla:
  storage_service: Yield in get_changed_ranges_for_leaving
  storage_service: Make get_changed_ranges_for_leaving run inside thread
2020-01-26 13:25:53 +02:00
Takuya ASADA
dd81fd3454 dist/debian: Use tilde for release candidate builds
We need to add '~' to handle rcX versions correctly on Debian variants
(merged at ae33e9f), but when we moved to the relocatable package we
mistakenly dropped the code, so add the code again.

Fixes #5641
2020-01-26 13:25:53 +02:00
Ivan Prisyazhnyy
4c001553eb dep/arch: better messages
Tested on Arch 5.4.2-arch1-1 and docker archlinux.

Signed-off-by: Ivan Prisyazhnyy <ivan@scylladb.com>
Message-Id: <20200125122836.460811-1-ivan@scylladb.com>
2020-01-26 12:02:32 +02:00
Ivan Prisyazhnyy
98a8c36c60 cmake: fix seastar and gen include dirs lookup
Signed-off-by: Ivan Prisyazhnyy <ivan@scylladb.com>
Message-Id: <20200125145926.545859-1-ivan@scylladb.com>
2020-01-26 12:02:32 +02:00
Dejan Mircevski
90b54c8c42 view_info: Drop partition_ranges()
The method view_info::partition_ranges() is unused.

Also drop the now-dead _partition_ranges data member.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-01-26 12:02:32 +02:00
Piotr Sarna
9fa88e26a9 Merge 'Alternator - LWT and ConditionExpression' from Nadav
This is a fourth iteration of the patch series adding LWT usage
(instead of the old naive - and wrong - read before write) to
Alternator, as well as full support for the ConditionExpression
syntax for conditional updates.

Changes in v4:

* Rebased to most recent master
* Replaced 3 booleans which had 2^3 = 8 theoretical combinations,
  by just 4 options in enum write_isolation:
        FORBID_RMW, LWT_ALWAYS, LWT_RMW_ONLY, UNSAFE_RMW
  The four options are described in detail in comments.
* Fix reversed assertion in FORBID_RMW case.
* Two new metrics: write_using_lwt and shard_bounce_for_lwt.
* Fail boot if alternator is enabled, but LWT isn't.
* Add information about enabling LWT in docs/alternator/alternator.md
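
To illustrate why one four-valued enum reads better than three booleans,
here is a sketch of a dispatch on it. Only the enumerator names come from
the series; `write_path`, `choose_write_path` and the behaviour attached
to each mode are assumptions based on reading the names, not the actual
implementation.

```cpp
// Enumerator names as listed in the cover letter.
enum class write_isolation { FORBID_RMW, LWT_ALWAYS, LWT_RMW_ONLY, UNSAFE_RMW };

// Hypothetical outcome of the decision, for illustration only.
enum class write_path { reject, lwt, plain_write, read_then_write };

// needs_read: whether the operation has to read the item before writing
// it (a read-modify-write, e.g. a conditional update).
// The mapping below is an assumed reading of the mode names.
write_path choose_write_path(write_isolation mode, bool needs_read) {
    switch (mode) {
    case write_isolation::FORBID_RMW:
        return needs_read ? write_path::reject : write_path::plain_write;
    case write_isolation::LWT_ALWAYS:
        return write_path::lwt;
    case write_isolation::LWT_RMW_ONLY:
        return needs_read ? write_path::lwt : write_path::plain_write;
    case write_isolation::UNSAFE_RMW:
        return needs_read ? write_path::read_then_write : write_path::plain_write;
    }
    return write_path::reject;  // not reached for valid enumerators
}
```

Every enumerator names one meaningful configuration, so the four
combinations of booleans that made no sense can no longer be expressed.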

* nyh/v4-lwt:
  alternator: add support for ConditionExpression
  alternator: reimplement read-modify-write operations using LWT
  alternator: make "executor" a peering_sharded_service
2020-01-26 12:02:32 +02:00
Alejo Sanchez
936cae6069 Range scan query counter
Fixes #5209

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-01-24 15:02:58 +01:00
Alejo Sanchez
f57513a809 Counter of queries doing full scan.
In scope of #5209

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-01-24 14:25:19 +01:00
Alejo Sanchez
dbe8a54768 ALLOW FILTERING query counter
Implements a counter of executions of SELECT queries with ALLOW FILTERING option.

In scope of #5209

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-01-24 13:38:30 +01:00
Piotr Jastrzebski
682dfdafe1 partitioners: remove random_partitioner
Previous patch makes it impossible to configure Scylla
with RandomPartitioner so this code is effectively dead now.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-01-24 09:09:13 +01:00
Piotr Jastrzebski
d80ac4c2d0 partitioners: Make it impossible to use RandomPartitioner
RandomPartitioner has been deprecated for 2.5 years.
Now we drop the support for it. There are two reasons for this.
First, this partitioner can lead to uneven distribution of partitions
among the nodes in the cluster which leads to hot nodes.
Second, we're planning to unify the representation of tokens and
fix it as int64_t. RandomPartitioner does not comply with this.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-01-24 09:09:13 +01:00
Piotr Jastrzebski
7a86e2ff46 partitioners: remove byte_ordered_partitioner
Previous patch makes it impossible to configure Scylla
with ByteOrderedPartitioner so this code is effectively dead now.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-01-24 09:09:13 +01:00
Piotr Jastrzebski
130eb91636 partitioners: Make it impossible to use ByteOrderedPartitioner
ByteOrderedPartitioner has been deprecated for 2.5 years.
Now we drop the support for it. There are two reasons for this.
First, this partitioner can lead to uneven distribution of partitions
among the nodes in the cluster which leads to hot nodes.
Second, we're planning to unify the representation of tokens and
fix it as int64_t. ByteOrderedPartitioner does not comply with this.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-01-24 09:09:13 +01:00
Piotr Jastrzebski
4088be2056 partitioners: Remove leftovers of OrderPreservingPartitioner
OrderPreservingPartitioner seems to be long gone and not supported
so remove all the places it's still mentioned.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-01-24 09:09:13 +01:00
Piotr Jastrzebski
1d345091f6 i_partitioner.cc: stop including byte_ordered_partitioner.hh
Nothing from that header is used in i_partitioner.cc.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-01-24 09:09:13 +01:00
Piotr Jastrzebski
44c9a71686 i_partitioner.cc: stop including random_partitioner.hh
Nothing from that header is used in i_partitioner.cc.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-01-24 09:09:13 +01:00
Piotr Jastrzebski
6a2cd64b5c config: use allowed_values to verify named_value input
Even though we configure the set of accepted values for
some config flags, named_value ignores them.

This patch implements the checks that verify that a flag is
not set to a value that's not on the list.
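
The kind of check meant here can be sketched as follows. `checked_option`
is a hypothetical, string-only stand-in and not the real named_value
template; an empty list is assumed to mean that the option is
unrestricted.

```cpp
#include <algorithm>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical string-valued option with an optional list of accepted
// values.
class checked_option {
    std::string _name;
    std::string _value;
    std::vector<std::string> _allowed;  // empty: any value is accepted
public:
    checked_option(std::string name, std::string default_value,
                   std::vector<std::string> allowed = {})
        : _name(std::move(name))
        , _value(std::move(default_value))
        , _allowed(std::move(allowed)) {}

    // Rejects values that are not on the list and keeps the old value.
    void set(const std::string& value) {
        const bool restricted = !_allowed.empty();
        if (restricted &&
            std::find(_allowed.begin(), _allowed.end(), value) == _allowed.end()) {
            throw std::invalid_argument(
                "invalid value '" + value + "' for option " + _name);
        }
        _value = value;
    }

    const std::string& get() const { return _value; }
};
```

With a check of this kind in place, removing a deprecated value from the
accepted list is enough to make it impossible to configure.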

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-01-24 09:08:59 +01:00
Nadav Har'El
b50274e8a7 alternator: add support for ConditionExpression
This patch adds support for the ConditionExpression parameter of the
item-writing operations in Alternator: PutItem, UpdateItem and DeleteItem.

We already supported conditional updates/put/delete using the "Expected"
parameter. The ConditionExpression parameter implemented here provides a
very similar feature, using a different - and also newer and more powerful -
syntax.

The implementation here reuses much of our existing expression-parsing
infrastructure. Unsurprisingly, ConditionExpression's syntax has much in
common with UpdateExpression (which we already support) and also many of the
comparison functions already implemented for "Expected". However, it's still
quite a bit of new code, because of the many different comparisons, functions,
and syntax variations we need to support.

This patch also expands alternator-test/test_condition_expression.py with
a few additional corner cases discovered during the development of this
patch.

Almost all of the tests for this feature (35 out of 39) now pass.
Two tests still fail because we don't yet support nested attributes (this
is a missing feature across Alternator), and two tests fail because of minor
idiosyncrasies in DynamoDB's error path that we chose not to duplicate
yet (but still remember the difference in the form of an xfailing test).

Fixes #5035

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2020-01-23 13:57:33 +02:00
Nadav Har'El
370b963ce5 alternator: reimplement read-modify-write operations using LWT
In this patch, we re-implement the three read-modify-write operations -
PutItem, UpdateItem, DeleteItem. All three operations may need to read the
item before writing it to support conditional updates (the "Expected"
parameter) and UpdateItem may also need the previous item's value for
its update expression (e.g., a user may ask to "set a=a+1" or "set a=b").

Before this patch, the implementation of RMW operations simply did a read,
and then a write - without any attempt to protect concurrent operations.

In this patch, Scylla's LWT mechanism (storage_proxy::cas()) is used
instead, to ensure that concurrent update operations are correctly
isolated even if they are conditional. This means that Alternator now
requires the experimental LWT feature to be enabled (and refuses to
boot if it isn't).
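
To show why a plain read followed by a write is not enough, here is a toy
sketch of a read-modify-write done as a compare-and-set retry loop. It is
only an analogy: storage_proxy::cas() runs a consensus protocol across
replicas, while this sketch uses an in-process lock, and the names Table,
cas() and increment() are made up for the example.

```python
import threading


class Table:
    """Toy key-value table offering an atomic compare-and-set."""

    def __init__(self):
        self._items = {}
        self._lock = threading.Lock()

    def read(self, key):
        with self._lock:
            return self._items.get(key)

    def cas(self, key, expected, new):
        # Apply the write only if the stored value still equals `expected`.
        with self._lock:
            if self._items.get(key) != expected:
                return False
            self._items[key] = new
            return True


def increment(table, key):
    # Read-modify-write as a retry loop: a lost race shows up as a failed
    # cas(), and the update is recomputed from a fresh read.
    while True:
        old = table.read(key)
        new = (old or 0) + 1
        if table.cas(key, old, new):
            return new


def run_concurrent_increments(threads=8, per_thread=200):
    table = Table()

    def worker():
        for _ in range(per_thread):
            increment(table, "a")

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return table.read("a")
```

With the conditional write, no increment is lost even when several workers
update the same key at once; a separate read and unconditional write would
give no such guarantee.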

The version presented here is configured to always use LWT for *every*
write, regardless of whether it has a condition or not. So it will
significantly slow down write-only workloads like YCSB. But the code
in this patch actually includes three other modes, which can be chosen by
setting an enum constant in the code. In the future we will want to let the
user configure this mode, globally, per table or per attribute.

Note that read requests are NOT modified, and work exactly as they did
before: i.e., strongly-consistent reads are done using a normal
CL=LOCAL_QUORUM read - not via LWT. I believe this is good enough given
Dynamo's guarantees, and critical for our read performance.

Also note that this patch doesn't yet fix the BatchWriteItem operation.
Although BatchWriteItem does not support any RMW operations - just pure
writes - we may still need to do those pure writes using LWT. This
should be fixed in a follow-up patch.

Unfortunately, this patch involves a large amount of code movement and
reorganization, because:
1. The cas operation requires each operation to be made into an object,
   with a separate apply() function, forcing a lot of code to move.
2. Moreover, we need to do this for three different operations (PutItem,
   UpdateItem, DeleteItem) so to avoid massive code duplication, I had
   to move some common code.
3. The cas operation also forced us to change some of the utility functions'
   APIs.

The end result is that this patch focuses more on a compact and
understandable *end result* than it does on an easy to understand *patch*,
so reviewers - sorry about that.

All alternator-test/ tests pass with this patch (and also with all of the
different optional modes enabled). However, other than that, I did not yet
do any real isolation tests (are concurrent operations really isolated
correctly? or is LWT just faking it? :-) ), performance tests or stress
tests - and I'll definitely need to do those as well.

Fixes #5054

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2020-01-23 13:57:28 +02:00
Nadav Har'El
7dfd081e0d alternator: make "executor" a peering_sharded_service
Alternator uses a sharded<executor> for handling execution of Alternator
requests on different shards. In this patch we make executor a subclass of
peering_sharded_service, to allow one of these executors to run an executor
method on a different shard: Any one of the shard-local executor instances
can call container() to get the full sharded<executor>.

We will need this capability later, when we need to bounce requests between
shards because of requirements of the storage_proxy::cas (LWT) code.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2020-01-23 13:57:23 +02:00
Benny Halevy
5b0ea4c114 storage_service: drain_on_shutdown: unregister storage_proxy subscribers from local_storage_service
Match subscription done in main() and avoid cross shard access
to _lifecycle_subscribers vector.

Fixes #5385

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Acked-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200123092817.454271-1-bhalevy@scylladb.com>
2020-01-23 11:38:23 +02:00
Piotr Jastrzebski
df1b7d2805 config: add operator<< for seed_provider_type
Following patch will start checking allowed_values
in named_value and print errors for wrong values.
This will require all the types used with named_value
to have operator<< implemented. seed_provider_type
is one such type.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-01-23 10:28:58 +01:00
Rafael Ávila de Espíndola
6058fe8007 build: remove timestamps from the antlr output
The output of antlr always has the timestamp of when it was
created. This expands the existing sed hack to remove that too.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-01-22 16:29:54 -08:00
Rafael Ávila de Espíndola
72e900291b build: Make the output of idl-compiler deterministic
If at any point during the topological sort we had more than one node
with zero dependencies, the order they were printed was not
deterministic.
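
One common way to make such a sort deterministic is to keep the set of
ready nodes ordered, for example by name. The sketch below assumes that
approach; it is not taken from idl-compiler itself, and the function name
deterministic_toposort and the shape of `deps` are illustrative.

```python
import heapq


def deterministic_toposort(deps):
    """Return nodes in dependency order; ties are broken by name.

    `deps` maps each node to the set of nodes it depends on. Every
    dependency must itself be a key of `deps`.
    """
    # Count unmet dependencies and record reverse edges.
    pending = {node: len(d) for node, d in deps.items()}
    dependents = {node: [] for node in deps}
    for node, d in deps.items():
        for dep in d:
            dependents[dep].append(node)

    # A min-heap keeps the ready set sorted, so the choice among several
    # zero-dependency nodes never depends on set or dict iteration order.
    ready = [n for n, count in pending.items() if count == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        node = heapq.heappop(ready)
        order.append(node)
        for child in dependents[node]:
            pending[child] -= 1
            if pending[child] == 0:
                heapq.heappush(ready, child)
    if len(order) != len(deps):
        raise ValueError("dependency cycle")
    return order
```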

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-01-22 16:28:00 -08:00
Avi Kivity
46951f8b1a Merge "Refactor migration_notifier listeners and gossip subscribers" from Rafael
"
This series refactors the code used by migration_notifier and gossiper
into an atomic_vector type.
"

* 'espindola/gossiper_atomic_vector' of https://github.com/espindola/scylla:
  gossiper: Store subscribers in an atomic_vector
  load_broadcaster: Unregister from load_broadcaster::stop_broadcasting
  repair: add row_level::stop()
  locator: Return future from i_endpoint_snitch::reload_gossiper_state
  service: Refactor code into a atomic_vector class
  migration_manager: Fix typo
  load_meter: Use a shared_ptr to store a load_broadcaster
2020-01-22 18:58:15 +02:00
Rafael Ávila de Espíndola
845116dfaf gossiper: Store subscribers in an atomic_vector
The new guarantees are a bit better IMHO:

Once a subscriber is removed, it is never notified. This was not true
in the old code since it would iterate over a copy that would still
have that subscriber.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-01-22 08:16:03 -08:00
Rafael Ávila de Espíndola
c62a33965d load_broadcaster: Unregister from load_broadcaster::stop_broadcasting
This is in preparation for unregistration returning a future.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-01-22 08:16:03 -08:00
Rafael Ávila de Espíndola
7390485e20 repair: add row_level::stop()
Now unregister_ is called from stop(). This reduces the noise in a
followup patch.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-01-22 08:16:03 -08:00
Rafael Ávila de Espíndola
085544f054 locator: Return future from i_endpoint_snitch::reload_gossiper_state
This just reduces the noise of a followup patch.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-01-22 08:16:03 -08:00
Rafael Ávila de Espíndola
d9a71a7cff service: Refactor code into a atomic_vector class
This templates the code for listener_vector, renames it to
atomic_vector and moves it to the utils directory.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-01-22 08:16:03 -08:00
Rafael Ávila de Espíndola
baeb6744f6 migration_manager: Fix typo
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-01-22 08:16:03 -08:00
Rafael Ávila de Espíndola
9d4cf25c84 load_meter: Use a shared_ptr to store a load_broadcaster
load_broadcaster::stop_broadcasting uses shared_from_this(). Since
that is the only reference that the produced shared_ptr knows of, it
is deleted immediately. Fix that by also using a shared_ptr in
load_meter.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-01-22 08:16:03 -08:00
Pekka Enberg
0abb4e1742 Update seastar submodule
* seastar afc46681...147d50b1 (6):
  > perftune.py: Use safe_load() for fix arbitrary code execution
Fixes #5630
  > clang: current_exception_as_future must be in namespaced
  > tests: add an expected failures version of thread fixture
  > Enable stack guards in Dev builds
  > net: posix: Introduce load_balancing_algorithm::fixed
  > stream: Move _next from subscription to stream
2020-01-22 17:54:14 +02:00
Pavel Solodovnikov
e1b22b6a4c cql3: get rid of lw_shared_ptr for variable_specifications
`parsed_statement::get_bound_variables` is assumed to always
return a nonnull pointer to `variable_specifications` instance.

In this case using a pointer is superfluous and can be safely
replaced by a plain reference.

Also add a default ctor and a utility method `set_bound_variables`
to the `variable_specifications` class to actually reset the
contents of the class instance.

Tests: unit(dev, debug)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200120195839.164296-1-pa.solodovnikov@scylladb.com>
2020-01-22 12:51:02 +02:00
Avi Kivity
5d78d511ad Merge "cql: Simplify sum overflow" from Benny
"
As a followup to 0bde590
This series implements suggestions from @avikivity and @espindola
It simplifies the template definitions for accumulator_for,
adds some debug logging for the overflow values,
and adds unit tests for float and double sum overflow.

Test: unit(dev),
paging_test:TestPagingWithIndexingAndAggregation.test_filter_{indexed,non_indexed,pk}_column(dev)
"

* 'simplify-sum-overflow' of https://github.com/bhalevy/scylla:
  test: cql_query_test: test float/double sum overflow
  cql3: aggregate_fcts: simplify accumulator_for template definitions
2020-01-22 11:30:25 +02:00
Asias He
be9d7c3b28 storage_service: Yield in get_changed_ranges_for_leaving
It is always called inside a seastar thread. Call yield to prevent
stalls.

This patch fixes stalls like:

   2019-12-17T15:18:33+00:00 ip-10-0-116-62 !INFO | scylla: Reactor stalled
   for 4609 ms on shard 0.

   0x0000000002accbd2
   0x0000000002a4579b
   0x0000000002a45cc2
   0x0000000002a45ff7
   0x00007ff0a609be7f
   0x0000000001b0b500
   0x0000000001b03185
   0x0000000001af0d41
   0x0000000001af027a
   0x0000000001f7e89a
   0x0000000001f9f55a
   0x0000000001fc9c09
   0x0000000001fcac08
   0x00000000007dfee3

   /jenkins/slave/workspace/scylla-3.2/build/scylla/seastar/src/core/reactor.cc:1041
   (inlined by) seastar::reactor::block_notifier(int) at
   /jenkins/slave/workspace/scylla-3.2/build/scylla/seastar/src/core/reactor.cc:1164
   ?? ??:0 __gnu_cxx::__normal_iterator<dht::token const*,
   std::vector<dht::token, std::allocator<dht::token> > >
   std::__lower_bound<__gnu_cxx::__normal_iterator<dht::token const*,
   std::vector<dht::token, std::allocator<dht::token> > >, dht::token,
   __gnu_cxx::__ops::_Iter_less_val>(__gnu_cxx::__normal_iterator<dht::token
   const*, std::vector<dht::token, std::allocator<dht::token> > >,
   __gnu_cxx::__normal_iterator<dht::token const*, std::vector<dht::token,
   std::allocator<dht::token> > >, dht::token const&,
   __gnu_cxx::__ops::_Iter_less_val) at crtstuff.c:?
   locator::token_metadata::first_token_index(dht::token const&) const at
   crtstuff.c:? locator::token_metadata::ring_range(dht::token const&, bool) const at crtstuff.c:?
   locator::simple_strategy::calculate_natural_endpoints(dht::token const&,
   locator::token_metadata&) const at crtstuff.c:?
   service::storage_service::get_changed_ranges_for_leaving(seastar::basic_sstring<char,
   unsigned int, 15u, true>, gms::inet_address) at crtstuff.c:?
   service::storage_service::unbootstrap() at crtstuff.c:?
   service::storage_service::decommission()::{lambda(service::storage_service&)#1}::operator()(service::storage_service&)
   const::{lambda()#1}::operator()() const [clone .isra.0] at
   storage_service.cc:?

Refs: #5495
2020-01-22 12:36:15 +08:00
Asias He
74b787c91a storage_service: Make get_changed_ranges_for_leaving run inside thread
It is the only place where get_changed_ranges_for_leaving is not running
inside a thread. Preparing patch to futurize get_changed_ranges_for_leaving.

Refs: #5495
2020-01-22 12:36:13 +08:00
Piotr Sarna
9b379e3d63 db,view: fix checking for secondary index special columns
A mistake in handling legacy checks for special 'idx_token' column
resulted in not recognizing materialized views backing secondary
indexes properly. The mistake is really a typo, but with bad
consequences - instead of checking the view schema for being an index,
we asked for the base schema, which is definitely not an index of
itself.

Branches 3.1,3.2 (asap)
Fixes #5621
Fixes #4744
2020-01-21 22:32:04 +02:00
Rafael Ávila de Espíndola
27bd3fe203 service: Add a lock around migration_notifier::_listeners
Before this patch the iterations over migration_notifier::_listeners
could race with listeners being added and removed.

The addition side is not modified, since it is common to add a
listener during construction and it would require a fairly big
refactoring. Instead, the iteration is modified to use indexes instead
of iterators so that it is still valid if another listener is added
concurrently.

For removal we use a rw lock, since removing an element invalidates
indexes too. There are only a few places that needed refactoring to
handle unregister_listener returning a future<>, so this is probably
OK.

Fixes #5541.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200120192819.136305-1-espindola@scylladb.com>
2020-01-20 22:14:02 +02:00
Avi Kivity
c317b952a3 Merge "cql_query_test: Fix abandoned failed futures" from Rafael
"
This series fixes all abandoned failed futures in cql_query_test and
starts running it with --fail-on-abandoned-failed-futures to avoid
regressions.
"

* 'espindola/fix-abandoned-failed-futures' of https://github.com/espindola/scylla:
  cql_query_test: Avoid new abandoned failed futures
  cql_query_test: Explicitly ignore a failed future
  cql_query_test: Remove duplicated do_with_cql_env_thread
  cql_query_test: Fix cql and values in test_int_sum_with_cast
2020-01-20 20:40:56 +02:00
Rafael Ávila de Espíndola
4ce7cb9aa6 cql_query_test: Avoid new abandoned failed futures
Now that cql_query_test has no abandoned failed futures, run it with
--fail-on-abandoned-failed-futures to avoid regressions.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-01-20 09:23:22 -08:00
Rafael Ávila de Espíndola
ef5cd107ea cql_query_test: Explicitly ignore a failed future
This avoids an abandoned future warning.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-01-20 09:20:46 -08:00
Rafael Ávila de Espíndola
b547659c07 cql_query_test: Remove duplicated do_with_cql_env_thread
With this test_int_sum_with_cast now runs and passes.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-01-20 09:19:08 -08:00
Rafael Ávila de Espíndola
9334514c7c cql_query_test: Fix cql and values in test_int_sum_with_cast
This test is not running because of the double
do_with_cql_env_thread. Fix it before we remove the extra
do_with_cql_env_thread.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-01-20 09:17:35 -08:00
Avi Kivity
7d64b0f478 Update seastar submodule
* seastar 3f3e117de3...afc46681e5 (7):
  > json: add move assignment to json_return_type
  > net: do not check if an unsigned variabe is less than 0
  > stack: add virtual destructor definition for class w/ virtual functions
  > future,json: add ":" at end of concept definition
  > Fixing a bug in the handling of abort_accept()
  > install-dependencies.sh: improve arch detect
  > metrics: Avoid a copy during unregistration
2020-01-20 18:52:36 +02:00
Botond Dénes
e8a948ece6 configure.py: enable alloc failure injection for dev and debug modes
We have numerous tests that rely on the seastar alloc failure injection
infrastructure to test the exception safety of different components.
These tests are essentially useless when the said infrastructure is
not enabled, which is currently the case for all build modes, allowing
bugs to sneak in undetected.
Enable the allocation failure injection infrastructure for the dev and
debug modes. Sanitize is excluded as it produces some (suspected false
positive) failures and is not run in gating either currently.

Tests: unit(dev, debug)

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200117104747.748866-1-bdenes@scylladb.com>
2020-01-20 18:07:33 +02:00
Kamil Braun
957fa8da11 dht: make i_partitioner::get_token method(s) const 2020-01-20 14:55:12 +02:00
Nadav Har'El
bd419ae723 merge: alternator: Add prerequisites for tagging
Merged patch series from Piotr Sarna:

This miniseries adds two simple prerequisites for implementing tagging:
1. A table is able to generate its Arn identifier
2. Simple tests for TagResource, UntagResource, ListTagsOfResource

In general, tags should be stored in table metadata - either by
expanding the schema of an existing schema table, e.g. scylla_tables,
or by providing another meta-table - e.g. system_schema.alternator_tables,
which stores alternator-specific metadata, like tags.

Refs #5066

Tests: alternator-test(local, remote)

Piotr Sarna (2):
  alternator: add Arn support for tables
  alternator-test: add basic tests for tags

 alternator-test/test_describe_table.py |  1 -
 alternator-test/test_tag.py            | 88 ++++++++++++++++++++++++++
 alternator/executor.cc                 |  5 ++
 3 files changed, 93 insertions(+), 1 deletion(-)
 create mode 100644 alternator-test/test_tag.py
2020-01-20 14:42:40 +02:00
Piotr Jastrzebski
9279a679da keys.hh: make it independent from schema.hh
This cuts build dependency keys.hh -> schema.hh

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-01-20 14:25:17 +02:00
Piotr Sarna
b8277e43e5 alternator-test: add basic tests for tags
TagResource, UntagResource and ListTagsOfResource validation tests
are added.

Refs #5066
2020-01-20 12:24:51 +01:00
Piotr Sarna
8c17b5aec4 alternator: add Arn support for tables
Several APIs, e.g. TagResource, UntagResource and ListTagsOfResource
rely on identifying tables by their "Arn". According to the docs,
an Arn should uniquely identify a resource, so it's implemented as:
  arn:KEYSPACE_NAME:TABLE_NAME
which is a minimal set of information that uniquely identifies a table
in Scylla. The `arn:` prefix is needed for compatibility purposes.

This commit adds a simple function for generating the Arn string,
and also includes it in DescribeTable result under the TableArn attribute.
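
For illustration only, the format above can be produced by a one-line
helper. The function name table_arn is hypothetical; the real helper lives
in the C++ code in alternator/executor.cc.

```python
def table_arn(keyspace_name, table_name):
    """Build the Arn string in the arn:KEYSPACE_NAME:TABLE_NAME format."""
    return f"arn:{keyspace_name}:{table_name}"
```

For example, table_arn("ks", "tbl") yields the string "arn:ks:tbl".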

Refs #5066
2020-01-20 12:24:51 +01:00
Botond Dénes
a74a82d4d2 flat_mutation_reader: mutation_fragment_stream_validator: add name
Add a name parameter to the validator, so that the validator can be
identified in log messages. Schema identity information is added to the
name automatically. This should help pinpoint the problematic place
where validation failed.
Although at the moment we have a single validator, it still benefits
from having a name, as we can now include in it the name of the sstable
being written and hence trace the source of the bad data.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200117150616.895878-1-bdenes@scylladb.com>
2020-01-20 11:06:30 +01:00
Takuya ASADA
893dfbce59 dist/ami: update packer to 1.5.1
Update Packer to 1.5.1.
Needed to rename clean_ami_name -> clean_resource_name on scylla.json, since
the variable name had been changed.
Also fixed checksum verification code, trimmed unwanted extra strings
from sha256sum output.
2020-01-20 11:24:57 +02:00
Takuya ASADA
46386beba2 install.sh: convert relocate_python_scripts.py to a bash function
Since we need to run relocate_python_scripts.py at install time, and a
Python script may not be able to run in every environment, convert the
script to a bash function and merge it into install.sh.
2020-01-20 11:15:34 +02:00
Takuya ASADA
5627888b7c scylla_post_install.sh: fix 'integer expression expected' error
awk returns a float value on Debian, which causes the postinst script to
fail because we compare it as an integer value.
Replaced with sed + bash.

Fixes #5569
2020-01-20 11:13:55 +02:00
Asias He
343986a70b gossiper: Introduce gossip STATUS_UNKNOWN
When a node does not have gossip STATUS application_state, we currently
use an empty string to represent such state in get_gossip_status.

It is better to use an explicit "UNKNOWN" to represent it. It makes the
log easier to understand when the status is unknown.

 Before:

   'gossip - InetAddress n2 is now UP, status ='

 After:

   'gossip - InetAddress n2 is now UP, status = UNKNOWN'

This patch is safe because the STATUS_UNKNOWN is never sent over the
cluster. So the representation is only internal to the node.

Fixes #5520
2020-01-20 10:59:14 +02:00
Benny Halevy
2b383b404a test: cql_query_test: test float/double sum overflow
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-01-20 10:42:03 +02:00
Ivan Prisyazhnyy
8fde8e3600 dep: support arch linux
Support arch linux dependencies.

Tested on Arch 5.4.2-arch1-1 and docker archlinux.

Signed-off-by: Ivan Prisyazhnyy <ivan@scylladb.com>
Message-Id: <20200118162110.824317-1-ivan@scylladb.com>
2020-01-19 14:30:03 +02:00
Benny Halevy
476a102de0 cql3: aggregate_fcts: simplify accumulator_for template definitions
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-01-19 08:26:40 +02:00
Avi Kivity
12bc965f71 atomic_cell: consistently use comma as separator in pretty-printers
The atomic_cell pretty printers use a mix of commas and semicolons.
This change makes them use commas everywhere, for consistency.
Message-Id: <20200116133327.2610280-1-avi@scylladb.com>
2020-01-16 17:26:33 +01:00
Nadav Har'El
1ed21d70dc merge: CDC: do mutation augmentation from storage proxy
Merged pull request https://github.com/scylladb/scylla/pull/5567
from Calle Wilund:

Fixes #5314

Instead of tying CDC handling into cql statement objects, this patch set
moves it to storage proxy, i.e. shared code for mutating stuff. This means
we automatically handle cdc for code paths outside cql (i.e. alternator).

It also adds api handling (though initially inefficient) for batch statements.

CDC is tied into storage proxy by giving the former a ref to the latter (per
shard). Initially this is not a constructor parameter, because right now we
have chicken and egg issues here. Hopefully, Pavel's refactoring of migration
manager and notifications will untie these and this relationship can become
nicer.

The actual augmentation can (as stated above) be made much more efficient.
Hopefully, the stream management refactoring will deal with expensive stream
lookup, and eventually, we can maybe coalesce pre-image selects for batches.
However, that is left as an exercise for when deemed needed.

The augmentation API has an optional return value for a "post-image handler"
to be used iff returned after mutation call is finished (and successful).
It is not yet actually invoked from storage_proxy, but it is at least in the
call chain.
2020-01-16 17:12:56 +02:00
Avi Kivity
e677f56094 Merge "Enable general centos RPM (not only centos7)" from Hagit 2020-01-16 14:13:24 +02:00
Tomasz Grabiec
36d90e637e Merge "Relax migration manager dependencies" from Pavel Emelyanov
The set makes dependencies between mm and other services cleaner,
in particular, after the set:

- the query processor no longer needs migration manager
  (which doesn't need query processor either)

- the database no longer needs migration manager, thus the mutual
  dependency between these two is dropped, only migration manager
  -> database is left

- the migration manager -> storage_service dependency is relaxed,
  one more patchset will be needed to remove it, thus dropping one
  more mutual dependency between them, only the storage_service
  -> migration manager will be left

- the migration manager is stopped on drain, but several more
  services need it on stop, thus causing use-after-free problems.
  In particular, a bug was caught where the view builder crashes
  when unregistering from the notifier list on stop. Fixed.

Tests: unit(dev)
Fixes: #5404
2020-01-16 12:12:25 +01:00
Hagit Segev
d0405003bd building-packages doc: Update no specific el7 on path 2020-01-16 12:49:08 +02:00
Rafael Ávila de Espíndola
c42a2c6f28 configure: Add -O1 when compiling generated parsers
Enabling asan enables a few cleanup optimizations in gcc. The net
result is that using

  -fsanitize=address -fno-sanitize-address-use-after-scope

produces code that uses a lot less stack than if the file is compiled
with just -O0.

This patch adds -O1 in addition to
-fno-sanitize-address-use-after-scope to protect the unfortunate
developer that decides to build in dev mode with --cflags='-O0 -g'.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200116012318.361732-2-espindola@scylladb.com>
2020-01-16 12:05:50 +02:00
Rafael Ávila de Espíndola
317e0228a8 configure: Put user flags after the mode flags
It is sometimes convenient to build with flags that don't match any
existing mode.

Recently I was tracking a bug that would not reproduce with debug, but
reproduced with dev, so I tried debugging the result of

./configure.py --cflags="-O0 -g"

While the binary had debug info, it still had optimizations because
configure.py put the mode flags after the user flags (-O0 -O1). This
patch flips the order (-O1 -O0) so that the flags passed in the
command line win.
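
The effect of the ordering can be sketched as follows, relying on the
compiler rule that the last -O option on the command line is the effective
one. The helper effective_opt_level and the flag lists are illustrative
and are not part of configure.py.

```python
def effective_opt_level(flags):
    """Return the last -O<level> flag in `flags`, or None if there is none."""
    level = None
    for flag in flags:
        if flag.startswith("-O"):
            level = flag  # a later -O flag overrides an earlier one
    return level


mode_flags = ["-O1"]          # assumed optimization flag of the build mode
user_flags = ["-O0", "-g"]    # from --cflags="-O0 -g"

before = user_flags + mode_flags   # old order: mode flags come last
after = mode_flags + user_flags    # new order: user flags come last
```

Here effective_opt_level(before) is "-O1" and effective_opt_level(after) is
"-O0", matching the behaviour described above.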

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200116012318.361732-1-espindola@scylladb.com>
2020-01-16 12:05:50 +02:00
Gleb Natapov
51281bc8ad lwt: fix write timeout exception reporting
CQL transport code relies on an exception's C++ type to create correct
reply, but in lwt we converted some mutation_timeout exceptions to more
generic request_timeout while forwarding them which broke the protocol.
Do not drop type information.

Fixes #5598.

Message-Id: <20200115180313.GQ9084@scylladb.com>
2020-01-16 12:05:50 +02:00
Piotr Jastrzębski
0c8c1ec014 config: fix description of enable_deprecated_partitioners
Murmur3 is the default partitioner.
ByteOrder and Random are the deprecated ones
and should be mentioned in the description.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-01-16 12:05:50 +02:00
Nadav Har'El
9953a33354 merge "Adding a schema file when creating a snapshot"
Merged pull request https://github.com/scylladb/scylla/pull/5294 from
Amnon Heiman:

To use a snapshot we need a schema file that is similar to the result of
running cql DESCRIBE command.

The DESCRIBE is implemented in the cql driver so the functionality needs
to be re-implemented inside scylla.

This series adds a describe method to the schema file and uses it when doing
a snapshot.

There are different approaches to handling materialized views and
secondary indexes.

This implementation creates each schema.cql file in its own relevant
directory, so the schema for a materialized view, for example, will be
placed in the snapshot directory of the table of that view.

Fixes #4192
2020-01-16 12:05:50 +02:00
Piotr Dulikowski
c383652061 gossip: allow for aborting on sleep
This commit makes most sleeps in gossip.cc abortable. It is now possible
to quickly shut down a node during startup, most notably during the
phase while it waits for gossip to settle.
2020-01-16 12:05:50 +02:00
Avi Kivity
e5e0642f2a tools: toolchain: add dependencies for building debian and rpm packages
This reduces network traffic and eliminates time for installation when
building packages from the frozen toolchain, as well as isolating the
build from updates to those dependencies which may cause breakage.
2020-01-16 12:05:50 +02:00
Pekka Enberg
da9dae3dbe Merge 'test.py: add support for CQL tests' from Kostja
This patch set adds support for CQL tests to test.py,
as well as many other improvements:

* --name is now a positional argument
* test output is preserved in testlog/${mode}
* concise output format
* better color support
* arbitrary number of test suites
* per-suite yaml-based configuration
* options --jenkins and --xunit are removed and xml
  files are generated for all runs

A simple driver is written in C++ to read CQL from
standard input, execute in embedded mode and produce output.

The patch is checked with BYO.

Reviewed-by: Dejan Mircevski <dejan@scylladb.com>
* 'test.py' of github.com:/scylladb/scylla-dev: (39 commits)
  test.py: introduce BoostTest and virtualize custom boost arguments
  test.py: sort tests within a suite, and sort suites
  test.py: add a basic CQL test
  test.py: add CQL .reject files to gitignore
  test.py: print a colored unidiff in case of test failure
  test.py: add CqlTestSuite to run CQL tests
  test.py: initial import of CQL test driver, cql_repl
  test.py: remove custom colors and define a color palette
  test.py: split test output per test mode
  test.py: remove tests_to_run
  test.py: virtualize Test.run(), to introduce CqlTest.Run next
  test.py: virtualize test search pattern per TestSuite
  test.py: virtualize write_xunit_report()
  test.py: ensure print_summary() is agnostic of test type
  test.py: tidy up print_summary()
  test.py: introduce base class Test for CQL and Unit tests
  test.py: move the default arguments handling to UnitTestSuite
  test.py: move custom unit test command line arguments to suite.yaml
  test.py: move command line argument processing to UnitTestSuite
  test.py: introduce add_test(), which is suite-specific
  ...
2020-01-16 12:05:50 +02:00
Pekka Enberg
e8b659ec5d dist/docker: Remove Ubuntu-based Docker image
The Ubuntu-based Docker image uses Scylla 1.0 and has not been updated
since 2017. Let's remove it as unmaintained.

Message-Id: <20200115102405.23567-1-penberg@scylladb.com>
2020-01-16 12:05:50 +02:00
Avi Kivity
546556b71b Merge "allow commitlog to wait for specific entries to be flushed on disk" from Gleb
"
Currently commitlog supports two modes of operation. First is 'periodic'
mode where all commitlog writes are ready the moment they are stored in
a memory buffer and the memory buffer is flushed to a storage periodically.
Second is a 'batch' mode where each write is flushed as soon as possible
(after previous flush completed) and writes are only ready after they
are flushed.

The first option is not very durable, the second is not very efficient.
This series adds an option to mark some writes as "more durable" in
periodic mode meaning that they will be flushed immediately and reported
complete only after the flush is complete (flushing a durable write also
flushes all writes that came before it). It also changes paxos to use
those durable writes to store paxos state.
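
A toy model of the idea, assuming a single in-memory buffer in front of
the disk (the class PeriodicLog and its methods are invented for this
sketch and do not mirror the commitlog API):

```python
class PeriodicLog:
    """Toy log: writes are buffered, and a sync write forces a flush."""

    def __init__(self):
        self.buffer = []   # entries acknowledged but not yet on "disk"
        self.disk = []     # entries that have been flushed

    def flush(self):
        # Flushing writes out everything buffered so far, in order.
        self.disk.extend(self.buffer)
        self.buffer.clear()

    def append(self, entry, sync=False):
        self.buffer.append(entry)
        if sync:
            # A durable write is reported complete only after the flush,
            # which also covers every earlier buffered write.
            self.flush()
```

In this model a periodic timer would call flush() for ordinary writes,
while a write with sync=True does not wait for the timer.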

Note that strictly speaking the last patch is not needed since after
writing to an actual table the code updates the paxos table and the latter
uses durable writes that make sure all previous writes are flushed. Given
that both writes are supposed to run on the same shard this should be enough.
But it feels right to make base table writes durable as well.
"

* 'gleb/commilog_sync_v4' of github.com:scylladb/seastar-dev:
  paxos: immediately sync commitlog entries for writes made by paxos learn stage
  paxos: mark paxos table schema as "always sync"
  schema: allow schema to be marked as 'always sync to commitlog'
  commitlog: add test for per entry sync mode
  database: pass sync flag from db::apply function to the commitlog
  commitlog: add sync method to entry_writer
2020-01-16 12:05:50 +02:00
Rafael Ávila de Espíndola
2ebd1463b2 tests: Handle null and not present values differently
Before this patch result_set_assertions was handling both null values
and missing values in the same way.

This patch changes the handling of missing values so that now checking
for a null value is not the same as checking for a value not being
present.
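
The distinction can be sketched in Python as follows. The helper names
(MISSING, cell, is_null, is_missing) are hypothetical and only illustrate the
idea; the actual result_set_assertions code is C++ and may look quite
different.

```python
# Hypothetical helpers illustrating "null" versus "not present".
MISSING = object()  # sentinel: the column does not appear in the row at all


def cell(row, column):
    # dict.get with a sentinel default separates "absent" from "present but None".
    return row.get(column, MISSING)


def is_null(row, column):
    return cell(row, column) is None


def is_missing(row, column):
    return cell(row, column) is MISSING


row = {"pk": 1, "v": None}
```

Here "v" is null but present, while a column such as "w" is missing, and the
two checks no longer give the same answer.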

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200114184116.75546-1-espindola@scylladb.com>
2020-01-16 12:05:50 +02:00
Botond Dénes
0c52c2ba50 data: make cell::make_collection(): more consistent and safer
3ec889816 changed cell::make_collection() to take different code paths
depending whether its `data` argument is nothrow copyable/movable or
not. In case it is not, it is wrapped in a view to make it so (see the
above mentioned commit for a full explanation), relying on the method's
pre-existing requirement for callers to keep `data` alive while the
created writer is in use.
On closer look however it turns out that this requirement is neither
respected, nor enforced, at least not on the code level. The real
requirement is that the underlying data represented by `data` is kept
alive. If `data` is a view, it is not expected to be kept alive and
callers don't, it is instead copied into `make_collection()`.
Non-views however *are* expected to be kept alive. This makes the API
error prone.
To avoid any future errors due to this ambiguity, require all `data`
arguments to be nothrow copyable and movable. Callers are now required
to pass views of nonconforming objects.

This patch is a usability improvement and is not fixing a bug. The
current code works as-is because it happens to conform to the underlying
requirements.

Refs: #5575
Refs: #5341

Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200115084520.206947-1-bdenes@scylladb.com>
2020-01-16 12:05:50 +02:00
Amnon Heiman
ac8aac2b53 tests/cql_query_test: Add schema describe tests
This patch adds tests for the describe method.

test_describe_simple_schema tests regular tables.

test_describe_view_schema tests view and index.

Each test creates a table, finds the schema, calls the describe method and
compares the results to the string that was used to create the table.

The view tests also verify that adding an index or view does not change
the base table.

When comparing results, leading and trailing whitespace is ignored
and all combinations of whitespace and newlines are treated equally.
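
A whitespace-insensitive comparison of this kind could look roughly like the
Python sketch below. The function names are hypothetical and the real test
is written in C++; this only shows one simple way to get the described
behaviour.

```python
def normalize_ws(text):
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines) and drops leading and trailing whitespace.
    return " ".join(text.split())


def same_modulo_whitespace(a, b):
    return normalize_ws(a) == normalize_ws(b)
```

Note that this treats any whitespace run as a single separator, but it does
not make "a b" equal to "ab".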

Additional tests may be added at a future phase if required.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2020-01-15 15:07:57 +02:00
Amnon Heiman
028525daeb database: add schema.cql file when creating a snapshot
When creating a snapshot we need to add a schema.cql file in the
snapshot directory that describes the table in that snapshot.

This patch adds the file using the schema describe method.

get_snapshot_details and manifest_json_filter were modified to ignore
the schema.cql file.

Fixes #4192

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2020-01-15 15:06:00 +02:00
Amnon Heiman
82367b325a schema: Add a describe method
This patch adds a describe method to a table schema.

It acts similarly to the DESCRIBE CQL command that is implemented in a CQL
driver.

The method supports tables, secondary indexes, local indexes and
materialized views.

relates to: #4192

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2020-01-15 15:06:00 +02:00
Amnon Heiman
6f58d51c83 secondary_index_manager: add the index_name_from_table_name function
index_name_from_table_name is the reverse of index_table_name:
it takes a table name that was generated for an index and returns the name
of the index that generated that table.
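
As an illustration of the round trip, here is a Python sketch. It assumes
that the backing table name is the index name plus an "_index" suffix; that
suffix is an assumption of this sketch and is not stated in the commit
message, and the real functions are C++.

```python
INDEX_TABLE_SUFFIX = "_index"  # assumed naming scheme, for illustration only


def index_table_name(index_name):
    # Forward direction: name of the table that backs the index.
    return index_name + INDEX_TABLE_SUFFIX


def index_name_from_table_name(table_name):
    # Reverse direction: strip the suffix, rejecting names that were not
    # produced by index_table_name().
    if (not table_name.endswith(INDEX_TABLE_SUFFIX)
            or len(table_name) == len(INDEX_TABLE_SUFFIX)):
        raise ValueError(f"{table_name!r} is not an index table name")
    return table_name[:-len(INDEX_TABLE_SUFFIX)]
```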

Relates to #4192

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2020-01-15 15:06:00 +02:00
Pavel Emelyanov
555856b1cd migration_manager: Use in-place value factory
The factory is purely stateless, so it makes no difference which
instance of it is used, and we may omit referencing the storage_service
in passive_announce.

This is 2nd simple migration_manager -> storage_service link to cut
(more to come later).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-01-15 14:29:21 +03:00
Pavel Emelyanov
f129d8380f migration_manager: Get database through storage_proxy
There are several places where migration_manager needs storage_service
reference to get the database from, thus forming the mutual dependency
between them. This is the simplest case where the migration_manager
link to the storage_service can be cut -- the database reference can be
obtained from storage_proxy instead.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-01-15 14:29:21 +03:00
Pavel Emelyanov
5cf365d7e7 database: Explicitly pass migration_manager through init_non_system_keyspace
This is the last place where database code needs the migration_manager
instance to be alive, so now the mutual dependency between these two
is gone: only the migration_manager needs the database, but not
vice versa.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-01-15 14:29:21 +03:00
Pavel Emelyanov
ebebf9f8a8 database: Do not request migration_manager instance for passive_announce
The helper in question is static, so no need to play with the
migration_manager instances.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-01-15 14:29:21 +03:00
Pavel Emelyanov
3f84256853 migration_manager: Remove register/unregister helpers
In the 2nd patch the migration_manager kept those for
simpler patching, but now we can drop them.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-01-15 14:29:21 +03:00
Pavel Emelyanov
9e4b41c32a tests: Switch on migration notifier
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-01-15 14:29:21 +03:00
Pavel Emelyanov
9d31bc166b cdc: Use migration_notifier to (un)register for events
If no one provided -- get it from storage_service.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-01-15 14:29:19 +03:00
Pavel Emelyanov
ecab51f8cc storage_service: Use migration_notifier (and stop worrying)
The storage_service needs migration_manager for notifications and
carefully handles the manager's stop process so as not to demolish the
listeners list from under itself. From now on this dependency is
no longer valid (however, the storage_service still seems to need the
migration_manager, but that is a different story).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-01-15 14:28:21 +03:00
Pavel Emelyanov
7814ed3c12 cql_server: Use migration_notifier in events_notifier
This patch removes an implicit cql_server -> migration_manager
dependency, as the former's event notifier uses the latter
for notifications.

Removing this dependency also breaks a loop:
storage_service -> cql_server -> migration_manager -> storage_service

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-01-15 14:28:21 +03:00
Pavel Emelyanov
d9edcb3f15 query_processor: Use migration_notifier
This patch breaks one (probably harmless but still) dependency
loop. The query_processor -> migration_manager -> storage_proxy
 -> tracing -> query_processor.

The first link is not needed, as the query_processor needs the
migration_manager purely to (un)subscribe on notifications.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-01-15 14:28:21 +03:00
Pavel Emelyanov
2735024a53 auth: Use migration_notifier
The same as with view builder. The constructor still needs both,
but the life-time reference is now for notifier only.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-01-15 14:28:21 +03:00
Pavel Emelyanov
28f1250b8b view_builder: Use migration notifier
The migration manager itself is still needed on start to wait
for schema agreement, but there's no longer the need for the
life-time reference on it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-01-15 14:28:21 +03:00
Pavel Emelyanov
7cfab1de77 database: Switch on mnotifier from migration_manager
Do not call for local migration manager instance to send notifications,
call for the local migration notifier, it will always be alive.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-01-15 14:28:21 +03:00
Pavel Emelyanov
f45b23f088 storage_service: Keep migration_notifier
The storage service will need this guy to initialize sub-services
with. Also it registers itself with notifiers.

That said, it's convenient to have the migration notifier on board.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-01-15 14:28:21 +03:00
Pavel Emelyanov
e327feb77f database: Prepare to use on-database migration_notifier
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-01-15 14:28:21 +03:00
Pavel Emelyanov
f240d5760c migration_manager: Split notifier from main class
The _listeners list on migration_manager class and the corresponding
notify_xxx helpers have nothing to do with its instances, they
are just transport for notification delivery.

At the same time some services need the migration manager to be alive
at their stop time to unregister from it, while the manager itself
may need them for its needs.

The proposal is to move the migration notifier into a completely separate
sharded "service". This service doesn't need anything, so it's started
first and stopped last.

While it's not effectively a "migration" notifier, we inherited the name
from Cassandra and renaming it will "scramble neurons in the old-timers'
brains but will make it easier for newcomers" as Avi says.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-01-15 14:28:19 +03:00
Pavel Emelyanov
074cc0c8ac migration_manager: Helpers for on_before_ notifications
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-01-15 14:27:27 +03:00
Pavel Emelyanov
1992755c72 storage_service: Kill initialization helper from init.cc
The helper just makes further patching more complex, so drop it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-01-15 14:27:27 +03:00
Konstantin Osipov
a665fab306 test.py: introduce BoostTest and virtualize custom boost arguments 2020-01-15 13:37:25 +03:00
Gleb Natapov
51672e5990 paxos: immediately sync commitlog entries for writes made by paxos learn stage 2020-01-15 12:15:42 +02:00
Gleb Natapov
0fc48515d8 paxos: mark paxos table schema as "always sync"
We want all writes to the paxos table to be persisted to storage before
being declared completed.
2020-01-15 12:15:42 +02:00
Gleb Natapov
16e0fc4742 schema: allow schema to be marked as 'always sync to commitlog'
All writes that use this schema will be immediately persisted to
storage.
2020-01-15 12:15:42 +02:00
Gleb Natapov
0ce70c7a04 commitlog: add test for per entry sync mode 2020-01-15 12:15:42 +02:00
Gleb Natapov
29574c1271 database: pass sync flag from db::apply function to the commitlog
Allow upper layers to request a mutation to be persisted on disk before
making the future ready, independent of which mode the commitlog is running in.
2020-01-15 12:15:42 +02:00
Gleb Natapov
e0bc4aa098 commitlog: add sync method to entry_writer
If the method returns true, the commitlog should sync to file immediately
after writing the entry and wait for the flush to complete before returning.
2020-01-15 12:15:42 +02:00
Piotr Sarna
9aab75db60 alternator: clean up single value rjson comparator
The comparator is refreshed to ensure the following:
 - null compares less than all other types;
 - null, true and false are comparable against each other,
   while other types are only comparable against themselves and null.

Comparing mixed types is not currently reachable from the alternator
API, because it's only used for sets, which can only use
strings, binary blobs and numbers - thus, no new pytest cases are added.

Fixes #5454
2020-01-15 10:57:49 +02:00
Juliusz Stasiewicz
d87d01b501 storage_proxy: intercept rpc::closed_error if counter leader is down (#5579)
When counter mutation is about to be sent, a leader is elected, but
if the leader fails after election, we get `rpc::closed_error`. The
exception propagates high up, causing all connections to be dropped.

This patch intercepts `rpc::closed_error` in `storage_proxy::mutate_counters`
and translates it to `mutation_write_failure_exception`.
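
The interception pattern can be sketched in Python as below. ClosedError and
MutationWriteFailure are hypothetical stand-ins for `rpc::closed_error` and
`mutation_write_failure_exception`, and the mutate_counters signature here
is invented for illustration; the real code is asynchronous C++.

```python
class ClosedError(Exception):
    """Stand-in for rpc::closed_error (the connection to the leader went away)."""


class MutationWriteFailure(Exception):
    """Stand-in for mutation_write_failure_exception."""


def mutate_counters(send_to_leader):
    # Catch the transport-level error at this boundary and translate it into
    # a write failure, so it does not propagate further up the call stack.
    try:
        return send_to_leader()
    except ClosedError as e:
        raise MutationWriteFailure("counter leader is down") from e
```

Callers then only need to handle the write failure, while the original error
stays reachable as the exception's cause.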

References #2859
2020-01-15 09:56:45 +01:00
Konstantin Osipov
a351ea57d5 test.py: sort tests within a suite, and sort suites
This makes it easier to navigate the test artefacts.

No need to sort suites since they are already
stored in a dict.
2020-01-15 11:41:19 +03:00
Konstantin Osipov
ba87e73f8e test.py: add a basic CQL test 2020-01-15 11:41:19 +03:00
Konstantin Osipov
44d31db1fc test.py: add CQL .reject files to gitignore
To avoid accidental commit, add .reject files to .gitignore
2020-01-15 11:41:19 +03:00
Konstantin Osipov
4f64f0c652 test.py: print a colored unidiff in case of test failure
Print a colored unidiff between result and reject files in case of test
failure.
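
A colored unified diff can be produced with Python's standard difflib module.
The sketch below is an assumption about how this might be done, not the code
from test.py: the function name, the default file labels and the choice of
ANSI colors are all hypothetical.

```python
import difflib

# Plain ANSI escape sequences; the palette is an arbitrary choice.
RED, GREEN, CYAN, RESET = "\033[31m", "\033[32m", "\033[36m", "\033[0m"


def colored_unidiff(expected, actual, fromfile="test.result", tofile="test.reject"):
    lines = difflib.unified_diff(
        expected.splitlines(), actual.splitlines(),
        fromfile=fromfile, tofile=tofile, lineterm="")
    out = []
    for line in lines:
        if line.startswith("+++") or line.startswith("---"):
            out.append(line)                   # file headers stay uncolored
        elif line.startswith("+"):
            out.append(GREEN + line + RESET)   # line only in the reject file
        elif line.startswith("-"):
            out.append(RED + line + RESET)     # line only in the result file
        elif line.startswith("@@"):
            out.append(CYAN + line + RESET)    # hunk header
        else:
            out.append(line)                   # unchanged context
    return "\n".join(out)
```

Identical inputs produce an empty string, so the caller can use the result
both as the failure report and as the pass/fail signal.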
2020-01-15 11:41:19 +03:00
Konstantin Osipov
d3f9e64028 test.py: add CqlTestSuite to run CQL tests
Run the test and compare results. Manage temporary
and .reject files.

Now that there are CQL tests, improve logging.

run_test success no longer means test success.
2020-01-15 11:41:19 +03:00
Konstantin Osipov
b114bfe0bd test.py: initial import of CQL test driver, cql_repl
cql_repl is a simple program which reads CQL from stdin,
executes it, and writes results to stdout.

It supports --input, --output and --log options.
--log is directed to cql_test.log by default.
--input is stdin by default
--output is stdout by default.

The result set output is printed with a basic
JSON visitor.
2020-01-15 11:41:16 +03:00
Konstantin Osipov
0ec27267ab test.py: remove custom colors and define a color palette
Using a standard Python module improves readability,
and allows using colors easily in other output.
2020-01-15 10:53:24 +03:00
Konstantin Osipov
0165413405 test.py: split test output per test mode
Store test temporary files and logs in ${testdir}/${mode}.
Remove --jenkins and --xunit, and always write XML
files at a predefined location: ${testdir}/${mode}/xml/.

Use the .xunit.xml extension for tests whose XML output is
in xunit format, and junit.xml for the accumulated output
of all non-boost tests in junit format.
2020-01-15 10:53:24 +03:00
Konstantin Osipov
4095ab08c8 test.py: remove tests_to_run
Avoid storing each test twice, use per-tests
list to construct a global iterable.
2020-01-15 10:53:24 +03:00
Konstantin Osipov
169128f80b test.py: virtualize Test.run(), to introduce CqlTest.Run next 2020-01-15 10:53:24 +03:00
Konstantin Osipov
d05f6c3cc7 test.py: virtualize test search pattern per TestSuite
CQL tests have .cql extension, while unit tests
have .cc.
2020-01-15 10:53:24 +03:00
Konstantin Osipov
abcc182ab3 test.py: virtualize write_xunit_report()
Make sure any non-boost test can participate in the report.
2020-01-15 10:53:24 +03:00
Konstantin Osipov
18aafacfad test.py: ensure print_summary() is agnostic of test type
Introduce a virtual Test.print_summary() to print
a failed test summary.
2020-01-15 10:53:24 +03:00
Konstantin Osipov
21fbe5fa81 test.py: tidy up print_summary()
Now that we have tabular output, make print_summary()
more concise.
2020-01-15 10:53:24 +03:00
Konstantin Osipov
c171882b51 test.py: introduce base class Test for CQL and Unit tests 2020-01-15 10:53:24 +03:00
Konstantin Osipov
fd6897d53e test.py: move the default arguments handling to UnitTestSuite
Move UnitTest default seastar argument handling to UnitTestSuite
(cleanup).
2020-01-15 10:53:24 +03:00
Konstantin Osipov
d3126f08ed test.py: move custom unit test command line arguments to suite.yaml
Load the command line arguments, if any, from suite.yaml, rather
than keep them hard-coded in test.py.

This allows the operations team to have easier access to these.

Note I had to sacrifice dynamic smp count for mutation_reader_test
(the new smp count is fixed at 3) since this is part
of test configuration now.
2020-01-15 10:53:24 +03:00
Konstantin Osipov
ef6cebcbd2 test.py: move command line argument processing to UnitTestSuite 2020-01-15 10:53:24 +03:00
Konstantin Osipov
4a20617be3 test.py: introduce add_test(), which is suite-specific 2020-01-15 10:53:24 +03:00
Konstantin Osipov
7e10bebcda test.py: move long test list to suite.yaml
Use suite.yaml for long tests
2020-01-15 10:53:24 +03:00
Konstantin Osipov
32ffde91ba test.py: move test id assignment to TestSuite
Going forward finding and creating tests will be
a responsibility of TestSuite, so the id generator
needs to be shared.
2020-01-15 10:53:24 +03:00
Konstantin Osipov
b5b4944111 test.py: move repeat handling to TestSuite
This way we can avoid iterating over all tests
to handle --repeat.
Besides, going forward the tests will be stored
in two places: in the global list of all tests,
for the runner, and per suite, for suite-based
reporting, so it's easier if TestSuite
is fully responsible for finding and adding tests.
2020-01-15 10:53:24 +03:00
Konstantin Osipov
34a1b49fc3 test.py: move add_test_list() to TestSuite 2020-01-15 10:53:24 +03:00
Konstantin Osipov
44e1c4267c test.py: introduce test suites
- UnitTestSuite - for test/unit tests
- BoostTestSuite - a tweak on UnitTestSuite, with options
  to log xml test output to a dedicated file
2020-01-15 10:53:24 +03:00
Konstantin Osipov
eed3201ca6 test.py: use path, rather than test kind, for search pattern
Going forward there may be multiple suites of the same kind.
2020-01-15 10:53:24 +03:00
Konstantin Osipov
f95c97667f test.py: support arbitrary number of test suites
Scan entire test/ for folders that contain suite.yaml,
and load tests from these folders. Skip the rest.

Each folder with a suite.yaml is expected to have a valid
suite configuration in the yaml file.

A suite is a folder with tests of the same type. E.g.
it can be a folder with unit tests, boost tests, or CQL
tests.

The harness will use suite.yaml to create an appropriate
suite test driver, to execute tests in different formats.
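
The discovery step could be sketched with the standard pathlib module as
follows. find_suite_dirs is a hypothetical name, and the sketch stops short
of parsing suite.yaml or choosing a driver, since the file's schema is not
described here.

```python
from pathlib import Path


def find_suite_dirs(test_root):
    # Every directory under test_root that contains a suite.yaml is a suite;
    # directories without one are skipped simply by never being matched.
    return sorted(p.parent for p in Path(test_root).rglob("suite.yaml"))
```

Sorting the result keeps the order of suites stable between runs.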
2020-01-15 10:53:24 +03:00
Konstantin Osipov
c1f8169cd4 test.py: add suite.yaml to boost and unit tests
The plan is to move suite-specific settings to the
configuration file.
2020-01-15 10:53:24 +03:00
Konstantin Osipov
ec9ad04c8a test.py: move 'success' to TestUnit class
There will be other success attributes: program return
status 0 doesn't mean the test is successful for all tests.
2020-01-15 10:53:24 +03:00
Konstantin Osipov
b4aa4d35c3 test.py: save test output in tmpdir
It is handy to have it so that a reference of a failed
test is available without re-running it.
2020-01-15 10:53:24 +03:00
Konstantin Osipov
f4efe03ade test.py: always produce xml output, derive output paths from tmpdir
It reduces the number of configurations to re-test when test.py is
modified, and simplifies usage of test.py in build tools, since you no
longer need to bother with extra arguments.
2020-01-15 10:53:24 +03:00
Konstantin Osipov
d2b546d464 test.py: output job count in the log 2020-01-15 10:53:24 +03:00
Konstantin Osipov
233f921f9d test.py: make test output brief&tabular
New format:

% ./test.py --verbose --mode=release
================================================================================
[N/TOTAL] TEST                                                 MODE   RESULT
------------------------------------------------------------------------------
[1/111]   boost/UUID_test                                    release  [ PASS ]
[2/111]   boost/enum_set_test                                release  [ PASS ]
[3/111]   boost/like_matcher_test                            release  [ PASS ]
[4/111]   boost/observable_test                              release  [ PASS ]
[5/111]   boost/allocation_strategy_test                     release  [ PASS ]
^C
% ./test.py foo
================================================================================
[N/TOTAL] TEST                                                 MODE   RESULT
------------------------------------------------------------------------------
[3/3]     unit/memory_footprint_test                          debug   [ PASS ]
------------------------------------------------------------------------------
2020-01-15 10:53:24 +03:00
Konstantin Osipov
879bea20ab test.py: add a log file
Going forward I'd like to make terminal output brief&tabular,
but some test details are necessary to preserve so that a failure
is easy to debug. This information now goes to the log file.

- open and truncate the log file on each harness start
- log options of each invoked test in the log, so that
  a failure is easy to reproduce
- log test result in the log

Since tests are run concurrently, having an exact
trace of concurrent execution also helps
debugging flaky tests.
2020-01-15 10:53:24 +03:00
Konstantin Osipov
cbee76fb95 test.py: gitignore the default ./test.py tmpdir, ./testlog 2020-01-15 10:53:24 +03:00
Konstantin Osipov
1de69228f1 test.py: add --tmpdir
It will be used for test log files.
2020-01-15 10:53:24 +03:00
Konstantin Osipov
caf742f956 test.py: flake8 style fix 2020-01-15 10:53:24 +03:00
Konstantin Osipov
dab364c87d test.py: sort imports 2020-01-15 10:53:24 +03:00
Konstantin Osipov
7ec4b98200 test.py: make name a positional argument.
Accept multiple test names, treat test name
as a substring, and if the same name is given
multiple times, run the test multiple times.
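
The described command line behaviour can be sketched with argparse. This is
an illustration under assumptions, not the test.py code: select_tests and
its arguments are hypothetical names.

```python
import argparse


def select_tests(argv, available):
    parser = argparse.ArgumentParser()
    # nargs="*" accepts zero or more positional test names.
    parser.add_argument("name", nargs="*", help="substring of a test name")
    args = parser.parse_args(argv)
    if not args.name:
        return list(available)  # no names given: run everything
    selected = []
    for pattern in args.name:
        # Substring match; repeating a name repeats its matches.
        selected.extend(t for t in available if pattern in t)
    return selected
```

With this scheme, a name given twice selects its matching tests twice, which
gives the "run the test multiple times" behaviour without a separate flag.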
2020-01-15 10:53:24 +03:00
Dejan Mircevski
bb2e04cc8b alternator: Improve comments on comparators
Some comparator methods in conditions.cc use unexpected operators;
explain why.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-01-14 22:25:55 +02:00
Tomasz Grabiec
c8a5a27bd9 Merge "storage_service: Move load_broadcaster away" from Pavel E.
The storage_service struct is a collection of diverse things,
most of them required only on start and on stop and/or running
on shard 0 (but it is nonetheless sharded).

As a part of cleaning up this structure and the inter-component
dependencies it generates, here's the sanitation of load_broadcaster.
2020-01-14 19:26:06 +01:00
Calle Wilund
313ed91ab0 cdc: Listen for migration callbacks on all shards
Fixes #5582

... but only populate log on shard 0.

Migration manager callbacks are slightly asymmetric. Notifications
for pre-create/update mutations are sent only on the initiating shard
(necessary, because we consider the mutations mutable).
But "created" callbacks are sent on all shards (immutable).

We must subscribe on all shards, but still populate the cdc table
only once, otherwise we can either miss table creation or populate
more than once.

v2:
- Add test case
Message-Id: <20200113140524.14890-1-calle@scylladb.com>
2020-01-14 16:35:41 +01:00
Avi Kivity
2138657d3a Update seastar submodule
* seastar 36cf5c5ff0...3f3e117de3 (16):
  > memcached: don't use C++17-only std::optional
  > reactor: Comment why _backend is assigned in constructor body
  > log: restore --log-to-stdout for backward compatibility
  > used_size.hh: Include missing headers
  > core: Move some code from reactor.cc to future.cc
  > future-util: move parallel_for_each to future-util.cc
  > task: stop wrapping tasks with unique_ptr
  > Merge "Setup timer signal handler in backend constructor" from Pavel
Fixes #5524
  > future: avoid a branch in future's move constructor if type is trivial
  > utils: Expose used_size
  > stream: Call get_future early
  > future-util: Move parallel_for_each_state code to a .cc
  > memcached: log exceptions
  > stream: Delete dead code
  > core: Turn pollable_fd into a simple proxy over pollable_fd_state.
  > Merge "log to std::cerr" from Benny
2020-01-14 16:56:25 +02:00
Pavel Emelyanov
e1ed8f3f7e storage_service: Remove _shadow_token_metadata
This is the part of de-bloating storage_service.

The field in question is used to temporarily keep the _token_metadata
value during shard-wide replication. There's no need to have it as
class member, any "local" copy is enough.

Also, as the size of token_metadata is huge, and invoke_on_all()
copies the function for each shard, keep one local copy of metadata
using do_with() and pass it into the invoke_on_all() by reference.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Reviewed-by:  Asias He <asias@scylladb.com>
Message-Id: <20200113171657.10246-1-xemul@scylladb.com>
2020-01-14 16:29:10 +02:00
Rafael Ávila de Espíndola
054f5761a7 types: Refactor code into a serialize_varint helper
This is a bit cleaner and avoids a boost::multiprecision::cpp_int copy
while serializing a decimal.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200110221422.35807-1-espindola@scylladb.com>
2020-01-14 16:28:27 +02:00
Avi Kivity
6c84dd0045 cql3: update_statement: do not set query option always_return_static_content for list read-before-write
The query option always_return_static_content was added for lightweight
transactions in commits e0b31dd273 (infrastructure) and 65b86d155e
(actual use). However, the flag was added unconditionally to
update_parameters::options. This caused it to be set for list
read-modify-write operations, not just for lightweight transactions.
This is a little wasteful, and worse, it breaks compatibility as old
nodes do not understand the always_return_static_content flag and
complain when they see it.

To fix, remove the always_return_static_content from
update_parameters::options and only set it from compare-and-swap
operations that are used to implement lightweight transactions.

Fixes #5593.

Reviewed-by: Gleb Natapov <gleb@scylladb.com>
Message-Id: <20200114135133.2338238-1-avi@scylladb.com>
2020-01-14 16:15:20 +02:00
Hagit Segev
ef88e1e822 CentOS RPMs: Remove target to enable general centos. 2020-01-14 14:31:03 +02:00
Alejo Sanchez
6909d4db42 cql3: BYPASS CACHE query counter
This patch is the first part of requested full scan metrics.
It implements a counter of SELECT queries with BYPASS CACHE option.

In scope of #5209

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Message-Id: <20200113222740.506610-2-alejo.sanchez@scylladb.com>
2020-01-14 12:19:00 +02:00
Rafael Ávila de Espíndola
dca1bc480f everywhere: Use serialized(foo) instead of data_value(foo).serialize()
This is just a simple cleanup that reduces the size of another patch I
am working on and is an independent improvement.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200114051739.370127-1-espindola@scylladb.com>
2020-01-14 12:17:12 +02:00
Pavel Emelyanov
b9f28e9335 storage_service: Remove dead drain branch
The drain_in_progress variable here is the future that's set by the
drain() operation itself. Its promise is set when the drain() finishes.

The check for this future in the beginning of drain() is pointless.
No two drain()-s can run in parallel because of run_with_api_lock()
protection. Doing the 2nd drain after a successful 1st one is also
impossible due to the _operation_mode check. The 2nd drain after an
_exceptioned_ (and thus incomplete) 1st one would deadlock; after
this patch it will try to drain for the 2nd time, but that should be ok.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200114094724.23876-1-xemul@scylladb.com>
2020-01-14 12:07:29 +02:00
Piotr Sarna
36ec43a262 Merge "add table with connected cql clients" from Juliusz
This change introduces system.clients table, which provides
information about CQL clients connected.

PK is the client's IP address, CK consists of outgoing port number
and client_type (which will be extended in future to thrift/alternator/redis).
The table also supplies shard_id and username. Other columns,
like connection_stage, driver_name, driver_version...,
are currently empty but exist for C* compatibility and future use.

This is an ordinary table (i.e. non-virtual) and it's updated upon
accepting connections. This is also why C*'s column request_count
was not introduced. In case of abrupt DB stop, the table should not persist,
so it's being truncated on startup.

Resolves #4820
2020-01-14 10:01:07 +02:00
Avi Kivity
1f46133273 Merge "data: make cell::make_collection() exception safe" from Botond
"
Most of the code in `cell` and the `imr` infrastructure it is built on
is `noexcept`. This means that extra care must be taken to avoid rogue
exceptions as they will bring down the node. The changes introduced by
0a453e5d3a did just that - introduced rogue `std::bad_alloc` into this
code path by violating an undocumented and unvalidated assumption --
that fragment ranges passed to `cell::make_collection()` are nothrow
copyable and movable.

This series refactors `cell::make_collection()` such that it does not
have this assumption anymore and is safe to use with any range.

Note that the unit test included in this series, that was used to find
all the possible exception sources will not be currently run in any of
our build modes, due to `SEASTAR_ENABLE_ALLOC_FAILURE_INJECTION` not
being set. I plan to address this in a followup because setting this
flag fails other tests using the failure injection mechanism. This is
because these tests are normally run with the failure injection disabled
so failures managed to lurk in without anyone noticing.

Fixes: #5575
Refs: #5341

Tests: unit(dev, debug)
"

* 'data-cell-make-collection-exception-safety/v2' of https://github.com/denesb/scylla:
  test: mutation_test: add exception safety test for large collection serialization
  data/cell.hh: avoid accidental copies of non-nothrow copiable ranges
  utils/fragment_range.hh: introduce fragment_range_view
2020-01-14 10:01:06 +02:00
Nadav Har'El
5b08ec3d2c alternator: error on unsupported ScanIndexForward=false
We do not yet support the ScanIndexForward=false option for reversing
the sort order of a Query operation, as reported in issue #5153.
But even before implementing this feature, it is important that we
produce an error if a user attempts to use it - instead of outright
ignoring this parameter and giving the user wrong results. This is
what this patch does.

Before this patch, the reverse-order query in the xfailing test
test_query.py::test_query_reverse seems to succeed - yet gives
results in the wrong order. With this patch, the query itself fails -
stating that the ScanIndexForward=false argument is not supported.

Refs #5153

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200105113719.26326-1-nyh@scylladb.com>
2020-01-14 10:01:06 +02:00
Pavel Emelyanov
c4bf532d37 storage_service: Fix race in removenode/force_removenode/other
Here's another theoretical problem, that involves 3 sequential calls
to respectively removenode, force_removenode and some other operation.
Let's walk through them

First goes the removenode:
  run_with_api_lock
    _operation_in_progress = "removenode"
    storage_service::remove_node
      sleep in replicating_nodes.empty() loop

Now the force_removenode can run:

  run_with_no_api_lock
    storage_service::force_removenode
      check _operation_in_progress (not empty)
      _force_remove_completion = true
      sleep in _operation_in_progress.empty loop

Now the 1st call wakes up and:

    if _force_remove_completion == true
      throw <some exception>
  .finally() handler in run_with_api_lock
    _operation_in_progress = <empty>

At this point some other operation may start. Say, drain:

  run_with_api_lock
    _operation_in_progress = "drain"
    storage_service::drain
      ...
      go to sleep somewhere

Now let's go back to the 1st op that wakes up from its sleep.
The code it executes is

    while (!ss._operation_in_progress.empty()) {
        sleep_abortable()
    }

and while the drain is running it will never exit.

However (! and this is the core of the race) should the drain
operation happen _before_ the force_removenode, another check
for _operation_in_progress would have made the latter exit with
the "Operation drain is in progress, try again" message.

Fix this inconsistency by making the check for the current operation
on every wake-up from the sleep_abortable.

Fixes #5591

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-01-14 10:01:06 +02:00
Pavel Emelyanov
cc92683894 storage_service: Fix race and deadlock in removenode/force_removenode
Here's a theoretical problem, that involves 3 sequential calls
to respectively removenode, force_removenode and removenode (again)
operations. Let's walk through them

First goes the removenode:
  run_with_api_lock
    _operation_in_progress = "removenode"
    storage_service::remove_node
      sleep in replicating_nodes.empty() loop

Now the force_removenode can run:

  run_with_no_api_lock
    storage_service::force_removenode
      check _operation_in_progress (not empty)
      _force_remove_completion = true
      sleep in _operation_in_progress.empty loop

Now the 1st call wakes up and:

    if _force_remove_completion == true
      _force_remove_completion = false
      throw <some exception>
  .finally() handler in run_with_api_lock
    _operation_in_progress = <empty>

! at this point we have _force_remove_completion = false and
_operation_in_progress = <empty>, which opens the following
opportunity for the 3rd removenode:

  run_with_api_lock
    _operation_in_progress = "removenode"
    storage_service::remove_node
      sleep in replicating_nodes.empty() loop

Now here's what we have in 2nd and 3rd ops:

1. _operation_in_progress = "removenode" (set by 3rd) prevents the
   force_removenode from exiting its loop
2. _force_remove_completion = false (set by 1st on exit) prevents
   the removenode from waiting on replicating_nodes list

One can start the 4th call with force_removenode, it will proceed and
wake up the 3rd op, but after it we'll have two force_removenode-s
running in parallel and killing each other.

I propose not to set _force_remove_completion to false in removenode,
but just exit and let the owner of this flag unset it once it gets
the control back.

Fixes #5590

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-01-14 10:01:06 +02:00
Benny Halevy
ff55b5dca3 cql3: functions: limit sum overflow detection to integral types
Other types do not have a wider accumulator at the moment.
And static_cast<accumulator_type>(ret) != _sum evaluates as
false for NaN/Inf floating point values.

Fixes #5586

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200112183436.77951-1-bhalevy@scylladb.com>
2020-01-14 10:01:06 +02:00
Avi Kivity
e3310201dd atomic_cell_or_collection: type-aware print atomic_cell or collection components
Now that atomic_cell_view and collection_mutation_view have
type-aware printers, we can use them in the type-aware atomic_cell_or_collection
printer.
Message-Id: <20191231142832.594960-1-avi@scylladb.com>
2020-01-14 10:01:06 +02:00
Avi Kivity
931b196d20 mutation_partition: row: resolve column name when in schema-aware printer
Instead of printing the column id, print the full column name.
Message-Id: <20191231142944.595272-1-avi@scylladb.com>
2020-01-14 10:01:06 +02:00
Nadav Har'El
4aa323154e merge: Pretty print canonical_mutation objects
Merged pull request https://github.com/scylladb/scylla/pull/5533
from Avi Kivity:

canonical_mutation objects are used for schema reconciliation, which is a
fragile area and thus deserves some debugging help.

This series makes canonical_mutation objects printable.
2020-01-14 10:01:06 +02:00
Takuya ASADA
5241deda2d dist: nonroot: fix CLI tool path for nonroot (#5584)
The CLI tool path is hardcoded; the correct path needs to be specified for nonroot.
2020-01-14 10:01:06 +02:00
Nadav Har'El
1511b945f8 merge: Handle multiple regular base columns in view pk
Merged patch series from Piotr Sarna:

"Previous assumption was that there can only be one regular base column
in the view key. The assumption is still correct for tables created
via CQL, but it's internally possible to create a view with multiple
such columns - the new assumption is that if there are multiple columns,
they share their liveness.

This series is vital for indexing to work properly on alternator,
so it would be best to solve the issue upstream. I strived to leave
the existing semantics intact as long as only up to one regular
column is part of the materialized view primary key, which is the case
for Scylla's materialized views. For alternator it may not be true,
but all regular columns in alternator share liveness info (since
alternator does not support per-column TTL), which is sufficient
to compute view updates in a consistent way.

Fixes #5006
Tests: unit(dev), alternator(test_gsi_update_second_regular_base_column, tic-tac-toe demo)"

Piotr Sarna (3):
  db,view: fix checking if partition key is empty
  view: handle multiple regular base columns in view pk
  test: add a case for multiple base regular columns in view key

 alternator-test/test_gsi.py              |  1 -
 view_info.hh                             |  5 +-
 cql3/statements/alter_table_statement.cc |  2 +-
 db/view/view.cc                          | 77 ++++++++++++++----------
 mutation_partition.cc                    |  2 +-
 test/boost/cql_query_test.cc             | 58 ++++++++++++++++++
 6 files changed, 109 insertions(+), 36 deletions(-)
2020-01-14 10:01:00 +02:00
Nadav Har'El
f16e3b0491 merge: bouncing lwt request to an owning shard
Merged patch series from Gleb Natapov:

"LWT is much more efficient if a request is processed on a shard that owns
a token for the request. This is because otherwise the processing will
bounce to an owning shard multiple times. The patch proposes a way to
move the request to the correct shard before running lwt. It works by returning
an error from the lwt code if the shard is the incorrect one, specifying the shard
the request should be moved to. The error is processed by the transport
code, which jumps to the correct shard and re-processes the incoming message there.

The nicer way to achieve the same would be to jump to a right shard
inside of the storage_proxy::cas(), but unfortunately with current
implementation of the modification statements they are unusable by
a shard different from where it was created, so the jump should happen
before a modification statement for a cas() is created. When we fix our
cql code to be more cross-shard friendly this can be reworked to do the
jump in the storage_proxy."

Gleb Natapov (4):
  transport: change make_result to takes a reference to cql result
    instead of shared_ptr
  storage_service: move start_native_transport into a thread
  lwt: Process lwt request on a owning shard
  lwt: drop invoke_on in paxos_state prepare and accept

 auth/service.hh                           |   5 +-
 message/messaging_service.hh              |   2 +-
 service/client_state.hh                   |  30 +++-
 service/paxos/paxos_state.hh              |  10 +-
 service/query_state.hh                    |   6 +
 service/storage_proxy.hh                  |   2 +
 transport/messages/result_message.hh      |  20 +++
 transport/messages/result_message_base.hh |   4 +
 transport/request.hh                      |   4 +
 transport/server.hh                       |  25 ++-
 cql3/statements/batch_statement.cc        |   6 +
 cql3/statements/modification_statement.cc |   6 +
 cql3/statements/select_statement.cc       |   8 +
 message/messaging_service.cc              |   2 +-
 service/paxos/paxos_state.cc              |  48 ++---
 service/storage_proxy.cc                  |  47 ++++-
 service/storage_service.cc                | 120 +++++++------
 test/boost/cql_query_test.cc              |   1 +
 thrift/handler.cc                         |   3 +
 transport/messages/result_message.cc      |   5 +
 transport/server.cc                       | 203 ++++++++++++++++------
 21 files changed, 377 insertions(+), 180 deletions(-)
2020-01-14 09:59:59 +02:00
Botond Dénes
300728120f test: mutation_test: add exception safety test for large collection serialization
Use `seastar::memory::local_failure_injector()` to inject all possible
`std::bad_alloc`:s into the collection serialization code path. The test
just checks that there are no `std::abort()`:s caused by any of the
exceptions.

The test will not be run if `SEASTAR_ENABLE_ALLOC_FAILURE_INJECTION` is
not defined.
2020-01-13 16:53:35 +02:00
Botond Dénes
3ec889816a data/cell.hh: avoid accidental copies of non-nothrow copiable ranges
`cell::make_collection()` assumes that all ranges passed to it are
nothrow copyable and movable views. This is not guaranteed, is not
expressed in the interface and is not mentioned in the comments either.
The changes introduced by 0a453e5d3a to collection serialization, making
it use fragmented buffers, fell into this trap, as it passes
`bytes_ostream` to `cell::make_collection()`. `bytes_ostream`'s copy
constructor allocates and hence can throw, triggering an
`std::terminate()` inside `cell::make_collection()` as the latter is
noexcept.

To solve this issue, non-nothrow copyable and movable ranges are now
wrapped in a `fragment_range_view` to make them so.
`cell::make_collection()` already requires callers to keep alive the
range for the duration of the call, so this does not introduce any new
requirements to the callers. Additionally, to avoid any future
accidents, do not accept temporaries for the `data` parameter. We don't
ever want to move this param anyway, we will either have a trivially
copyable view, or a potentially heavy-weight range that we will create a
trivially copyable view of.
2020-01-13 16:53:35 +02:00
Botond Dénes
b52b4d36a2 utils/fragment_range.hh: introduce fragment_range_view
A lightweight, trivially copyable and movable view for fragment ranges.
Allows for uniform treatment of all kinds of ranges, i.e. treating all
of them as a view. Currently `fragment_range.hh` provides lightweight,
view-like adaptors for empty and single-fragment ranges (`bytes_view`). To
allow code to treat owning multi-fragment ranges the same way as the
former two, we need a view for the latter as well -- this is
`fragment_range_view`.
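
As a rough illustration of the idea (not the actual `fragment_range_view` from utils/fragment_range.hh), a view can hold just a pointer to the owning range, which makes it trivially copyable no matter how expensive the range itself is to copy. The name `range_view` below is hypothetical, and the deleted rvalue constructor is one assumed way to reject temporaries.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical non-owning view over any range with begin()/end().
// The caller must keep the underlying range alive while the view is in use.
template <typename Range>
class range_view {
    const Range* _range;
public:
    explicit range_view(const Range& r) noexcept : _range(&r) {}
    // Reject temporaries: a view of a dead range would dangle.
    explicit range_view(const Range&&) = delete;

    auto begin() const { return _range->begin(); }
    auto end() const { return _range->end(); }
};

// Copying the view copies one pointer, so it can never throw,
// even though copying the vector itself may allocate.
static_assert(std::is_trivially_copyable_v<range_view<std::vector<std::string>>>);
static_assert(std::is_nothrow_copy_constructible_v<range_view<std::vector<std::string>>>);
```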
2020-01-13 16:52:59 +02:00
Calle Wilund
75f2b2876b cdc: Remove free function for mutation augmentation 2020-01-13 13:18:55 +00:00
Calle Wilund
3eda3122af cdc: Move mutation augment from cql3::modification_statement to storage proxy
Using the attached service object
2020-01-13 13:18:55 +00:00
Juliusz Stasiewicz
27dfda0b9e main/transport: using the infrastructure of system.clients
Resolves #4820. Execution path in main.cc now cleans up system.clients
table if it exists (this is done on startup). Also, server.cc now calls
functions that notify about cql clients connecting/disconnecting.
2020-01-13 14:07:04 +01:00
Pavel Emelyanov
148da64a7e storage_service: Move load_broadcaster away
This simplifies the storage_service API and fixes the
complaint about shared_ptr usage instead of unique_ptr.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-01-13 13:55:09 +03:00
Pavel Emelyanov
b6e1e6df64 misc_services: Introduce load_meter
There's a lonely get_load_map() call on storage_service that
needs only load broadcaster, always runs on shard 0 and that's it.

Next patch will move this whole stuff into its own helper no-shard
container and this is preparation for this.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-01-13 13:53:08 +03:00
Gleb Natapov
5753ab7195 lwt: drop invoke_on in paxos_state prepare and accept
Since lwt requests are now running on an owning shard there is no longer
a need to invoke cross shard call on paxos_state level. RPC calls may
still arrive to a wrong shard so we need to make cross shard call there.
2020-01-13 10:26:02 +02:00
Gleb Natapov
d28dd4957b lwt: Process lwt request on a owning shard
LWT is much more efficient if a request is processed on a shard that owns
a token for the request. This is because otherwise the processing will
bounce to an owning shard multiple times. The patch proposes a way to
move the request to the correct shard before running lwt. It works by returning
an error from the lwt code if the shard is the incorrect one, specifying the shard
the request should be moved to. The error is processed by the transport code,
which jumps to the correct shard and re-processes the incoming message there.
2020-01-13 10:26:02 +02:00
Piotr Sarna
3853594108 alternator-test: turn off TLS self-signed verification
Two test cases did not ignore TLS self-signed warnings, which are used
locally for testing HTTPS.

Fixes #5557

Tests(test_health, test_authorization)
Message-Id: <8bda759dc1597644c534f94d00853038c2688dd7.1578394444.git.sarna@scylladb.com>
2020-01-10 15:31:30 +02:00
Rafael Ávila de Espíndola
5313828ab8 cql3: Fix indentation
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200109025855.10591-2-espindola@scylladb.com>
2020-01-09 10:42:55 +02:00
Rafael Ávila de Espíndola
4da6dc1a7f cql3: Change a lambda capture order to match another
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200109025855.10591-1-espindola@scylladb.com>
2020-01-09 10:42:49 +02:00
Avi Kivity
6d454d13ac db/schema_tables: make gratuitous generic lambdas in do_merge_schema() concrete
Those gratuitous lambdas make life harder for IDE users by hiding the actual
types from the IDEs.
Message-Id: <20200107154746.1918648-1-avi@scylladb.com>
2020-01-08 17:43:18 +01:00
Avi Kivity
454074f284 Merge "database: Avoid OOMing with flush continuations after failed memtable flush" from Tomasz
"
The original fix (10f6b125c8) didn't
take into account that if there was a failed memtable flush (Refs
flush) but is not a flushable memtable because it's not the latest in
the memtable list. If that happens, it means no other memtable is
flushable as well, because otherwise it would be picked due to
evictable_occupancy(). Therefore the right action is to not flush
anything in this case.

Suspected to be observed in #4982. I didn't manage to reproduce after
triggering a failed memtable flush.

Fixes #3717
"

* tag 'avoid-ooming-with-flush-continuations-v2' of github.com:tgrabiec/scylla:
  database: Avoid OOMing with flush continuations after failed memtable flush
  lsa: Introduce operator bool() to occupancy_stats
  lsa: Expose region_impl::evictable_occupancy in the region class
2020-01-08 16:58:54 +02:00
Gleb Natapov
feed544c5d paxos: fix truncation time checking during learn stage
The comparison is done in milliseconds, not microseconds.

Fixes #5566

Message-Id: <20200108094927.GN9084@scylladb.com>
2020-01-08 14:37:07 +01:00
Gleb Natapov
2832f1d9eb storage_service: move start_native_transport into a thread
The code runs only once and it is simple if it runs in a seastar thread.
2020-01-08 14:57:57 +02:00
Gleb Natapov
7fb2e8eb9f transport: change make_result to takes a reference to cql result instead of shared_ptr 2020-01-08 14:57:57 +02:00
Avi Kivity
0bde5906b3 Merge "cql3: detect and handle int overflow in aggregate functions #5537" from Benny
"
Fix overflow handling in sum() and avg().

sum:
 - aggregated into __int128
 - detect overflow when computing result and log a warning if found

avg:
 - fix division function to divide the accumulator type _sum (__int128 for integers) by _count

Add unit tests for both cases

Test:
  - manual test against Cassandra 3.11.3 to make sure the results in the scylla unit test agree with it.
  - unit(dev), cql_query_test(debug)

Fixes #5536
"

* 'cql3-sum-overflow' of https://github.com/bhalevy/scylla:
  test: cql_query_test: test avg overflow
  cql3: functions: protect against int overflow in avg
  test: cql_query_test: test sum overflow
  cql3: functions: detect and handle int overflow in sum
  exceptions: sort exception_code definitions
  exceptions: define additional cassandra CQL exceptions codes
2020-01-08 10:39:38 +02:00
Avi Kivity
d649371baa Merge "Fix crash on SELECT SUM(udf(...))" from Rafael
"
We were failing to start a thread when the UDF call was nested in an
aggregate function call like SUM.
"

* 'espindola/fix-sum-of-udf' of https://github.com/espindola/scylla:
  cql3: Fix indentation
  cql3: Add missing with_thread_if_needed call
  cql3: Implement abstract_function_selector::requires_thread
  remove make_ready_future call
2020-01-08 10:25:42 +02:00
Benny Halevy
dafbd88349 query: initialize read_command timestamp to now
This was initialized to api::missing_timestamp but
should be set to either a client provided-timestamp or
the server's.

Unlike write operations, this timestamp need not be unique
as the one generated by client_state::get_timestamp.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200108074021.282339-2-bhalevy@scylladb.com>
2020-01-08 10:19:07 +02:00
Benny Halevy
39325cf297 storage_proxy: fix int overflow in service::abstract_read_executor::execute
exec->_cmd->read_timestamp may be initialized by default to api::min_timestamp,
causing:
  service/storage_proxy.cc:3328:116: runtime error: signed integer overflow: 1577983890961976 - -9223372036854775808 cannot be represented in type 'long int'
  Aborting on shard 1.

Do not optimize cross-dc repair if read_timestamp is missing (or just negative)
We're interested in reads that happen within write_timeout of a write.
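
A minimal sketch of the guard, assuming a hypothetical helper `read_age` rather than the real code in service/storage_proxy.cc: when the read timestamp is missing or negative, the subtraction is skipped entirely, so it cannot overflow.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>

// Hypothetical helper, not the actual storage_proxy code.
// Returns how long ago the read happened relative to `now`, or nullopt
// when either timestamp is negative (e.g. a missing read timestamp).
// With both operands non-negative, `now - read_timestamp` cannot overflow.
inline std::optional<int64_t> read_age(int64_t now, int64_t read_timestamp) {
    if (now < 0 || read_timestamp < 0) {
        return std::nullopt;
    }
    return now - read_timestamp;
}
```

A caller would apply the cross-dc optimization only when a value is returned and it falls within write_timeout.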

Fixes #5556

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200108074021.282339-1-bhalevy@scylladb.com>
2020-01-08 10:18:59 +02:00
Raphael S. Carvalho
390c8b9b37 sstables: Move STCS implementation to source file
A header-only implementation can create a problem with duplicate symbols.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200107154258.9746-1-raphaelsc@scylladb.com>
2020-01-08 09:55:35 +02:00
Benny Halevy
20a0b1a0b6 test: cql_query_test: test avg overflow
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-01-08 09:50:50 +02:00
Benny Halevy
1c81422c1b cql3: functions: protect against int overflow in avg
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-01-08 09:48:33 +02:00
Benny Halevy
9053ef90c7 test: cql_query_test: test sum overflow
Add unit tests for summing up ints and bigints
with possible handling of overflow.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-01-08 09:48:33 +02:00
Benny Halevy
e97a111f64 cql3: functions: detect and handle int overflow in sum
Detect integer overflow in cql sum functions and throw an error.
Note that Cassandra quietly truncates the sum if it doesn't fit
in the input type but we would rather break compatibility in this
case. See https://issues.apache.org/jira/browse/CASSANDRA-4914?focusedCommentId=14158400&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14158400
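
The general technique can be sketched as follows, assuming a compiler that provides the `__int128` extension (GCC and Clang do). `checked_sum` is a hypothetical name, and the real aggregate works incrementally over CQL values rather than over a vector.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <vector>

// Hypothetical helper: accumulate 64-bit values in a 128-bit accumulator
// and throw if the final total does not fit back into int64_t.
inline int64_t checked_sum(const std::vector<int64_t>& values) {
    __int128 acc = 0;  // wide enough that summing int64_t values cannot overflow in practice
    for (int64_t v : values) {
        acc += v;
    }
    if (acc > std::numeric_limits<int64_t>::max() ||
        acc < std::numeric_limits<int64_t>::min()) {
        throw std::overflow_error("sum overflow");
    }
    return static_cast<int64_t>(acc);
}
```

Because the check happens only on the final result, intermediate totals may exceed the input type's range as long as the end result fits.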

Fixes #5536

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-01-08 09:48:33 +02:00
Benny Halevy
98260254df exceptions: sort exception_code definitions
Be compatible with Cassandra source.
It's easier to maintain this way.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-01-08 09:48:21 +02:00
Benny Halevy
30d0f1df75 exceptions: define additional cassandra CQL exceptions codes
As of e9da85723a

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-01-08 09:40:57 +02:00
Rafael Ávila de Espíndola
282228b303 cql3: Fix indentation
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-01-07 22:14:50 -08:00
Rafael Ávila de Espíndola
4316bc2e18 cql3: Add missing with_thread_if_needed call
This fixes an assert when doing sum(udf(...)).

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-01-07 22:14:50 -08:00
Rafael Ávila de Espíndola
d301d31de0 cql3: Implement abstract_function_selector::requires_thread
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-01-07 22:14:24 -08:00
Rafael Ávila de Espíndola
dc9b3b8ff2 remove make_ready_future call
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-01-07 22:10:27 -08:00
Calle Wilund
9f6b22d882 cdc: Assign self to storage proxy object 2020-01-07 12:01:58 +00:00
Calle Wilund
fc5904372b storage_proxy: Add (optional) cdc service object pointer member
The cdc service is assigned from outside, post construction, mainly
because of the chickens and eggs in main startup. Would be nice to
have it unconditionally, but this is workable.
2020-01-07 12:01:58 +00:00
Calle Wilund
d6003253dd storage_proxy: Move mutate_counters to private section
It is (and shall) only be called from inside storage proxy,
and we would like this to be reflected in the interface
so our eventual moving of cdc logic into the mutate call
chains become easier to verify and comprehend.
2020-01-07 12:01:58 +00:00
Calle Wilund
b6c788fccf cdc: Add augmentation call to cdc service
To eventually replace the free function.
Main difference is this is built to both handle batches correctly
and to eventually allow hanging cdc object on storage proxy,
and caches on the cdc object.
2020-01-07 12:01:58 +00:00
Piotr Sarna
04dc8faec9 test: add a case for multiple base regular columns in view key
The test case checks that having two base regular columns
in the materialized view key (not obtainable via CQL),
still works fine when values are inserted or deleted.
If TTL was involved and these columns would have different expiration
rules, the case would be more complicated, but it's not possible
for a user to reach that case - neither with CQL, nor with alternator.
2020-01-07 12:19:06 +01:00
Piotr Sarna
155a47cc55 view: handle multiple regular base columns in view pk
Previous assumption was that there can only be one regular base column
in the view key. The assumption is still correct for tables created
via CQL, but it's internally possible to create a view with multiple
such columns - the new assumption is that if there are multiple columns,
they share their liveness.
This patch is vital for indexing to work properly on alternator,
so it would be best to solve the issue upstream. I strived to leave
the existing semantics intact as long as only up to one regular
column is part of the materialized view primary key, which is the case
for Scylla's materialized views. For alternator it may not be true,
but all regular columns in alternator share liveness info (since
alternator does not support per-column TTL), which is sufficient
to compute view updates in a consistent way.

Fixes #5006

Tests: unit(dev), alternator(test_gsi_update_second_regular_base_column, tic-tac-toe demo)

Message-Id: <c9dec243ce903d3a922ce077dc274f988bcf5d57.1567604945.git.sarna@scylladb.com>
2020-01-07 12:18:39 +01:00
Avi Kivity
6e0a073b2e mutation_partition: use type-aware printing of the clustering row
Now that position_in_partition_view has type-aware printing, use it
to provide a human readable version of clustering keys.
Message-Id: <20191231151315.602559-2-avi@scylladb.com>
2020-01-07 12:17:11 +01:00
Avi Kivity
488c42408a position_in_partition_view: add type-aware printer
If the position_in_partition_view represents a clustering key,
we can now see it with the clustering key decoded according to
the schema.
Message-Id: <20191231151315.602559-1-avi@scylladb.com>
2020-01-07 12:15:09 +01:00
Piotr Sarna
54315f89cd db,view: fix checking if partition key is empty
Previous implementation did not take into account that a column
in a partition key might exist in a mutation, but in a DEAD state
- if it's deleted. There are no regressions for CQL, while for
alternator and its capability of having two regular base columns
in a view key, this additional check must be performed.
2020-01-07 12:05:36 +01:00
Avi Kivity
3a3c20d337 schema_tables: de-templatize diff_table_or_view()
This reduces code bloat and makes the code friendlier for IDEs, as the
IDE now understands the type of create_schema.
Message-Id: <20191231134803.591190-1-avi@scylladb.com>
2020-01-07 11:56:54 +01:00
Avi Kivity
e5e42672f5 sstables: reduce bloat from sstables::write_simple()
sstables::write_simple() has quite a lot of boilerplate
which gets replicated into each template instance. Move
all of that into a non-template do_write_simple(), leaving
only things that truly depend on the component being written
in the template, and encapsulating them with a
noncopyable_function.

An explicit template instantiation was added, since this
is used in a header file. Before, it likely worked by
accident and stopped working when the template became
small enough to inline.

Tests: unit (dev)
Message-Id: <20200106135453.1634311-1-avi@scylladb.com>
2020-01-07 11:56:11 +01:00
Avi Kivity
8f7f56d6a0 schema_tables: make gratuitous generic lambda in create_tables_from_partitions() concrete
The generic lambda made IDE searches for create_table_from_table_row() fail.
Message-Id: <20191231135210.591972-1-avi@scylladb.com>
2020-01-07 11:49:10 +01:00
Avi Kivity
92fd83d3af schema_tables: make gratuitous generic lambda in create_table_from_name() concrete
The lambda made IDE searches for read_table_mutations fail.
Message-Id: <20191231135103.591741-1-avi@scylladb.com>
2020-01-07 11:48:56 +01:00
Avi Kivity
dd6dd97df9 schema_tables: make gratuitous generic lambda in merge_tables_and_views() concrete
The generic lambda made IDE searches for create_table_from_mutations fail.
Message-Id: <20191231135059.591681-1-avi@scylladb.com>
2020-01-07 11:48:39 +01:00
Avi Kivity
c63cf02745 canonical_mutation: add pretty printing
Add type-aware printing of canonical_mutation objects.
2020-01-07 12:06:31 +02:00
Avi Kivity
e093121687 mutation_partition_view: add virtual visitor
mutation_partition_view now supports a compile-time resolved visitor.
This is performant but results in bloat when the performance is not
needed. Furthermore, the template function that applies the object
to the visitor is private and out-of-line, to reduce compile time.

To allow visitation on mutation_partition_view objects, add a virtual
visitor type and a non-template accept function.

Note: mutation_partition_visitor is very similar to the new type,
but different enough to break the template visitor which is used
to implement the new visitor.

The new visitor will be used to implement pretty printing for
canonical_mutation.
2020-01-07 12:06:31 +02:00
Avi Kivity
75d9909b27 collection_mutation_view: add type-aware pretty printer
Add a way for the user to associate a type with a collection_mutation_view
and get a nice printout.
2020-01-07 12:06:29 +02:00
Rafael Ávila de Espíndola
b80852c447 main: Explicitly allow scylla core dumps
I have not looked into the security reason for disabling it when
a program has file capabilities.

Fixes #5560

[avi: remove extraneous semicolon]
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200106231836.99052-1-espindola@scylladb.com>
2020-01-07 11:15:59 +02:00
Rafael Ávila de Espíndola
07f1cb53ea tests: run with ASAN_OPTIONS='disable_coredump=0:abort_on_error=1'
These are the same options we use in seastar.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200107001513.122238-1-espindola@scylladb.com>
2020-01-07 11:11:49 +02:00
Takuya ASADA
238a25a0f4 docker: fix typo of scylla-jmx script path (#5551)
The path should be /opt/scylladb/jmx, not /opt/scylladb/scripts/jmx.

Fixes #5542
2020-01-07 10:54:16 +02:00
Asias He
401854dbaf repair: Avoid duplicated partition_end write
Consider this:

1) Write partition_start of p1
2) Write clustering_row of p1
3) Write partition_end of p1
4) Repair is stopped due to error before writing partition_start of p2
5) Repair calls repair_row_level_stop() to tear down which calls
   wait_for_writer_done(). A duplicate partition_end is written.

To fix, track the partition_start and partition_end written, avoid
unpaired writes.
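
One way to picture the tracking is the sketch below. `paired_partition_writer` is a hypothetical class, not the repair writer itself; it only shows how remembering whether a partition is open makes a second partition_end a no-op.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical writer that records what it emits and remembers whether a
// partition_start is still waiting for its matching partition_end.
class paired_partition_writer {
    bool _partition_open = false;
    std::vector<std::string> _written;
public:
    void write_partition_start(const std::string& key) {
        _written.push_back("partition_start(" + key + ")");
        _partition_open = true;
    }
    void write_clustering_row(const std::string& row) {
        _written.push_back("clustering_row(" + row + ")");
    }
    // Safe to call from teardown: does nothing if no partition is open.
    void write_partition_end() {
        if (!_partition_open) {
            return;
        }
        _written.push_back("partition_end");
        _partition_open = false;
    }
    const std::vector<std::string>& written() const { return _written; }
};
```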

Backports: 3.1 and 3.2
Fixes: #5527
2020-01-06 14:06:02 +02:00
Eliran Sinvani
e64445d7e5 debian-reloc: Propagate PRODUCT variable to renaming command in debian pkg
commit 21dec3881c introduced
a bug that will cause scylla debian build to fail. This is
because the commit relied on the environment PRODUCT variable
to be exported (and as a result, to propagate to the rename
command that is executed by find in a subshell).
This commit fixes it by explicitly passing the PRODUCT variable
into the rename command.

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <20200106102229.24769-1-eliransin@scylladb.com>
2020-01-06 12:31:58 +02:00
Asias He
38d4015619 gossiper: Remove HIBERNATE status from dead state
In scylla, the replacing node is set as HIBERNATE status. It is the only
place we use HIBERNATE status. The replacing node is supposed to be
alive and updating its heartbeat, so it is not supposed to be in dead
state.

This patch fixes the following problem in replacing.

   1) start n1, n2
   2) n2 is down
   3) start n3 to replace n2, but kill n3 in the middle of the replace
   4) start n4 to replace n2

After step 3 and step 4, the old n3 will stay in gossip forever until a
full cluster shutdown. Note n3 will stay only in gossip, not in the
system.peers table. Users will see annoying, endless log messages like
this on all the nodes:

   rpc - client $ip_of_n3:7000: fail to connect: Connection refused

Fixes: #5449
Tests: replace_address_test.py + manual test
2020-01-06 11:47:31 +02:00
Amos Kong
c5ec1e3ddc scylla_ntp_setup: check redhat variant version by parse_version (#5434)
VERSION_ID of centos7 is "7", but VERSION_ID of oel7.7 is "7.7"
scylla_ntp_setup fails on OEL7.7 with a ValueError:

- ValueError: invalid literal for int() with base 10: '7.7'

This patch changes redhat_version() to return the version string, and compares
it with parse_version().

Fixes #5433

Signed-off-by: Amos Kong <amos@scylladb.com>
2020-01-06 11:43:14 +02:00
Asias He
145fd0313a streaming: Fix map access in stream_manager::get_progress
When the progress is queried, e.g., by nodetool netstats,
the progress info might not be updated yet.

Fix it by checking before accessing the map to avoid errors like:

std::out_of_range (_Map_base::at)
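
The difference between `at()` and a checked lookup can be sketched as follows. `progress_info` and `get_progress_or_default` are hypothetical names; the real stream_manager keeps richer per-peer state.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical per-peer progress record.
struct progress_info {
    uint64_t bytes_transferred = 0;
};

// Look the peer up with find() instead of at(), so a peer whose progress
// has not been recorded yet yields an empty record rather than throwing
// std::out_of_range.
inline progress_info get_progress_or_default(
        const std::map<std::string, progress_info>& progress,
        const std::string& peer) {
    auto it = progress.find(peer);
    if (it == progress.end()) {
        return progress_info{};
    }
    return it->second;
}
```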

Fixes: #5437
Tests: nodetool_additional_test.py:TestNodetool.netstats_test
2020-01-06 10:31:15 +02:00
Rafael Ávila de Espíndola
98cd8eddeb tests: Run with halt_on_error=1:abort_on_error=1
This depends on the just emailed fixes to undefined behavior in
tests. With this change we should quickly notice if a change
introduces undefined behavior.

Fixes #4054

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>

Message-Id: <20191230222646.89628-1-espindola@scylladb.com>
2020-01-05 17:20:31 +02:00
Rafael Ávila de Espíndola
dc5ecc9630 enum_option_test: Add explicit underlying types to enums
We expect to be able to create variables with out of range values, so
these enums need explicit underlying types.
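
For illustration, with a fixed underlying type every value of that type is a valid value of the enumeration, so a round trip through an out-of-range value is well defined. The enum and helpers below are hypothetical and are not taken from enum_option_test.

```cpp
#include <cassert>

// Hypothetical enum with an explicit underlying type. Without ": int",
// converting a value outside the range of the enumerators is undefined
// behavior, which the sanitizer would flag.
enum class fruit : int { apple = 0, banana = 1 };

inline fruit from_raw(int raw) {
    return static_cast<fruit>(raw);  // well defined for any int
}

inline int to_raw(fruit f) {
    return static_cast<int>(f);
}
```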

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200102173422.68704-1-espindola@scylladb.com>
2020-01-05 17:20:31 +02:00
Nadav Har'El
f0d8dd4094 merge: CDC rolling upgrade
Merged pull request https://github.com/scylladb/scylla/pull/5538 from
Avi Kivity and Piotr Jastrzębski.

This series prepares CDC for rolling upgrade. This consists of
reducing the footprint of cdc, when disabled, on the schema, adding
a cluster feature, and redacting the cdc column when transferring
it to other nodes. The latter is needed because we'll want to backport
this to 3.2, which doesn't have canonical_mutations yet.
2020-01-05 17:13:12 +02:00
Gleb Natapov
720c0aa285 commitlog: update last sync timestamp when cycle a buffer
If the in-memory buffer does not have enough space for an incoming mutation, it is
written into a file, but the code missed updating the timestamp of the last
sync, so we may sync too often.
Message-Id: <20200102155049.21291-9-gleb@scylladb.com>
2020-01-05 16:13:59 +02:00
Gleb Natapov
14746e4218 commitlog: drop segment gate
The code that enters the gate never defers before leaving, so the gate
behaves like a flag. Let's use the existing flag to prohibit adding data to a
closed segment.
Message-Id: <20200102155049.21291-8-gleb@scylladb.com>
2020-01-05 16:13:59 +02:00
Gleb Natapov
f8c8a5bd1f test: fix error reporting in commitlog_test
Message-Id: <20200102155049.21291-7-gleb@scylladb.com>
2020-01-05 16:13:58 +02:00
Gleb Natapov
680330ae70 commitlog: introduce segment::close() function.
Currently segment closing code is spread over several functions and
activated based on the _closed flag. Make segment closing explicit
by moving all the code into close() function and call it where _closed
flag is set.
Message-Id: <20200102155049.21291-6-gleb@scylladb.com>
2020-01-05 16:13:55 +02:00
Gleb Natapov
a1ae08bb63 commitlog: remove unused segment::flush() parameter
Message-Id: <20200102155049.21291-5-gleb@scylladb.com>
2020-01-05 16:13:55 +02:00
Gleb Natapov
1e15e1ef44 commitlog: cleanup segment sync()
Call cycle() only once.
Message-Id: <20200102155049.21291-4-gleb@scylladb.com>
2020-01-05 16:13:54 +02:00
Gleb Natapov
3d3d2c572e commitlog: move segment shutdown code from sync()
Currently sync() does two completely different things based on the
shutdown parameter. Separate the code into two different functions.
Message-Id: <20200102155049.21291-3-gleb@scylladb.com>
2020-01-05 16:13:54 +02:00
Gleb Natapov
89afb92b28 commitlog: drop superfluous this
Message-Id: <20200102155049.21291-2-gleb@scylladb.com>
2020-01-05 16:13:53 +02:00
Piotr Jastrzebski
95feeece0b scylla_tables: treat empty cdc props as disabled
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-01-05 14:39:23 +02:00
Piotr Jastrzebski
396e35bf20 cdc: add schema_change test for cdc_options
The original "test_schema_digest_does_not_change" test case ensures
that schema digests will match for older nodes that do not support
all the features yet (including computed columns).
The additional case uses sstables generated after CDC was enabled
and a table with CDC enabled is created,
in order to make sure that the digest computed
including CDC column does not change spuriously as well.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-01-05 14:39:23 +02:00
Piotr Jastrzebski
c08e6985cd cdc: allow cluster rolling upgrade
Addition of the cdc column in scylla_tables changes how schema
digests are calculated, and affects the ABI of schema update
messages (adding a column changes other columns' indexes
in frozen_mutation).

To fix this, extend the schema_tables mechanism with support
for the cdc column, and adjust schemas and mutations to remove
that column when sending schemas during upgrade.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-01-05 14:39:23 +02:00
Piotr Jastrzebski
caa0a4e154 tests: disable CDC in schema_change_tests
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-01-05 14:39:23 +02:00
Piotr Jastrzebski
129af99b94 cdc: Return reference from cluster_supports_cdc
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-01-05 14:39:23 +02:00
Piotr Jastrzebski
4639989964 cdc: Add CDC_OPTIONS schema_feature
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-01-05 14:39:23 +02:00
Avi Kivity
c150f2e5d7 schema_tables, cdc: don't store empty cdc columns in scylla_tables
An empty cdc column in scylla_tables is hashed differently from
a missing column. This causes schema mismatch when a schema is
propagated to another node, because the other node will redact
the schema column completely if the cluster feature isn't enabled,
and an empty value is hashed differently from a missing value.

Store a tombstone instead. Tombstones are removed before
digesting, so they don't affect the outcome.
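
A minimal Python sketch of the effect described above, assuming a toy digest
over (column name, value) pairs. The function digest_row and the use of None
as a tombstone marker are illustrative assumptions, not the actual schema
digest code:

```python
import hashlib

def digest_row(columns):
    """Toy digest: hash (name, value) pairs in sorted order.

    A value of None stands in for a tombstone and is dropped
    before digesting (an assumption made for this sketch).
    """
    h = hashlib.md5()
    for name in sorted(columns):
        value = columns[name]
        if value is None:  # tombstone: removed before digesting
            continue
        h.update(name.encode())
        h.update(b"\x00")
        h.update(value)
        h.update(b"\x01")
    return h.hexdigest()

missing = {"comment": b"t"}                  # node that redacted the column
empty = {"comment": b"t", "cdc": b""}        # empty value stored
tombstone = {"comment": b"t", "cdc": None}   # tombstone stored instead

empty_differs = digest_row(empty) != digest_row(missing)
tombstone_matches = digest_row(tombstone) == digest_row(missing)
```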

This change also undoes the changes in 386221da84 ("schema_tables:
 handle 'cdc' options") to schema_change_test
test_merging_does_not_alter_tables_which_didnt_change. That change
enshrined the breakage into the test, instead of fixing the root cause,
which was that we added an extra mutation to the schema (for
cdc options, which were disabled).
2020-01-05 14:36:18 +02:00
Rafael Ávila de Espíndola
3d641d4062 lua: Use existing cpp_int cast logic
Different versions of boost have different rules for what conversions
from cpp_int to smaller integers are allowed.

We already had a function that worked with all supported versions, but
it was not being used by lua.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200104041028.215153-1-espindola@scylladb.com>
2020-01-05 12:10:54 +02:00
Rafael Ávila de Espíndola
88b5aadb05 tests: cql_test_env: wait for two futures starting internal services
I noticed this while looking at the crashes next is currently
experiencing.

While I have no idea if this fixes the issue, it does avoid broken
future warnings (for no_sharded_instance_exception) in a debug build.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200103201540.65324-1-espindola@scylladb.com>
2020-01-05 12:09:59 +02:00
Avi Kivity
4b8e2f5003 Update seastar submodule
* seastar 0525bbb08...36cf5c5ff (6):
  > memcached: Fix use after free in shutdown
  > Revert "task: stop wrapping tasks with unique_ptr"
  > task: stop wrapping tasks with unique_ptr
  > http: Change exception formating to the generic seastar one
  > Merge "Avoid a few calls to ~exception_ptr" from Rafael
  > tests: fix core generation with asan
2020-01-03 15:48:53 +02:00
Nadav Har'El
44c2a44b54 alternator-test: test for ConditionExpression feature
This patch adds a very comprehensive test for the ConditionExpression
feature, i.e., the newer syntax of conditional writes replacing
the old-style "Expected" - for the UpdateItem, PutItem and DeleteItem
operations.

I wrote these tests while closely following the DynamoDB ConditionExpression
documentation, and attempted to cover all conceivable features, subfeatures
and subcases of the ConditionExpression syntax - to serve as a test for a
future support for this feature in Alternator (see issue #5053).

As usual, all these tests pass on AWS DynamoDB, but because we haven't yet
implemented this feature in Alternator, all but one xfail on Alternator.

Refs #5053.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20191229143556.24002-1-nyh@scylladb.com>
2020-01-03 15:48:20 +02:00
Nadav Har'El
aad5eeab51 alternator: better error messages when Alternator port is taken
If Alternator is requested to be enabled on a specific port but the port is
already taken, the boot fails as expected - but the error log is confusing;
It currently looks something like this:

WARN  2019-12-24 11:22:57,303 [shard 0] alternator-server - Failed to set up Alternator HTTP server on 0.0.0.0 port 8000, TLS port 8043: std::system_error (error system:98, posix_listen failed for address 0.0.0.0:8000: Address already in use)
... (many more messages about the server shutting down)
INFO  2019-12-24 11:22:58,008 [shard 0] init - Startup failed: std::system_error (error system:98, posix_listen failed for address 0.0.0.0:8000: Address already in use)

There are two problems here. First, the "WARN" should really be an "ERROR",
because it causes the server to be shut down and the user must see this error.
Second, the final line in the log, something the user is likely to see first,
contains only the ultimate cause for the exception (an address already in use)
but not the information what this address was needed for.

This patch solves both issues, and the log now looks like:

ERROR 2019-12-24 14:00:54,496 [shard 0] alternator-server - Failed to set up Alternator HTTP server on 0.0.0.0 port 8000, TLS port 8043: std::system_error (error system:98, posix_listen failed for address 0.0.0.0:8000: Address already in use)
...
INFO  2019-12-24 14:00:55,056 [shard 0] init - Startup failed: std::_Nested_exception<std::runtime_error> (Failed to set up Alternator HTTP server on 0.0.0.0 port 8000, TLS port 8043): std::system_error (error system:98, posix_listen failed for address 0.0.0.0:8000: Address already in use)

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20191224124127.7093-1-nyh@scylladb.com>
2020-01-03 15:48:20 +02:00
Nadav Har'El
1f64a3bbc9 alternator: error on unsupported ReturnValues option
We don't yet support the ReturnValues option on PutItem, UpdateItem or
DeleteItem operations (see issue #5053), but if a user tries to use such
an option anyway, we silently ignore this option. It's better to fail,
reporting the unsupported option.

In this patch we check the ReturnValues option and if it is anything but
the supported default ("NONE"), we report an error.

Also added a test to confirm this fix. The test verifies that "NONE" is
allowed, and something which is unsupported (e.g., "DOG") is not ignored
but rather causes an error.
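
The check described above can be sketched in Python as follows. The names
check_return_values and ValidationException are hypothetical, chosen for
illustration only, and the request is modeled as a plain dict:

```python
class ValidationException(Exception):
    """Hypothetical error type for rejected request parameters."""

def check_return_values(request):
    """Reject any ReturnValues setting other than the default "NONE"."""
    value = request.get("ReturnValues", "NONE")
    if value != "NONE":
        raise ValidationException(
            "Unsupported ReturnValues={} (only NONE is supported)".format(value))

def is_accepted(request):
    """Helper for the sketch: True if the request passes the check."""
    try:
        check_return_values(request)
        return True
    except ValidationException:
        return False
```

With this sketch, a request without ReturnValues or with "NONE" is accepted,
while any other value such as "DOG" is reported as an error instead of being
silently ignored.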

Refs #5053.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20191216193310.20060-1-nyh@scylladb.com>
2020-01-03 15:48:20 +02:00
Rafael Ávila de Espíndola
dc93228b66 reloc: Turn the default flags into common flags
These are flags we always want to enable. In particular, we want them
to be used by the bots, but the bots run this script with
--configure-flags, so they were being discarded.

We put the user option later so that they can override the common
options.

Fixes #5505

Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Reviewed-by: Takuya ASADA <syuu@scylladb.com>
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-01-03 15:48:20 +02:00
Rafael Ávila de Espíndola
d4dfb6ff84 build-id: Handle the binary having multiple PT_NOTE headers
There is no requirement that all notes be placed in a single
PT_NOTE. It looks like recent lld's actually put each section in its
own PT_NOTE.

This change looks for build-id in all PT_NOTE headers.

Fixes #5525

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20191227000311.421843-1-espindola@scylladb.com>
2020-01-03 15:48:20 +02:00
Avi Kivity
1e9237d814 dist: redhat: use parallel compression for rpm payload
rpm compression uses xz, which is painfully slow. Adjust the
compression settings to run on all threads.

The xz utility documentation suggests that 0 threads is
equivalent to all CPUs, but apparently the library interface
(which rpmbuild uses) doesn't think the same way.

Message-Id: <20200101141544.1054176-1-avi@scylladb.com>
2020-01-03 15:48:20 +02:00
Nadav Har'El
de1171181c user defined types: fix support for case-sensitive type names
In the current code, support for case-sensitive (quoted) user-defined type
names is broken. For example, a test doing:

    CREATE TYPE "PHone" (country_code int, number text)
    CREATE TABLE cf (pk blob, pn "PHone", PRIMARY KEY (pk))

Fails - the first line creates the type with the case-sensitive name PHone,
but the second line wrongly ends up looking for the lowercased name phone,
and fails with an exception "Unknown type ks.phone".

The problem is in cql3_type_name_impl. This class is used to convert a
type object into its proper CQL syntax - for example frozen<list<int>>.
The problem is that for a user-defined type, we forgot to quote its name
if not lowercase, and the result is wrong CQL; For example, a list of
PHone will be written as list<PHone> - but this is wrong because the CQL
parser, when it sees this expression, lowercases the unquoted type name
PHone and it becomes just phone. It should be list<"PHone">, not list<PHone>.

The solution is for cql3_type_name_impl to use for a user-defined type
its get_name_as_cql_string() method instead of get_name_as_string().

get_name_as_cql_string() is a new method which prints the name of the
user type as it should be in a CQL expression, i.e., quoted if necessary.
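
As a rough Python illustration of "quoted if necessary" (not the actual
method; the exact rule used here, lowercase letters, digits and underscores
starting with a letter, is an assumption, and reserved keywords are ignored):

```python
import re

# Assumed rule: a name survives the parser unquoted only if it is
# already lowercase alphanumeric/underscore and starts with a letter.
_UNQUOTED_OK = re.compile(r"[a-z][a-z0-9_]*\Z")

def name_as_cql_string(name):
    """Return the name as it should appear in a CQL type expression."""
    if _UNQUOTED_OK.match(name):
        return name
    # Quote the name, doubling any embedded double quotes.
    return '"' + name.replace('"', '""') + '"'

def frozen_list_of(type_name):
    """Build a frozen<list<...>> expression around a user type name."""
    return "frozen<list<{}>>".format(name_as_cql_string(type_name))
```

Here name_as_cql_string("PHone") keeps the quotes, so the expression can be
parsed back into the same case-sensitive type name.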

The bug in the above test was apparently caused when our code serialized
the type name to disk as the string PHone (without any quoting), and then
later deserialized it using the CQL type parser, which converted it into
a lowercase phone. With this patch, the type's name is serialized as
"PHone", with the quotes, and deserialized properly as the type PHone.
While the extra quotes may seem excessive, they are necessary for the
correct CQL type expression - remember that the type expression may be
significantly more complex, e.g., frozen<list<"PHone">> and all of this,
including the quotes, is necessary for our parser to be able to translate
this string back into a type object.

This patch may cause breakage to existing databases which used case-
sensitive user-defined types, but I argue that these use cases were
already broken (as demonstrated by this test) so we won't break anything
that actually worked before.

Fixes #5544

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200101160805.15847-1-nyh@scylladb.com>
2020-01-03 15:48:20 +02:00
Pavel Emelyanov
34f8762c4d storage_service: Drop _update_jobs
This field is write-only.
Leftover from 83ffae1 (storage_service: Drop block_until_update_pending_ranges_finished)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20191226091210.20966-1-xemul@scylladb.com>
2020-01-03 15:48:20 +02:00
Pavel Emelyanov
f2b20e7083 cache_hitrate_calculator: Do not reinvent the peering_sharded_service
The class in question wants to run its own instances on different
shards, for this sake it keeps reference on sharded self to call
invoke_on() on. There's a handy peering_sharded_service<> in seastar
for the same, using it makes the code nicer and shorter.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20191226112401.23960-1-xemul@scylladb.com>
2020-01-03 15:48:19 +02:00
Rafael Ávila de Espíndola
bbed9cac35 cql3: move function creation to a .cc file
We had a lot of code in a .hh file that, while using templates, was
only used for creating functions during startup.

This moves it to a new .cc file.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200101002158.246736-1-espindola@scylladb.com>
2020-01-03 15:48:19 +02:00
Benny Halevy
c0883407fe scripts: Add cpp-name-format: pretty printer
Pretty-print cpp-names, useful for deciphering complex backtraces.

For example, the following line:
    service::storage_proxy::init_messaging_service()::{lambda(seastar::rpc::client_info const&, seastar::rpc::opt_time_point, std::vector<frozen_mutation, std::allocator<frozen_mutation> >, db::consistency_level, std::optional<tracing::trace_info>)#1}::operator()(seastar::rpc::client_info const&, seastar::rpc::opt_time_point, std::vector<frozen_mutation, std::allocator<frozen_mutation> >, db::consistency_level, std::optional<tracing::trace_info>) const at /local/home/bhalevy/dev/scylla/service/storage_proxy.cc:4360

Is formatted as:
    service::storage_proxy::init_messaging_service()::{
      lambda(
        seastar::rpc::client_info const&,
        seastar::rpc::opt_time_point,
        std::vector<
          frozen_mutation,
          std::allocator<frozen_mutation>
        >,
        db::consistency_level,
        std::optional<tracing::trace_info>
      )#1
    }::operator()(
      seastar::rpc::client_info const&,
      seastar::rpc::opt_time_point,
      std::vector<
        frozen_mutation,
        std::allocator<frozen_mutation>
      >,
      db::consistency_level,
      std::optional<tracing::trace_info>
    ) const at /local/home/bhalevy/dev/scylla/service/storage_proxy.cc:4360
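
A simplified Python sketch of this kind of formatting, not the script added
by this commit. It assumes balanced brackets, expands a bracket pair only
when its contents contain a comma, and does not handle names such as
operator< or operator->; format_cpp_name and its helpers are illustrative
names:

```python
OPEN = {"(": ")", "<": ">", "{": "}"}
CLOSE = set(OPEN.values())

def _match(s, i):
    """Return the index of the bracket closing the one opened at s[i]."""
    depth = 0
    for j in range(i, len(s)):
        if s[j] in OPEN:
            depth += 1
        elif s[j] in CLOSE:
            depth -= 1
            if depth == 0:
                return j
    raise ValueError("unbalanced brackets")

def _split_top_level(s):
    """Split s on commas that are not nested inside any bracket."""
    parts, depth, start = [], 0, 0
    for i, c in enumerate(s):
        if c in OPEN:
            depth += 1
        elif c in CLOSE:
            depth -= 1
        elif c == "," and depth == 0:
            parts.append(s[start:i].strip())
            start = i + 1
    parts.append(s[start:].strip())
    return parts

def format_cpp_name(s, level=0, indent="  "):
    """Put each argument of a comma-containing bracket pair on its own line."""
    out, i = [], 0
    while i < len(s):
        c = s[i]
        if c in OPEN:
            j = _match(s, i)
            inner = s[i + 1:j]
            if "," in inner:
                pad = indent * (level + 1)
                items = [format_cpp_name(p, level + 1, indent)
                         for p in _split_top_level(inner)]
                out.append(c + "\n" + ",\n".join(pad + it for it in items)
                           + "\n" + indent * level + OPEN[c])
            else:
                out.append(c + inner.strip() + OPEN[c])
            i = j + 1
        else:
            out.append(c)
            i += 1
    return "".join(out)
```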

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20191226142212.37260-1-bhalevy@scylladb.com>
2020-01-01 12:08:12 +02:00
Rafael Ávila de Espíndola
75817d1fe7 sstable: Add checks to help track problems with large_data_handler use after free
I can't quite figure out how we were trying to write a sstable with
the large data handler already stopped, but the backtrace suggests a
good place to add extra checks.

This patch adds two checks: one at the start and one at the end of
sstable::write_components. The first one should give us better
backtraces if the large_data_handler is already stopped. The second
one should help catch some race condition.

Refs: #5470
Message-Id: <20191231173237.19040-1-espindola@scylladb.com>
2020-01-01 12:03:31 +02:00
Rafael Ávila de Espíndola
3c34e2f585 types: Avoid an unaligned load in json integer serialization
The patch also adds a test that makes the fixed issue easier to
reproduce.

Fixes #5413
Message-Id: <20191231171406.15980-1-espindola@scylladb.com>
2019-12-31 19:23:42 +02:00
Gleb Natapov
bae5cb9f37 commitlog: remove unused argument during segment creation
Since 99a5a77234 all segments are created
equal and "active" argument is never true, so drop it.

Message-Id: <20191231150639.GR9084@scylladb.com>
2019-12-31 17:14:03 +02:00
Rafael Ávila de Espíndola
aa535a385d enum_option_test: Add an explicit underlying type to an enum
We expect to be able to create a variable with an out of range value,
so the enum needs an explicit underlying type.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20191230222029.88942-1-espindola@scylladb.com>
2019-12-31 16:59:00 +02:00
Nadav Har'El
48a914c291 Fix uninitialized members
Merged pull request https://github.com/scylladb/scylla/pull/5532 from
Benny Halevy:

Initialize bool members in row_level_repair and _storage_service causing
ubsan errors.

Fixes #5531
2019-12-31 10:32:54 +02:00
Takuya ASADA
aa87169670 dist/debian: add procps on Depends
We require procps package to use sysctl on postinst script for scylla-kernel-conf.

Fixes #5494

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20191218234100.37844-1-syuu@scylladb.com>
2019-12-30 19:30:35 +02:00
Avi Kivity
972127e3a8 atomic_cell: add type-aware pretty printing
The standard printer for atomic_cell prints the value as hex,
because atomic_cell does not include the type. Add a type-aware
printer that allows the user to provide the type.
2019-12-30 18:27:04 +02:00
Avi Kivity
19f68412ad atomic_cell: move pretty printers from database.cc to atomic_cell.cc
atomic_cell.cc is the logical home for atomic_cell pretty printers,
and since we plan to add more pretty printers, start by tidying up.
2019-12-30 18:20:30 +02:00
Eliran Sinvani
21dec3881c debian-reloc: rename build product to the name specified in SCYLLA-VERSION-GEN
When the product name is other than "scylla", the debian
packaging scripts go over all files that start with "scylla-"
and change the prefix to be the actual product name.
However, if there are no such files in the directory
the script will fail since the renaming command will
get the wildcard string instead of an actual file name.
This patch replaces the command with a command with
an equivalent desired effect that only operates on files
if there are any.
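
The failure mode is the usual one for an unmatched shell wildcard. A Python
sketch of the intended behavior, renaming only files that actually exist,
might look like this (rename_prefix is a hypothetical helper, not the
command used in the packaging scripts):

```python
import os
import tempfile

def rename_prefix(directory, product, old_prefix="scylla-"):
    """Rename <old_prefix>* files to <product>-*; do nothing if none match."""
    renamed = []
    for base in sorted(os.listdir(directory)):
        if not base.startswith(old_prefix):
            continue
        new_base = product + "-" + base[len(old_prefix):]
        os.rename(os.path.join(directory, base),
                  os.path.join(directory, new_base))
        renamed.append(new_base)
    return renamed

# Usage: an empty directory is a no-op rather than an error.
with tempfile.TemporaryDirectory() as d:
    empty_result = rename_prefix(d, "acme")
    open(os.path.join(d, "scylla-server.install"), "w").close()
    renamed_result = rename_prefix(d, "acme")
    final_listing = sorted(os.listdir(d))
```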

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <20191230143250.18101-1-eliransin@scylladb.com>
2019-12-30 17:45:50 +02:00
Takuya ASADA
263385cb4b dist: stop replacing /usr/lib/scylla with symlink (#5530)
Since we merged /usr/lib/scylla with /opt/scylladb, we removed
/usr/lib/scylla and replaced it with a symlink pointing to /opt/scylladb.
However, RPM does not support replacing a directory with a symlink, so
we are doing a dirty hack using an RPM scriptlet, but it causes
multiple issues on upgrade/downgrade.
(See: https://docs.fedoraproject.org/en-US/packaging-guidelines/Directory_Replacement/)

To minimize Scylla upgrade/downgrade issues on the user side, it's better
to keep the /usr/lib/scylla directory.
Instead of creating a single symlink /usr/lib/scylla -> /opt/scylladb,
we can create symlinks for each setup script, like
/usr/lib/scylla/<script> -> /opt/scylladb/scripts/<script>.

Fixes #5522
Fixes #4585
Fixes #4611
2019-12-30 13:52:24 +02:00
Hagit Segev
9d454b7dc6 reloc/build_rpm.sh: Fix '--builddir' option handling (#5519)
The '--builddir' option value is assigned to the "builddir" variable,
which is wrong. The correct variable is "BUILDDIR" so use that instead
to fix the '--builddir' option.

Also, add logging to the script when executing the "dist/redhat_build.rpm.sh"
script to simplify debugging.
2019-12-30 13:25:22 +02:00
Benny Halevy
8aa5d84dd8 storage_service: initialize _is_bootstrap_mode
Hit the following ubsan error with bootstrap_test:TestBootstrap.manual_bootstrap_test in debug mode:
  service/storage_service.cc:3519:37: runtime error: load of value 190, which is not a valid value for type 'bool'

The use site is:
  service::storage_service::is_cleanup_allowed(seastar::basic_sstring<char, unsigned int, 15u, true>)::{lambda(service::storage_service&)#1}::operator()(service::storage_service&) const at /local/home/bhalevy/dev/scylla/service/storage_service.cc:3519

While at it, initialize `_initialized` to false as well, just in case.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2019-12-30 11:44:58 +02:00
Benny Halevy
474ffb6e54 repair: initialize row_level_repair: _zero_rows
Avoid following UBSAN error:
repair/row_level.cc:2141:7: runtime error: load of value 240, which is not a valid value for type 'bool'

Fixes #5531

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2019-12-30 11:44:58 +02:00
Fabiano Lucchese
d7795b1efa scylla_setup: Support for enforcing optimal Linux clocksource setting (#5499)
A Linux machine typically has multiple clocksources with different
performance characteristics. Selecting a faster clocksource might result in
better performance for ScyllaDB, so this should be considered whenever
starting it up.

This patch introduces the possibility of enforcing an optimized Linux
clocksource in Scylla's setup/start-up processes. It does so by adding
an interactive question about enforcing the clocksource setting to scylla_setup,
which modifies the parameter "CLOCKSOURCE" in the scylla_server configuration
file. This parameter is read by perftune.py which, if it is set to "yes", proceeds
to set the clocksource (non-persistently). On x86, the TSC clocksource is used.

Fixes #4474
Fixes #5474
Fixes #5480
2019-12-30 10:54:14 +02:00
Avi Kivity
e223154268 cdc: options: return an empty options map when cdc is disabled
This is compatible with 3.1 and below, which didn't have that schema
field at all.
2019-12-29 16:34:37 +02:00
Benny Halevy
27e0aee358 docs/debugging.md: fix anchor links
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20191229074136.13516-1-bhalevy@scylladb.com>
2019-12-29 16:26:26 +02:00
Pavel Solodovnikov
aba9a11ff0 cql: pass variable_specifications via lw_shared_ptr
Instances of `variable_specifications` are passed around as
shared_ptr's, which are redundant in this case since the class
is marked as `final`. Use `lw_shared_ptr` instead since we know
for sure it's not a polymorphic pointer.

Tests: unit(debug)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20191225232853.45395-1-pa.solodovnikov@scylladb.com>
2019-12-29 16:26:26 +02:00
Benny Halevy
4c884908bb directories: Keep a unique set of directories to initialize
If any two directories of data/commitlog/hints/view_hints
are the same we still end up running verify_owner_and_mode
and disk_sanity(check_direct_io_support) in parallel
on the same directories and hit #5510.

This change uses std::set rather than std::vector to
collect a unique set of directories that need initialization.
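
The idea can be illustrated with a short Python sketch, where a set plays
the role of std::set. The function name and the example paths are
assumptions made for illustration:

```python
def directories_to_initialize(data_dirs, commitlog, hints, view_hints):
    """Collect each configured directory once, even if configured twice."""
    unique = set(data_dirs)
    unique.update([commitlog, hints, view_hints])
    # Sorted for a deterministic order, similar to iterating a std::set.
    return sorted(unique)

# Hypothetical configuration where hints and view_hints share a directory.
dirs = directories_to_initialize(
    ["/var/lib/scylla/data"],
    "/var/lib/scylla/commitlog",
    "/var/lib/scylla/hints",
    "/var/lib/scylla/hints",
)
```

Each directory then gets its ownership and direct I/O checks exactly once.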

Fixes #5510

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20191225160645.2051184-1-bhalevy@scylladb.com>
2019-12-29 16:26:26 +02:00
Gleb Natapov
60a851d3a5 commitlog: always flush segments atomically with writing
db::commitlog::segment::batch_cycle() assumes that after a write
for a certain position completes (as reported by
_pending_ops.wait_for_pending()) it will also be flushed, but this is
true only if writing and flushing are atomic wrt _pending_ops lock.
It usually is unless flush_after is set to false when cycle() is
called. In this case only writing is done under the lock. This
is exactly what happens when a segment is closed. Flush is skipped
because zero header is added after the last entry and then flushed, but
this optimization breaks batch_cycle() assumption. Fix it by flushing
after the write atomically even if a segment is being closed.

Fixes #5496

Message-Id: <20191224115814.GA6398@scylladb.com>
2019-12-24 14:52:23 +02:00
Pavel Emelyanov
a5cdfea799 directories: Do not mess with per-shard base dir
The hints and view_hints directories have per-shard sub-dirs,
and the directories code tries to create, check and lock
all of them, including the base one.

The manipulations in question are excessive -- it's enough
to check and lock either the base dir, or all the per-shard
ones, but not everything. Let's take the latter approach for
its simplicity.

Fixes #5510

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Looks-good-to: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20191223142429.28448-1-xemul@scylladb.com>
2019-12-24 14:49:28 +02:00
Benny Halevy
f8f5db42ca dbuild: try to pull image if not present locally
Pekka Enberg <penberg@scylladb.com> wrote:
> Image might not be present, but the subsequent "docker run" command will automatically pull it.

Just letting "docker run" fail produces a rather confusing error message
referring to docker help, but we want to provide the user
with our own help. So still fail early, but also try to pull the image
if "docker image inspect" failed, indicating it's not present locally.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20191223085219.1253342-4-bhalevy@scylladb.com>
2019-12-24 11:13:23 +02:00
Benny Halevy
ee2f97680a dbuild: just die when no image-id is provided
Suggested-by: Pekka Enberg <penberg@scylladb.com>
> This will print all the available Docker images,
> many (most?) of them completely unrelated.
> Why not just print an error saying that no image was specified,
> and then perhaps print usage.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20191223085219.1253342-3-bhalevy@scylladb.com>
2019-12-24 11:13:22 +02:00
Benny Halevy
87b2f189f7 dbuild: s/usage/die/
Suggested-by: Dejan Mircevski <dejan@scylladb.com>
> The use pattern of this function strongly suggests a name like `die`.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20191223085219.1253342-2-bhalevy@scylladb.com>
2019-12-24 11:13:21 +02:00
Benny Halevy
718e9eb341 table: move_sstables_from_staging: fix use after free of shared_sstable
Introduced in 4b3243f5b9

Reproducible with materialized_views_test:TestMaterializedViews.mv_populating_from_existing_data_during_node_remove_test
and read_amplification_test:ReadAmplificationTest.no_read_amplification_on_repair_with_mv_test

==955382==ERROR: AddressSanitizer: heap-use-after-free on address 0x60200023de18 at pc 0x00000051d788 bp 0x7f8a0563fcc0 sp 0x7f8a0563fcb0
READ of size 8 at 0x60200023de18 thread T1 (reactor-1)
    #0 0x51d787 in seastar::lw_shared_ptr<sstables::sstable>::lw_shared_ptr(seastar::lw_shared_ptr<sstables::sstable> const&) /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/shared_ptr.hh:289
    #1 0x10ba189 in apply<table::move_sstables_from_staging(std::vector<seastar::lw_shared_ptr<sstables::sstable> >)::<lambda()>::<lambda(std::set<seastar::basic_sstring<char, unsigned int, 15> >&)>::<lambda(sstables::shared_sstable)>&, const seastar::lw_shared_ptr<sstables::sstable>&> /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/future.hh:1530
    #2 0x109c4f1 in apply<table::move_sstables_from_staging(std::vector<seastar::lw_shared_ptr<sstables::sstable> >)::<lambda()>::<lambda(std::set<seastar::basic_sstring<char, unsigned int, 15> >&)>::<lambda(sstables::shared_sstable)>&, const seastar::lw_shared_ptr<sstables::sstable>&> /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/future.hh:1556
    #3 0x106941a in do_for_each<__gnu_cxx::__normal_iterator<const seastar::lw_shared_ptr<sstables::sstable>*, std::vector<seastar::lw_shared_ptr<sstables::sstable> > >, table::move_sstables_from_staging(std::vector<seastar::lw_shared_ptr<sstables::sstable> >)::<lambda()>::<lambda(std::set<seastar::basic_sstring<char, unsigned int, 15> >&)>::<lambda(sstables::shared_sstable)> > /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/future-util.hh:618
    #4 0x1069203 in operator() /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/future-util.hh:626
    #5 0x10ba589 in apply /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/apply.hh:36
    #6 0x10ba668 in apply<seastar::do_for_each(Iterator, Iterator, AsyncAction) [with Iterator = __gnu_cxx::__normal_iterator<const seastar::lw_shared_ptr<sstables::sstable>*, std::vector<seastar::lw_shared_ptr<sstables::sstable> > >; AsyncAction = table::move_sstables_from_staging(std::vector<seastar::lw_shared_ptr<sstables::sstable> >)::<lambda()>::<lambda(std::set<seastar::basic_sstring<char, unsigned int, 15> >&)>::<lambda(sstables::shared_sstable)>]::<lambda()>&> /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/apply.hh:44
    #7 0x10ba7c0 in apply<seastar::do_for_each(Iterator, Iterator, AsyncAction) [with Iterator = __gnu_cxx::__normal_iterator<const seastar::lw_shared_ptr<sstables::sstable>*, std::vector<seastar::lw_shared_ptr<sstables::sstable> > >; AsyncAction = table::move_sstables_from_staging(std::vector<seastar::lw_shared_ptr<sstables::sstable> >)::<lambda()>::<lambda(std::set<seastar::basic_sstring<char, unsigned int, 15> >&)>::<lambda(sstables::shared_sstable)>]::<lambda()>&> /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/future.hh:1563
    ...

0x60200023de18 is located 8 bytes inside of 16-byte region [0x60200023de10,0x60200023de20)
freed by thread T1 (reactor-1) here:
    #0 0x7f8a153b796f in operator delete(void*) (/lib64/libasan.so.5+0x11096f)
    #1 0x6ab4d1 in __gnu_cxx::new_allocator<seastar::lw_shared_ptr<sstables::sstable> >::deallocate(seastar::lw_shared_ptr<sstables::sstable>*, unsigned long) /usr/include/c++/9/ext/new_allocator.h:128
    #2 0x612052 in std::allocator_traits<std::allocator<seastar::lw_shared_ptr<sstables::sstable> > >::deallocate(std::allocator<seastar::lw_shared_ptr<sstables::sstable> >&, seastar::lw_shared_ptr<sstables::sstable>*, unsigned long) /usr/include/c++/9/bits/alloc_traits.h:470
    #3 0x58fdfb in std::_Vector_base<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable> > >::_M_deallocate(seastar::lw_shared_ptr<sstables::sstable>*, unsigned long) /usr/include/c++/9/bits/stl_vector.h:351
    #4 0x52a790 in std::_Vector_base<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable> > >::~_Vector_base() /usr/include/c++/9/bits/stl_vector.h:332
    #5 0x52a99b in std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable> > >::~vector() /usr/include/c++/9/bits/stl_vector.h:680
    #6 0xff60fa in ~<lambda> /local/home/bhalevy/dev/scylla/table.cc:2477
    #7 0xff7202 in operator() /local/home/bhalevy/dev/scylla/table.cc:2496
    #8 0x106af5b in apply<table::move_sstables_from_staging(std::vector<seastar::lw_shared_ptr<sstables::sstable> >)::<lambda()> > /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/future.hh:1573
    #9 0x102f5d5 in futurize_apply<table::move_sstables_from_staging(std::vector<seastar::lw_shared_ptr<sstables::sstable> >)::<lambda()> > /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/future.hh:1645
    #10 0x102f9ee in operator()<seastar::semaphore_units<seastar::named_semaphore_exception_factory> > /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/semaphore.hh:488
    #11 0x109d2f1 in apply /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/apply.hh:36
    #12 0x109d42c in apply<seastar::with_semaphore(seastar::basic_semaphore<ExceptionFactory, Clock>&, size_t, Func&&) [with ExceptionFactory = seastar::named_semaphore_exception_factory; Func = table::move_sstables_from_staging(std::vector<seastar::lw_shared_ptr<sstables::sstable> >)::<lambda()>; Clock = std::chrono::_V2::steady_clock]::<lambda(auto:51)>&, seastar::semaphore_units<seastar::named_semaphore_exception_factory, std::chrono::_V2::steady_clock>&&> /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/apply.hh:44
    #13 0x109d595 in apply<seastar::with_semaphore(seastar::basic_semaphore<ExceptionFactory, Clock>&, size_t, Func&&) [with ExceptionFactory = seastar::named_semaphore_exception_factory; Func = table::move_sstables_from_staging(std::vector<seastar::lw_shared_ptr<sstables::sstable> >)::<lambda()>; Clock = std::chrono::_V2::steady_clock]::<lambda(auto:51)>&, seastar::semaphore_units<seastar::named_semaphore_exception_factory, std::chrono::_V2::steady_clock>&&> /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/future.hh:1563
    ...

Fixes #5511

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20191222214326.1229714-1-bhalevy@scylladb.com>
2019-12-23 15:20:41 +02:00
Konstantin Osipov
476fbc60be test.py: prepare to remove custom colors
Add dbuild dependency on python3-colorama,
which will be used in test.py instead of a hand-made palette.

[avi: update tools/toolchain/image]
Message-Id: <20191223125251.92064-2-kostja@scylladb.com>
2019-12-23 15:13:22 +02:00
Pavel Emelyanov
d361894b9d batchlog_manager: Speed up token_metadata endpoints counting a bit
In this place we only need to know the number of endpoints,
while current code additionally shuffles them before counting.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2019-12-23 14:22:45 +02:00
Pavel Emelyanov
6e06c88b4c token_metadata: Remove unused helper
There are two _identical_ methods in token_metadata class:
get_all_endpoints_count() and number_of_endpoints().
The former is used (called) and the latter is not, so
let's remove the latter.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2019-12-23 14:22:43 +02:00
Pavel Emelyanov
2662d9c596 migration_manager: Remove run_may_throw() first argument
It's unused in this function. Also this helps getting
rid of global instances of components.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2019-12-23 14:22:42 +02:00
Pavel Emelyanov
703b16516a storage_service: Remove unused helper
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2019-12-23 14:22:41 +02:00
Takuya ASADA
e0071b1756 reloc: don't archive dist/ami/files/*.rpm on relocatable package
We should skip archiving dist/ami/files/*.rpm on relocatable package,
since they are not used.
The same goes for packer and variables.json.

Fixes #5508

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20191223121044.163861-1-syuu@scylladb.com>
2019-12-23 14:19:51 +02:00
Tomasz Grabiec
28dec80342 db/schema_tables: Add trace-level logging of schema digesting
This greatly helps to narrow down the source of schema digest mismatch
between nodes. Intended use is to enable this logger on disagreeing
nodes, trigger schema digest recalculation, observe which
mutations differ in digest, and then examine their content.

Message-Id: <1574872791-27634-1-git-send-email-tgrabiec@scylladb.com>
2019-12-23 12:28:22 +02:00
Konstantin Osipov
1116700bc9 test.py: do not return 0 if there are failed tests
Fix a return value regression introduced when switching to asyncio.

Message-Id: <20191222134706.16616-2-kostja@scylladb.com>
2019-12-22 16:14:32 +02:00
Asias He
7322b749e0 repair: Do not return working_row_buf_nr in get combined row hash verb
In commit b463d7039c (repair: Introduce
get_combined_row_hash_response), working_row_buf_nr is returned in
REPAIR_GET_COMBINED_ROW_HASH in addition to the combined hash. It is
scheduled to be part of the 3.1 release. However, by accident it was not
backported to 3.1.

To keep repair compatible between 3.1 and 3.2, we need to drop
the working_row_buf_nr in the 3.2 release.

Fixes: #5490
Backports: 3.2
Tests: Run repair in a mixed 3.1 and 3.2 cluster
2019-12-21 20:13:15 +02:00
Takuya ASADA
8eaecc5ed6 dist/common/scripts/scylla_setup: add swap existence check
Show warnings when no swap is configured on the node.

Closes #2511

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20191220080222.46607-1-syuu@scylladb.com>
2019-12-21 20:03:58 +02:00
Pavel Solodovnikov
5a15bed569 cql3: return result_set by cref in cql3::result::result_set
Changes summary:
* make `cql3::result_set` movable-only
* change signature of `cql3::result::result_set` to return by cref
* adjust available call sites to the aforementioned method to accept cref

Motivation behind this change is elimination of dangerous API,
which can easily set a trap for developers who don't expect that
result_set would be returned by value.

There is no point in copying the `result_set` around, so make
`cql3::result::result_set` cache the `result_set` internally in a
`unique_ptr` member variable and return a const reference, so as to
minimize unnecessary copies.

Tests: unit(debug)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20191220115100.21528-1-pa.solodovnikov@scylladb.com>
2019-12-21 16:56:42 +02:00
Takuya ASADA
3a6cb0ed8c install.sh: drop limits.d from nonroot mode
The file is only required for root mode.

Fixes #5507

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20191220101940.52596-1-syuu@scylladb.com>
2019-12-21 15:26:08 +02:00
Botond Dénes
08bb0bd6aa mutation_fragment_stream_validator: wrap exceptions into own exception type
So a higher level component using the validator to validate a stream can
catch only validation errors, and let any other incidental exception
through.

This allows building data correctors on top of the
`mutation_fragment_stream_validator`, by filtering a fragment stream
through a validator, catching invalid fragment stream exceptions and
dropping the respective fragments from the stream.
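
The pattern described above can be illustrated with a minimal Python
sketch. It is not the C++ implementation: `InvalidFragmentStream`,
`validate` and `drop_invalid` are hypothetical names, and the "fragments
must not go backwards" rule is a toy stand-in for real stream validation.
The point is only that a dedicated exception type lets the caller drop
invalid fragments while unrelated errors still propagate.

```python
from typing import Iterable, Iterator, Optional


class InvalidFragmentStream(Exception):
    """Hypothetical error type raised only by the validator."""


def validate(fragment, previous: Optional[object]) -> None:
    # Toy validation rule (an assumption): fragments must be non-decreasing.
    if previous is not None and fragment < previous:
        raise InvalidFragmentStream(f"{fragment!r} follows {previous!r}")


def drop_invalid(fragments: Iterable) -> Iterator:
    """Filter a stream through the validator, dropping invalid fragments."""
    previous = None
    for fragment in fragments:
        try:
            validate(fragment, previous)
        except InvalidFragmentStream:
            # Only validation errors are swallowed; anything else
            # (e.g. a TypeError) propagates to the caller.
            continue
        previous = fragment
        yield fragment


corrected = list(drop_invalid([1, 3, 2, 4]))
```

Catching the base `Exception` instead would also hide incidental
failures, which is what the dedicated type is meant to avoid.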

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20191220073443.530750-1-bdenes@scylladb.com>
2019-12-20 12:05:00 +01:00
Rafael Ávila de Espíndola
91c7f5bf44 Print build-id on startup
Fixes #5426

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20191218031556.120089-1-espindola@scylladb.com>
2019-12-19 15:43:04 +02:00
Avi Kivity
440ad6abcc Revert "relocatable: Check that patchelf didn't mangle the PT_LOAD headers"
This reverts commit 237ba74743. While it
works for the scylla executable, it fails for iotune, which is built
by seastar. It should be reinstated after we pass the correct link
parameters to the seastar build system.
2019-12-19 11:20:34 +02:00
Pekka Enberg
c0aea19419 Merge "Add a timeout for housekeeping for offline installs" from Amnon
"
This series solves an issue with scylla_setup and prevents it from
waiting forever if housekeeping cannot look for the new Scylla version.

Fixes #5302

It should be backported to versions that support offline installations.
"

* 'scylla_setup_timeout' of git://github.com/amnonh/scylla:
  scylla_setup: do not wait forever if no reply is return housekeeping
  scylla_util.py: Add optional timeout to out function
2019-12-19 08:18:19 +02:00
Rafael Ávila de Espíndola
8d777b3ad5 relocatable: Use a super long path for the dynamic linker
Having a long path allows patchelf to change the interpreter without
changing the PT_LOAD headers and therefore without moving the
build-id out of the first page.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20191213224803.316783-1-espindola@scylladb.com>
2019-12-18 19:10:59 +02:00
Pavel Solodovnikov
c451f6d82a LWT: Fix required participants calculation for LOCAL_SERIAL CL
Suppose we have a multi-dc setup (e.g. 9 nodes distributed across
3 datacenters: [dc1, dc2, dc3] -> [3, 3, 3]).

When a query that uses LWT is executed with LOCAL_SERIAL consistency
level, the `storage_proxy::get_paxos_participants` function
incorrectly calculates the number of required participants to serve
the query.

In the example above it's calculated to be 5 (i.e. the number of
nodes needed for a regular QUORUM) instead of 2 (for LOCAL_SERIAL,
which is equivalent to LOCAL_QUORUM cl in this case).

This behavior results in an exception being thrown when executing
the following query with LOCAL_SERIAL cl:

INSERT INTO users (userid, firstname, lastname, age) VALUES (0, 'first0', 'last0', 30) IF NOT EXISTS

Unavailable: Error from server: code=1000 [Unavailable exception] message="Cannot achieve consistency level for cl LOCAL_SERIAL. Requires 5, alive 3" info={'required_replicas': 5, 'alive_replicas': 3, 'consistency': 'LOCAL_SERIAL'}
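
The arithmetic behind the numbers above can be sketched as follows. This
is an illustration only, not the `storage_proxy` code: the helper names
are hypothetical, and it assumes 3 replicas in each datacenter (9 in
total), matching the example.

```python
# Assumption: replication factor 3 in each of the three datacenters.
replicas_per_dc = {"dc1": 3, "dc2": 3, "dc3": 3}


def quorum(replicas: int) -> int:
    """Smallest majority of the given number of replicas."""
    return replicas // 2 + 1


def required_participants(cl: str, local_dc: str) -> int:
    """Hypothetical helper: participants needed for a serial CL."""
    if cl == "LOCAL_SERIAL":
        # Only replicas in the coordinator's datacenter count.
        return quorum(replicas_per_dc[local_dc])
    if cl == "SERIAL":
        # All replicas in the cluster count.
        return quorum(sum(replicas_per_dc.values()))
    raise ValueError(f"not a serial consistency level: {cl}")


local_required = required_participants("LOCAL_SERIAL", "dc1")
global_required = required_participants("SERIAL", "dc1")
```

With only the 3 local replicas considered alive for LOCAL_SERIAL, a
requirement of 5 can never be met, which matches the error shown above.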

Tests: unit(dev), dtest(consistency_test.py)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20191216151732.64230-1-pa.solodovnikov@scylladb.com>
2019-12-18 16:58:32 +01:00
Botond Dénes
cd6bf3cb28 scylla-gdb.py: static_vector: update for changed storage
The actual buffer is now in a member called 'data'. Leave the old
`dummy.dummy` and `dummy` as fall-back. This seems to change every
Fedora release.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20191218153544.511421-1-bdenes@scylladb.com>
2019-12-18 17:39:56 +02:00
Tomasz Grabiec
5865d08d6c migration_manager: Recalculate schema only on shard 0
Schema is node-global, update_schema_version_and_announce() updates
all shards.  We don't need to recalculate it from every shard, so
install the listeners only on shard 0. Reduces noise in the logs.

Message-Id: <1574872860-27899-1-git-send-email-tgrabiec@scylladb.com>
2019-12-18 16:43:26 +02:00
Pavel Emelyanov
998f51579a storage_service: Rip join_ring config option
The option in question apparently does not work: several sharded objects
are start()-ed (and thus instantiated) in join_token_ring, while instances
of these objects are used during init of other stuff.

This leads to broken seastar local_is_initialized assertion on sys_dist_ks,
but reading the code shows more examples, e.g. the auth_service is started
on join, but is used for thrift and cql servers initialization.

The suggestion is to remove the option instead of fixing it. The is_joined
logic is kept since on-start joining still can take some time and it's safer
to report real status from the API.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20191203140717.14521-1-xemul@scylladb.com>
2019-12-18 12:45:13 +02:00
Nadav Har'El
8157f530f5 merge: CDC: handle schema changes
Merged pull request https://github.com/scylladb/scylla/pull/5366 from Calle Wilund:

Moves schema creation/alter/drop awareness to use new "before" callbacks from
migration manager, and adds/modifies log and streams table as part of the base
table modification.

Makes schema changes semi-atomic per node. While this does not deal with updates
coming in before a schema change has propagated across the cluster, it now falls
into the same pit as when this happens without CDC.

Added side effect is also that now schemas are transparent across all subsystems,
not just cql.

Patches:
  cdc_test: Add small test for altering base schema (add column)
  cdc: Handle schema changes via migration manager callbacks
  migration_manager: Invoke "before" callbacks for table operations
  migration_listener: Add empty base class and "before" callbacks for tables
  cql_test_env: Include cdc service in cql tests
  cdc: Add sharded service that does nothing.
  cdc: Move "options" to separate header to avoid too much header inclusion
  cdc: Remove some code from header
2019-12-17 23:04:36 +02:00
Avi Kivity
1157ee16a5 Update seastar submodule
* seastar 00da4c8760...0525bbb08f (7):
  > future: Simplify future_state_base::any move constructor
  > future: don't create temporary tuple on future::get().
  > future: don't instantiate new future on future::then_wrapped().
  > future: clean-up the Result handling in then_wrapped().
  > Merge "Fix core dumps when asan is enabled" from Rafael
  > future: Move ignore to the base class
  > future: Don't delete in ignore
2019-12-17 19:47:50 +02:00
Botond Dénes
638623b56b configure.py: make build.ninja target depend on SCYLLA-VERSION-GEN
Currently `SCYLLA-VERSION-GEN` is not a dependency of any target and
hence changes done to it will not be picked up by ninja. To trigger a
rebuild and hence version changes to appear in the `scylla` target
binary, one has to do `touch configure.py`. This is counterintuitive
and frustrating to people who don't know about it and wonder why their
changed version is not appearing as the output of `scylla --version`.

This patch makes `SCYLLA-VERSION-GEN` a dependency of `build.ninja`,
making the `build.ninja` target out-of-date whenever
`SCYLLA-VERSION-GEN` is changed and hence will trigger a rerun of
`configure.py` when the next target is built, allowing a build of e.g.
`scylla` to pick up any changes done to the version automatically.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20191217123955.404172-1-bdenes@scylladb.com>
2019-12-17 17:40:04 +02:00
Avi Kivity
7152ba0c70 Merge "tests: automatically search for unit tests" from Kostja
"
This patch set rearranges the test files so that
it is now possible to search for tests automatically,
and adds this functionality to test.py
"

* 'test.py.requeue' of ssh://github.com/scylladb/scylla-dev:
  cmake: update CMakeLists.txt to scan test/ rather than tests/
  test.py: automatically lookup all unit and boost tests
  tests: move all test source files to their new locations
  tests: move a few remaining headers
  tests: move another set of headers to the new test layout
  tests: move .hh files and resources to new locations
  tests: remove executable property from data_listeners_test.cc
2019-12-17 17:32:18 +02:00
Amnon Heiman
dd42f83013 scylla_setup: do not wait forever if no reply is return housekeeping
When scylla is installed without network connectivity, the check for a
newer version can cause scylla_setup to wait forever.

This patch adds a limit to the time scylla_setup will wait for a reply.

When there is no reply, an error is shown saying that it was
unable to check for a newer version, but this will not block the setup
script.
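
A bounded wait of this kind can be sketched with the standard library as
below. This is not the scylla_util.py code: `run_with_timeout` is a
hypothetical name, and returning `None` on timeout is an assumption made
for the illustration.

```python
import subprocess
import sys
from typing import List, Optional


def run_with_timeout(cmd: List[str], timeout: Optional[float] = None) -> Optional[str]:
    """Return the command's stripped stdout, or None if it timed out."""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True,
                              check=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing lingers.
        return None
    return done.stdout.strip()


# A fast command returns its output; a slow one is cut off.
fast = run_with_timeout([sys.executable, "-c", "print('ok')"], timeout=30)
slow = run_with_timeout([sys.executable, "-c", "import time; time.sleep(30)"],
                        timeout=0.5)
if slow is None:
    print("Unable to check for a newer version; continuing setup")
```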

Fixes #5302

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2019-12-17 14:56:47 +02:00
Nadav Har'El
aa1de5a171 merge: Synchronize snapshot and staging sstable deletion using sem
Merged pull request https://github.com/scylladb/scylla/pull/5343 from
Benny Halevy.

Fixes #5340

Hold the sstable_deletion_sem in table::move_sstables_from_subdirs to
serialize access to the staging directory. It now synchronizes snapshot,
compaction deletion of sstables, and view_update_generator moving of
sstables from staging.

Tests:

    unit (dev) [except test_user_function_timestamp_return that fails for me locally, but also on master]
    snapshot_test.py (dev)
2019-12-17 14:06:02 +02:00
Juliusz Stasiewicz
7fdc8563bf system_keyspace: Added infrastructure for table `system.clients'
I used the following as a reference:
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/virtual/ClientsTable.java
At this moment there is only info about IP, client's outgoing port,
client 'type' (i.e. CQL/thrift/alternator), shard ID and username.
Column `request_count' is NOT present and CK consists of
(`port', `client_type'), contrary to what C* has: (`port').

Code that notifies `system.clients` about new connections goes
to top-level files `connection_notifier.*`. Currently only CQL
clients are observed, but enum `client_type` can be used in future
to notify about connections with other protocols.
2019-12-17 11:31:28 +01:00
Benny Halevy
4b3243f5b9 table: move_sstables_from_staging_in_thread with _sstable_deletion_sem
Hold the _sstable_deletion_sem while moving sstables from the staging directory
so as not to move them from under the feet of table::snapshot.

Fixes #5340

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2019-12-17 12:20:20 +02:00
Benny Halevy
0446ce712a view_update_generator::start: use variable binding
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2019-12-17 12:20:20 +02:00
Benny Halevy
5d7c80c148 view_update_generator::start: fix indentation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2019-12-17 12:20:20 +02:00
Benny Halevy
02784f46b9 view_update_generator: handle errors when processing sstable
The consumer may throw; in this case, break from the loop and retry.

move_sstable_from_staging_in_thread may theoretically throw too,
ignore the error in this case since the sstable was already processed,
individual move failures are already ignored and moving from staging
will be retried upon restart.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2019-12-17 12:20:20 +02:00
Benny Halevy
abda12107f sstables: move_to_new_dir: add do_sync_dirs param
To be used for "batch" move of several sstables from staging
to the base directory, allowing the caller to sync the directories
once when all are moved rather than for each one of them.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2019-12-17 12:20:20 +02:00
Benny Halevy
6efef84185 sstable: return future from move_to_new_dir
distributed_loader::probe_file needlessly creates a seastar
thread for it and the next patch will use it as part of
a parallel_for_each loop to move a list of sstables
(and sync the directories once at the end).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2019-12-17 12:20:20 +02:00
Benny Halevy
0d2a7111b2 view_update_generator: sstable_with_table: std::move constructor args
Just a small optimization.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2019-12-17 12:19:55 +02:00
Nadav Har'El
fc85c49491 alternator: error on unsupported parallel scan
We do not yet support the parallel Scan options (TotalSegments, Segment),
as reported in issue #5059. But even before implementing this feature, it
is important that we produce an error if a user attempts to use it - instead
of outright ignoring this parameter. This is what this patch does.

The patch also adds a full test, test_scan.py::test_scan_parallel, for the
parallel scan feature. The test passes on DynamoDB, and still xfails
on Alternator after this patch - but now the Scan request fails immediately
reporting the unsupported option - instead of what the pre-patch code did:
returning the wrong results and the test failing just when the results
do not match the expectations.

Refs #5059.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20191217084917.26191-1-nyh@scylladb.com>
2019-12-17 11:27:56 +02:00
Avi Kivity
f7d69b0428 Revert "Merge "bouncing lwt request to an owning shard" from Gleb"
This reverts commit 64cade15cc, reversing
changes made to 9f62a3538c.

This commit is suspected of corrupting the response stream.

Fixes #5479.
2019-12-17 11:06:10 +02:00
Rafael Ávila de Espíndola
237ba74743 relocatable: Check that patchelf didn't mangle the PT_LOAD headers
Should avoid issue #4983 showing up again.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20191213224803.316783-2-espindola@scylladb.com>
2019-12-16 20:18:32 +02:00
Avi Kivity
3b7aca3406 Merge "db: Don't create a reference to nullptr" from Rafael
"
Only the first patch is needed to fix the undefined behavior, but the
followup ones simplify the memory management around user types.
"

* 'espindola/fix-5193-v2' of ssh://github.com/espindola/scylla:
  db: Don't use lw_shared_ptr for user_types_metadata
  user_types_metadata: don't implement enable_lw_shared_from_this
  cql3: pass a const user_types_metadata& to prepare_internal
  db: drop special case for top level UDTs
  db: simplify db::cql_type_parser::parse
  db: Don't create a reference to nullptr
  Add test for loading a schema with a non native type
2019-12-16 17:10:58 +02:00
Konstantin Osipov
d6bc7cae67 cmake: update CMakeLists.txt to scan test/ rather than tests/
A follow up on directory rename.
2019-12-16 17:47:42 +03:00
Konstantin Osipov
e079a04f2a test.py: automatically lookup all unit and boost tests
2019-12-16 17:47:42 +03:00
Konstantin Osipov
1c8736f998 tests: move all test source files to their new locations
1. Move tests to test (using singular seems to be a convention
   in the rest of the code base)
2. Move boost tests to test/boost, other
   (non-boost) unit tests to test/unit, tests which are
   expected to be run manually to test/manual.

Update configure.py and test.py with new paths to tests.
2019-12-16 17:47:42 +03:00
Konstantin Osipov
2fca24e267 tests: move a few remaining headers
Move sstable_test.hh, test_table.hh and cql_assertions.hh from tests/ to
test/lib or test/boost and update dependent .cc files.
Move tests/perf_sstable.hh to test/perf/perf_sstable.hh
2019-12-16 17:47:42 +03:00
Konstantin Osipov
b9bf1fbede tests: move another set of headers to the new test layout
Move another small subset of headers to test/
with the same goals:
- preserve bisectability
- make the revision history traceable after a move

Update dependent files.
2019-12-16 17:47:42 +03:00
Konstantin Osipov
8047d24c48 tests: move .hh files and resources to new locations
The plan is to move the unstructured content of tests/ directory
into the following directories of test/:

test/lib - shared header and source files for unit tests
test/boost - boost unit tests
test/unit - non-boost unit tests
test/manual - tests intended to be run manually
test/resource - binary test resources and configuration files

In order to not break git bisect and preserve the file history,
first move most of the header files and resources.
Update paths to these files in .cc files, which are not moved.
2019-12-16 17:47:42 +03:00
Konstantin Osipov
644595e15f tests: remove executable property from data_listeners_test.cc
The executable flag must have been committed to git by mistake.
2019-12-16 17:47:41 +03:00
Benny Halevy
d2e00abe13 tests: commitlog_test: test_allocation_failure: improve error reporting
We're seeing the following error from test from time to time:
  fatal error: in "test_allocation_failure": std::runtime_error: Did not get expected exception from writing too large record

This is not reproducible and the error string does not contain
enough information to figure out what happened exactly, therefore
this patch adds an exception if the call succeeded unexpectedly
and also prints the unexpected exception if one was caught.

Refs #4714

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20191215052434.129641-1-bhalevy@scylladb.com>
2019-12-16 15:38:48 +01:00
Asias He
6b7344f6e5 streaming: Fix typo in stream_result_future::maybe_complete
s/progess/progress/

Refs: #5437
2019-12-16 11:12:03 +02:00
Dejan Mircevski
f3883cd935 dbuild: Fix podman invocation (#5481)
The is_podman check was depending on `docker -v` printing "podman" in
the output, but that doesn't actually work, since podman prints $0.
Use `docker --help` instead, which will output "podman".

Also return podman's return status, which was previously being
dropped.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2019-12-16 11:11:48 +02:00
Avi Kivity
00ae4af94c Merge "Sanitize and speed-up (a bit) directories set up" from Pavel
"
On start there are two things that scylla does on data/commitlog/etc.
dirs: locks and verifies permissions. Right now these two actions are
managed by different approaches, it's convenient to merge them.

Also, the directories class introduced in this set lays the ground for
better --workdir option handling. In particular, right now the db::config
entries are modified after options parse to update directories with
the workdir prefix. With the directories class at hand, we will be able
to stop doing this.
"

* 'br-directories-cleanup' of https://github.com/xemul/scylla:
  directories: Make internals work on fs::path
  directories: Cleanup adding dirs to the vector to work on
  directories: Drop seastar::async usage
  directories: Do touch_and_lock and verify sequentially
  directories: Do touch_and_lock in parallel
  directories: Move the whole stuff into own .cc file
  directories: Move all the dirs code into .init method
  file_lock: Work with fs::path, not sstring
2019-12-15 16:02:46 +02:00
Takuya ASADA
5e502ccea9 install.sh: setup workdir correctly on nonroot mode
Specify correct workdir on nonroot mode, to set correct path of
data / commitlog / hints directories at once.

Fixes #5475

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20191213012755.194145-1-syuu@scylladb.com>
2019-12-15 16:00:57 +02:00
Avi Kivity
c25d51a4ea Revert "scylla_setup: Support for enforcing optimal Linux clocksource setting (#5379)"
This reverts commit 4333b37f9e. It breaks upgrades,
and the user question is not informative enough for the user to make a correct
decision.

Fixes #5478.
Fixes #5480.
2019-12-15 14:37:40 +02:00
Pavel Emelyanov
23a8d32920 directories: Make internals work on fs::path
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2019-12-12 19:52:01 +03:00
Pavel Emelyanov
373fcfdb3e directories: Cleanup adding dirs to the vector to work on
The unordered_set is turned into vector since for fs::path
there's no hash() method that's needed for set.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2019-12-12 19:52:01 +03:00
Pavel Emelyanov
14437da769 directories: Drop seastar::async usage
Now the only remaining future-returning operation is the call to
parallel_for_each(), all the rest is non-blocking preparation,
so we can drop the seastar::async and just return the future
from parallel_for_each.

The indentation is now good, as the previous patch prepared it
for just that.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2019-12-12 19:52:01 +03:00
Pavel Emelyanov
06f4f3e6d8 directories: Do touch_and_lock and verify sequentially
The goal is to drop the seastar::async() usage.

Currently we have two places that return futures -- calls to
parallel_for_each-s.  We can either chain them together or,
since both are working on the same set of directories, chain
actions inside them.

For code simplicity I propose to chain actions.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2019-12-12 19:52:01 +03:00
Pavel Emelyanov
8d0c820aa1 directories: Do touch_and_lock in parallel
The list of paths that should be touch-and-locked is already
at hand, this shortens the code and makes it slightly faster
(in theory).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2019-12-12 19:52:01 +03:00
Pavel Emelyanov
71a528d404 directories: Move the whole stuff into own .cc file
In order not to pollute the root dir place the code in
utils/ directory, "utils" namespace.

While doing this -- move the touch_and_lock from the
class declaration.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2019-12-12 19:52:01 +03:00
Benny Halevy
9ec98324ed messaging_service: unregister_handler: return rpc unregister_handler future
Now that seastar returns it.

Fixes https://github.com/scylladb/scylla/issues/5228

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20191212143214.99328-1-bhalevy@scylladb.com>
2019-12-12 16:38:36 +02:00
Pavel Emelyanov
f2b3c17e66 directories: Move all the dirs code into .init method
The seastar::async usage is temporary, added for bisect-safety,
soon it will go away. For this reason the indentation in the
.init method is not "canonical", but is prepared for one-patch
drop of the seastar::async.

The hinted_handoff_enabled arg is there, as it's not just a
parameter on config, it had been parsed in main.cc.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2019-12-12 17:33:11 +03:00
Pavel Emelyanov
82ef2a7730 file_lock: Work with fs::path, not sstring
The main.cc code that converts sstring to fs::path
will be patched soon, the file_desc::open belongs
to seastar and works on sstrings.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2019-12-12 17:32:10 +03:00
Konstantin Osipov
bc482ee666 test.py: remove an unused option
Message-Id: <20191204142622.89920-2-kostja@scylladb.com>
2019-12-12 15:53:35 +02:00
Avi Kivity
64cade15cc Merge "bouncing lwt request to an owning shard" from Gleb
"
LWT is much more efficient if a request is processed on a shard that owns
a token for the request. This is because otherwise the processing will
bounce to an owning shard multiple times. The patch proposes a way to
move the request to the correct shard before running lwt. It works by
returning an error from lwt code if the shard is incorrect, specifying the
shard the request should be moved to. The error is processed by the transport
code that jumps to the correct shard and re-processes the incoming message there.
"

* 'gleb/bounce_lwt_request' of github.com:scylladb/seastar-dev:
  lwt: take raw lock for entire cas duration
  lwt: drop invoke_on in paxos_state prepare and accept
  lwt: Process lwt request on a owning shard
  storage_service: move start_native_transport into a thread
  transport: change make_result to takes a reference to cql result instead of shared_ptr
2019-12-12 15:50:22 +02:00
Nadav Har'El
9f62a3538c alternator: fix BEGINS_WITH operator for blobs
The implementation of Expected's BEGINS_WITH operator on blobs was
incorrect, naively comparing the base64-encoded strings, which doesn't
work. This patch fixes the code to compare the decoded strings.

The reason why the BEGINS_WITH test missed this bug was that we forgot
to check the blob case and only tested the string case; So this patch
also adds the missing test - which reproduces this bug, and verifies
its fix.
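
Why comparing the encoded strings fails can be seen in a small Python
sketch. It is not the Alternator code; the function names are
hypothetical, and it relies only on the fact that the DynamoDB API
carries binary attributes as base64 strings.

```python
import base64


def begins_with_encoded(value_b64: str, prefix_b64: str) -> bool:
    # The naive approach: compare the base64 text directly.
    return value_b64.startswith(prefix_b64)


def begins_with_decoded(value_b64: str, prefix_b64: str) -> bool:
    # Decode both operands first, then compare the raw bytes.
    value = base64.b64decode(value_b64)
    prefix = base64.b64decode(prefix_b64)
    return value.startswith(prefix)


value = base64.b64encode(b"abcd").decode()   # "YWJjZA=="
prefix = base64.b64encode(b"ab").decode()    # "YWI="

naive = begins_with_encoded(value, prefix)
fixed = begins_with_decoded(value, prefix)
```

Base64 maps each group of 3 bytes to 4 characters and pads the tail, so
a prefix whose length is not a multiple of 3 encodes to text that is not
a prefix of the longer value's encoding.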

Fixes #5457

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20191211115526.29862-1-nyh@scylladb.com>
2019-12-12 14:02:56 +01:00
Dejan Mircevski
27b8b6fe9d cql3: Fix needs_filtering() for clustering columns
The LIKE operator requires filtering, so needs_filtering() must check
is_LIKE().  This already happens for partition columns, but it was
overlooked for clustering columns in the initial implementation of
LIKE.

Fixes #5400.

Tests: unit(dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2019-12-12 01:19:13 +02:00
Benny Halevy
d1bcb39e7f hinted handoff: log message after removing hints directory (#5372)
To be used by dtest as an indicator that endpoint's hints
were drained and hints directory is removed.

Refs #5354

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2019-12-12 01:16:19 +02:00
Rafael Ávila de Espíndola
3b61cf3f0b db: Don't use lw_shared_ptr for user_types_metadata
The user_types_metadata can simply be owned by the keyspace. This
simplifies the code since we never have to worry about nulls and the
ownership is now explicit.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-12-11 10:44:40 -08:00
Rafael Ávila de Espíndola
a55838323b user_types_metadata: don't implement enable_lw_shared_from_this
It looks like this was done just to avoid including
user_types_metadata.hh, which seems a bit much considering that it
requires adding specialization to the seastar namespace.

A followup patch will also stop using lw_shared_ptr for
user_types_metadata.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-12-11 10:44:40 -08:00
Rafael Ávila de Espíndola
f7c2c60b07 cql3: pass a const user_types_metadata& to prepare_internal
We never modify the user_types_metadata via prepare_internal, so we
can pass it a const reference.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-12-11 10:44:40 -08:00
Rafael Ávila de Espíndola
99cb8965be db: drop special case for top level UDTs
This was originally done in 7f64a6ec4b,
but that commit was reverted in
8517eecc28.

The revert was done because the original change would call parse_raw
for non UDT types. Unlike the old patch, this one doesn't change the
behavior of non UDT types.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-12-11 10:44:40 -08:00
Rafael Ávila de Espíndola
7ae9955c5f db: simplify db::cql_type_parser::parse
The variant of db::cql_type_parser::parse that has a
user_types_metadata argument was only used from the variant that
didn't. This inlines one in the other.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-12-11 10:44:40 -08:00
Rafael Ávila de Espíndola
2092e1ef6f db: Don't create a reference to nullptr
The user_types variable can be null during db startup since we have to
create types before reading the system table defining user types.

This avoids undefined behavior, but it is unlikely that it was causing
more serious problems since the variable is only used when creating
user types and we don't create any until after all system tables are
read, in which case the user_types variable is not null.

Fixes #5193

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-12-11 10:44:40 -08:00
Rafael Ávila de Espíndola
6143941535 Add test for loading a schema with a non native type
This would have found the error with the previous version of the patch
series.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-12-11 10:43:34 -08:00
Gleb Natapov
64cfb9b1f6 lwt: take raw lock for entire cas duration
It will prevent parallel update by the same coordinator and should
reduce contention.
2019-12-11 14:41:31 +02:00
Gleb Natapov
898d2330a2 lwt: drop invoke_on in paxos_state prepare and accept
Since lwt requests are now running on an owning shard there is no longer
a need to invoke cross shard call.
2019-12-11 14:41:31 +02:00
Gleb Natapov
964c532c4f lwt: Process lwt request on a owning shard
LWT is much more efficient if a request is processed on a shard that owns
a token for the request. This is because otherwise the processing will
bounce to an owning shard multiple times. The patch proposes a way to
move a request to the correct shard before running lwt. It works by returning
an error from lwt code if the shard is the incorrect one, specifying the shard
the request should be moved to. The error is processed by transport code
that jumps to the correct shard and re-processes the incoming message there.
2019-12-11 14:41:31 +02:00
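The bounce mechanism described in 964c532c4f can be sketched roughly as follows. This is a minimal Python illustration, not the actual C++ code: the names BounceToShard, owning_shard, run_lwt and process_request, the shard count, and the modulo token mapping are all assumptions made for the example.

```python
NUM_SHARDS = 4  # assumed shard count, for illustration only


class BounceToShard(Exception):
    """Raised by the (hypothetical) LWT layer on a non-owning shard."""

    def __init__(self, shard):
        super().__init__(f"request must be processed on shard {shard}")
        self.shard = shard


def owning_shard(token):
    # Hypothetical token-to-shard mapping; the real sharding function differs.
    return token % NUM_SHARDS


def run_lwt(current_shard, token):
    # The LWT code refuses to run on the wrong shard and names the right one.
    target = owning_shard(token)
    if current_shard != target:
        raise BounceToShard(target)
    return f"cas executed on shard {current_shard}"


def process_request(arrival_shard, token):
    # Transport layer: try locally; on a bounce, re-process on the named shard.
    try:
        return run_lwt(arrival_shard, token)
    except BounceToShard as bounce:
        return run_lwt(bounce.shard, token)
```

The point of the design is that the request moves once, up front, instead of the processing hopping to the owning shard several times.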
Gleb Natapov
54be057af3 storage_service: move start_native_transport into a thread
The code runs only once and it is simpler if it runs in a seastar thread.
2019-12-11 14:41:31 +02:00
Gleb Natapov
007ba3e38e transport: change make_result to take a reference to cql result instead of shared_ptr 2019-12-11 14:41:31 +02:00
Nadav Har'El
9e5c6995a3 alternator-test: add tests for ReturnValues parameter
This patch adds comprehensive tests for the ReturnValues parameter of
the write operations (PutItem, UpdateItem, DeleteItem), which can return
pre-write or post-write values of the modified item. The tests are in
a new test file, alternator-test/test_returnvalues.py.

This feature is not yet implemented in Alternator, so all the new
tests xfail on Alternator (and all pass on AWS).

Refs #5053

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20191127163735.19499-1-nyh@scylladb.com>
2019-12-11 13:26:39 +01:00
Nadav Har'El
ab69bfc111 alternator-test: add xfailing tests for ScanIndexForward
This patch adds tests for Query's "ScanIndexForward" parameter, which
can be used to return items in reversed sort order.
We test that a Limit works and returns the given number of *last* items
in the sort order, and also that such reverse queries can be resumed,
i.e., paging works in the reverse order.

These tests pass against AWS DynamoDB, but fail against Alternator (which
doesn't support ScanIndexForward yet), so they are marked xfail.

Refs #5153.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20191127114657.14953-1-nyh@scylladb.com>
2019-12-11 13:26:39 +01:00
Pekka Enberg
6bc18ba713 storage_proxy: Remove reference to MBean interface
The JMX interface is implemented by the scylla-jmx project, not scylla.
Therefore, let's remove this historical reference to MBeans from
storage_proxy.

Message-Id: <20191211121652.22461-1-penberg@scylladb.com>
2019-12-11 14:24:28 +02:00
Avi Kivity
63474a3380 Merge "Add experimental_features option" from Dejan
"
Add --experimental-features -- a vector of features to unlock. Make corresponding changes in the YAML parser.

Fixes #5338
"

* 'vecexper' of https://github.com/dekimir/scylla:
  config: Add `experimental_features` option
  utils: Add enum_option
2019-12-11 14:23:08 +02:00
Avi Kivity
56b9bdc90f Update seastar submodule
* seastar e440e831c8...00da4c8760 (7):
  > Merge "reactor: fix iocb pool underflow due to unaccounted aio fsync" from Avi
Fixes #5443.
  > install-dependencies.sh: fix arch dependencies
  > Merge " rpc: fix use-after-free during rpc teardown vs. rpc server message handling" from Benny
  > Merge "testing: improve the observability of abandoned failed futures" from Botond
  > rework the fair_queue tester
  > directory_test: Update to use run instead of run_deprecated
  > log: support fmt 6.0 branch with chrono.h for log
2019-12-11 14:17:49 +02:00
Benny Halevy
105c8ef5a9 messaging_service: wait on unregister_handler
Prepare for returning future<> from seastar rpc
unregister_handler.

Refs https://github.com/scylladb/scylla/issues/5228

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20191208153924.1953-1-bhalevy@scylladb.com>
2019-12-11 14:17:41 +02:00
Nadav Har'El
06c3802a1a storage_proxy: avoid overflow in view-backlog delay calculation
In the calculate_delay() code for view-backlog flow control, we calculate
a delay and cap it at a "budget" - the remaining timeout. This timeout is
measured in milliseconds, but the capping calculation converted it into
microseconds, which overflowed if the timeout is very large. This causes
some tests which enable the UB sanitizer to fail.

We fix this problem by comparing the delay to the budget in millisecond
resolution, not in microsecond resolution. Then, if the calculated delay
is short enough, we return it using its full microsecond resolution.

Fixes #5412

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20191205131130.16793-1-nyh@scylladb.com>
2019-12-11 14:10:54 +02:00
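The capping logic from 06c3802a1a can be modelled as below. This is only a sketch: Python integers never overflow, so the 64-bit limit is checked explicitly, and the function names are hypothetical rather than taken from storage_proxy.

```python
INT64_MAX = 2**63 - 1


def to_microseconds_checked(ms):
    # Models a 64-bit conversion; Python ints are unbounded, so check by hand.
    us = ms * 1000
    if us > INT64_MAX:
        raise OverflowError("millisecond value does not fit in 64-bit microseconds")
    return us


def cap_delay_us(delay_us, budget_ms):
    # Compare in millisecond resolution: the delay is truncated down to ms,
    # so the potentially huge budget is never converted up to microseconds.
    if delay_us // 1000 < budget_ms:
        return delay_us  # short enough: keep full microsecond resolution
    # Here budget_ms <= delay_us // 1000, so this conversion cannot overflow.
    return to_microseconds_checked(budget_ms)
```

With a very large budget, the naive approach of converting the budget to microseconds first would trip the overflow check, while cap_delay_us never performs that conversion unless the budget is already known to be small.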
Nadav Har'El
2824d8f6aa Merge: alternator: Fix EQ operator for sets
Merged pull request https://github.com/scylladb/scylla/pull/5453
from Piotr Sarna:

Checking the EQ relation for alternator attributes is usually performed
simply by comparing underlying JSON objects, but sets (SS, BS, NS types)
need a special routine, as we need to make sure that sets stored in
a different order underneath are still equal, e.g:

[1, 3, 2] == [1, 2, 3]

Fixes #5021
2019-12-11 13:20:25 +02:00
Piotr Sarna
421db1dc9d alternator-test: remove XFAIL from set EQ test
With this series merged, test_update_expected_1_eq_set from
test_expected.py suite starts passing.
2019-12-11 12:07:39 +01:00
Piotr Sarna
a8e45683cb alternator: add EQ comparison for sets
Checking the EQ relation for alternator attributes is usually performed
simply by comparing underlying JSON objects, but sets (SS, BS, NS types)
need a special routine, as we need to make sure that sets stored in
a different order underneath are still equal, e.g:
[1, 3, 2] == [1, 2, 3]

Fixes #5021
2019-12-11 12:07:39 +01:00
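The order-insensitive comparison described in a8e45683cb might look like this in outline. It is a Python sketch under simplifying assumptions: attributes are single-entry dicts in DynamoDB JSON style, set elements are compared as plain strings, and attribute_eq is a hypothetical name, not the Alternator function.

```python
SET_TYPES = ("SS", "BS", "NS")


def attribute_eq(a, b):
    # Each attribute is a single-entry dict such as {"NS": ["1", "3", "2"]}.
    (type_a, value_a), = a.items()
    (type_b, value_b), = b.items()
    if type_a != type_b:
        return False
    if type_a in SET_TYPES:
        # Sets compare equal regardless of the stored element order.
        return sorted(value_a) == sorted(value_b)
    # Every other type falls back to a plain structural comparison.
    return value_a == value_b
```

For example, attribute_eq({"NS": ["1", "3", "2"]}, {"NS": ["1", "2", "3"]}) holds, while two lists ("L") with the same elements in a different order remain unequal.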
Piotr Sarna
fb37394995 schema_tables: notify table deletions before creations
If a set of mutations contains both an entry that deletes a table
and an entry that adds a table with the same name, it's expected
to be a replacement operation (delete old + create new),
rather than a useless "try to create a table even though it exists
already and then immediately delete the original one" operation.
As such, notifications about the deletions should be performed
before notifications about the creations. The place that originally
suffered from this wrong order is view building - which in this case
created an incorrect duplicated entry in the view building bookkeeping,
and then immediately deleted it, resulting in having old, deprecated
entries with stale UUIDS lying in the build queue and never proceeding,
because the underlying table is long gone.
The issue is fixed by ensuring the order of notifications:
 - drops are announced first, view drops are announced before table drops;
 - creations follow, table creations are announced before views;
 - finally, changes to tables and views are announced;

Fixes #4382

Tests: unit(dev), mv_populating_from_existing_data_during_node_stop_test
2019-12-11 12:48:29 +02:00
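The notification order listed in fb37394995 can be expressed as a sort key. The sketch below is illustrative only; SchemaEvent and notification_order are hypothetical names and do not correspond to the schema_tables code.

```python
from typing import NamedTuple


class SchemaEvent(NamedTuple):
    action: str  # "drop", "create" or "update"
    kind: str    # "table" or "view"
    name: str


def notification_order(events):
    """Return events in the order the commit message describes."""
    def rank(ev):
        if ev.action == "drop":
            # Drops first; view drops before table drops.
            return (0, 0 if ev.kind == "view" else 1)
        if ev.action == "create":
            # Creations next; tables before views.
            return (1, 0 if ev.kind == "table" else 1)
        # Finally, changes to tables and views.
        return (2, 0)
    return sorted(events, key=rank)  # sorted() is stable within one rank
```

With this ordering, a mutation set that drops table t and creates a new table t is announced as "drop old, then create new", which is the replacement semantics the commit message argues for.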
Benny Halevy
d544df6c3c dist/ami/build_ami.sh: support incremental build of rpms (#5191)
Iterate over an array holding all rpm names to see if any
of them is missing from `dist/ami/files`. If they are missing,
look them up in build/redhat/RPMS/x86_64 so that if reloc/build_rpm.sh
was run manually before dist/ami/build_ami.sh we can just collect
the built rpms from its output dir.

If we're still missing any rpms, then run reloc/build_rpm.sh
and copy the required rpms from build/redhat/RPMS/x86_64.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Reviewed-by: Glauber Costa <glauber@scylladb.com>
2019-12-11 12:48:29 +02:00
Amnon Heiman
f43285f39a api: replace swagger definition to use long instead of int (#5380)
In swagger 1.2 int is defined as int32.

We originally used int following the jmx definition; in practice,
internally we use uint and int64 in many places.

While the API formats the type correctly, an external system that uses a
swagger-based code generator can run into type problems.

This patch replaces all uses of int in a return type with long, which is defined as int64.

Changing the return type has no impact on the system, but it does help
external systems that use a code generator from swagger.

Fixes #5347

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2019-12-11 12:48:29 +02:00
Nadav Har'El
2abac32f2e Merged: alternator: Implement CONTAINS and NOT_CONTAINS in Expected
Merged pull request https://github.com/scylladb/scylla/pull/5447
by Dejan Mircevski.

Adds the last missing operators in the "Expected" parameter and re-enables
their tests.

Fixes #5034.
2019-12-11 12:48:29 +02:00
Cem Sancak
86b8036502 Fix DPDK mode in prepare script
Fixes #5455.
2019-12-11 12:48:29 +02:00
Calle Wilund
35089da983 conf/config: Add better descriptive text on server/client encryption
Provide some explanation on prio strings + direction to gnutls manual.
Document client auth option.
Remove confusing/misleading statement on "custom options"

Message-Id: <20191210123714.12278-1-calle@scylladb.com>
2019-12-11 12:48:28 +02:00
Dejan Mircevski
32af150f1d alternator: Implement NOT_CONTAINS operator in Expected
Enable existing NOT_CONTAINS test, add NOT_CONTAINS to the list of
recognized operators, implement check_NOT_CONTAINS, and hook it up to
verify_expected_one().

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2019-12-10 15:31:47 -05:00
Dejan Mircevski
bd2bd3c7c8 alternator: Implement CONTAINS operator in Expected
Enable existing CONTAINS test, implement check_CONTAINS, and hook it
up to verify_expected_one().

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2019-12-10 15:31:47 -05:00
Dejan Mircevski
5a56fd384c config: Add experimental_features option
When the user wants to turn on only some experimental features, they
can use this new option.  The existing `experimental` option is
preserved for backwards compatibility.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2019-12-10 11:47:03 -05:00
Piotr Sarna
9504bbf5a4 alternator: move unwrap_set to serialization header
The utility function for unwrapping a set is going to be useful
across source files, so it's moved to serialization.hh/serialization.cc.
2019-12-10 15:08:47 +01:00
Piotr Sarna
4660e58088 alternator: move rjson value comparison to rjson.hh
The comparison struct is going to be useful across source files,
so it's moved into rjson header, where it conceptually belongs anyway.
2019-12-10 15:08:47 +01:00
Botond Dénes
db0e2d8f90 scylla-gdb.py: document and add safety net to seastar::thread related commands
Almost all commands provided by `scylla-gdb.py` are safe to use. The
worst that could happen if they fail is that you won't get the desired
information. There is one notable exception: `scylla thread`. If
anything goes wrong while this command is executed - gdb crashes, a bug
in the command, etc. - there is a good chance the process under
examination will crash. Sometimes this is fine, but other times e.g.
when live debugging a production node, this is unacceptable.
To avoid any accidents add documentation to all commands working with
`seastar::thread`. And since most people don't read documentation,
especially when debugging under pressure, add a safety net to the
`scylla thread` command. When run, this command will now warn of the
dangers and will ask for explicit acknowledgment of the risk of crash,
by means of passing an `--iamsure` flag. When this flag is missing, it
will refuse to run. I am sure this will be very annoying but I am also
sure that the avoided crashes are worth it.

As part of making `scylla thread` safe, its argument parsing code is
migrated to `argparse`. This changes the usage but this should be fine
because it is well documented.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20191129092838.390878-1-bdenes@scylladb.com>
2019-12-10 11:51:57 +02:00
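The `--iamsure` safety net described in db0e2d8f90 can be approximated with plain argparse. This is a standalone sketch that does not use the gdb module; the positional "thread" argument, the function names, and the messages are assumptions, and only the `--iamsure` flag itself comes from the commit message.

```python
import argparse


def parse_thread_args(argv):
    parser = argparse.ArgumentParser(prog="scylla thread")
    parser.add_argument("--iamsure", action="store_true",
                        help="acknowledge that the inspected process may crash")
    # Hypothetical positional argument naming the thread to switch to.
    parser.add_argument("thread", nargs="?")
    return parser.parse_args(argv)


def scylla_thread(argv):
    args = parse_thread_args(argv)
    if not args.iamsure:
        # Refuse to do anything risky without explicit acknowledgment.
        return "refusing to run: pass --iamsure to acknowledge the risk of a crash"
    return f"switching to thread {args.thread}"
```

Calling scylla_thread(["0x600"]) only returns the refusal message, whereas scylla_thread(["--iamsure", "0x600"]) proceeds.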
Eliran Sinvani
765db5d14f build_ami: Trim ami description attribute to the allowed size
The ami description attribute is only allowed to be 255
characters long. When build_ami.sh generates an ami, it
generates an ami description which is a concatenation
of all of the components' version strings. It can
happen that the description string is too long which
eventually causes the ami build to fail. This patch
trims the description string to 255 characters.
It is ok since the individual versions of the components
are also saved in tags attached to the image.

Tests:
 1. Reproduced with a long description and
    validated that it doesn't fail after the fix.

Fixes #5435

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <20191209141143.28893-1-eliransin@scylladb.com>
2019-12-10 11:51:57 +02:00
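The trimming described in 765db5d14f amounts to a one-line truncation. The real script is a shell script, so the Python below is only a sketch of the idea; the function name and the " | " separator are hypothetical, and only the 255-character limit comes from the commit message.

```python
MAX_AMI_DESCRIPTION = 255  # limit stated in the commit message


def ami_description(component_versions):
    # Concatenate all component version strings, then cut at the allowed size.
    description = " | ".join(component_versions)
    return description[:MAX_AMI_DESCRIPTION]
```

Truncation loses nothing essential here because, as the commit message notes, the individual component versions are also stored in tags on the image.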
Fabiano Lucchese
4333b37f9e scylla_setup: Support for enforcing optimal Linux clocksource setting (#5379)
A Linux machine typically has multiple clocksources with distinct
performance characteristics. Selecting a high-performance clocksource might
result in better performance for ScyllaDB, so this should be considered
whenever starting it up.

This patch introduces the possibility of enforcing an optimized Linux
clocksource in Scylla's setup/start-up processes. It does so by adding
an interactive question about enforcing the clocksource setting to scylla_setup,
which modifies the parameter "CLOCKSOURCE" in the scylla_server configuration
file. This parameter is read by perftune.py which, if it is set to "yes",
proceeds to set the clocksource (non-persistently). On x86, the TSC clocksource
is used.

Fixes #4474
2019-12-10 11:51:57 +02:00
Pavel Emelyanov
3a21419fdb features: Remove _FEATURE suffix from hinted_handoff feature name
All the other features are named w/o one. The internal const-s
are all different, but I'm fixing it separately.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20191209154310.21649-1-xemul@scylladb.com>
2019-12-10 11:51:57 +02:00
Dejan Mircevski
a26bd9b847 utils: Add enum_option
This allows us to accept command-line options with a predefined set of
valid arguments.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2019-12-09 09:45:59 -05:00
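The idea behind a26bd9b847, an option that accepts only a predefined set of values and can be given several times, can be illustrated with argparse. This is an analogy in Python rather than the C++ enum_option itself, and the feature names in EXPERIMENTAL_FEATURES are assumptions made for the example.

```python
import argparse

# Hypothetical feature names; the commit message does not list the real ones.
EXPERIMENTAL_FEATURES = ("lwt", "cdc", "udf")


def parse_features(argv):
    parser = argparse.ArgumentParser(prog="scylla-sketch")
    # Each occurrence appends one feature; values outside `choices` are rejected.
    parser.add_argument("--experimental-features", action="append",
                        choices=EXPERIMENTAL_FEATURES, default=[])
    return parser.parse_args(argv).experimental_features
```

An unknown value makes argparse exit with a usage error, which mirrors the goal of failing early on an invalid argument instead of silently ignoring it.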
Calle Wilund
7c5e4c527d cdc_test: Add small test for altering base schema (add column) 2019-12-09 14:35:04 +00:00
Calle Wilund
cb0117eb44 cdc: Handle schema changes via migration manager callbacks
This allows us to create/alter/drop log and desc tables "atomically"
with the base, by including these mutations in the original mutation
set, i.e. batch create/alter tables.

Note that population does not happen until types are actually
already put into database (duh), thus there _is_ still a gap
between creating cdc and it being truly usable. This may or may
not need handling later.
2019-12-09 14:35:04 +00:00
Rafael Ávila de Espíndola
761b19cee5 build: Split the build and host linker flags
A general build system knows about 3 machines:

* build: where the building is running
* host: where the built software will run
* target: the machine the software will produce code for

The target machine is only relevant for compilers, so we can ignore
it.

Until now we could ignore the build and host distinction too. This
patch adds the first difference: don't use host ld_flags when linking
build tools (gen_crc_combine_table).

The reason for this change is to make it possible to build with
-Wl,--dynamic-linker pointing to a path that will exist on the host
machine, but may not exist on the build machine.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20191207030408.987508-1-espindola@scylladb.com>
2019-12-09 15:54:57 +02:00
Calle Wilund
27183f648d migration_manager: Invoke "before" callbacks for table operations
Potentially allowing (cdc) augmentation of mutations.

Note: only does the listener part in seastar::thread, to avoid
changing call behaviour.
2019-12-09 12:12:09 +00:00
Calle Wilund
f78a3bf656 migration_listener: Add empty base class and "before" callbacks for tables
Empty base type makes for less boiler plate in implementations.
The "before" callbacks are for listeners who need to potentially
react/augment type creation/alteration _before_ actually
committing type to schema tables (and holding the semaphore for this).

I.e. it is for cdc to add/modify log/desc tables "atomically" with base.
2019-12-09 12:12:09 +00:00
Calle Wilund
4e406105b1 cql_test_env: Include cdc service in cql tests 2019-12-09 12:12:09 +00:00
Calle Wilund
a21e140169 cdc: Add sharded service that does nothing.
But can be used to hang functionality into eventually.
2019-12-09 12:12:09 +00:00
Calle Wilund
2787b0c4f8 cdc: Move "options" to separate header to avoid too much header inclusion
cdc should not contaminate the whole universe.
2019-12-09 12:12:09 +00:00
fastio
8f326b28f4 Redis: Combine all the source files redis/commands/* into redis/commands.{hh,cc}
Fixes: #5394

Signed-off-by: Peng Jian <pengjian.uestc@gmail.com>
2019-12-08 13:54:33 +02:00
Avi Kivity
9c63cd8da5 sysctl: reduce kernel tendency to swap anonymous pages relative to page cache (#5417)
The vm.swappiness sysctl controls the kernel's preference for swapping
anonymous memory vs page cache. Since Scylla uses very large amounts
of anonymous memory, and tiny amounts of page cache, the correct setting
is to prefer swapping page cache. If the kernel swaps anonymous memory
the reactor will stall until the page fault is satisfied. On the other
hand, page cache pages usually belong to other applications, usually
backup processes that read Scylla files.

This setting has been used in production in Scylla Cloud for a while
with good results.

Users can opt out by not installing the scylla-kernel-conf package
(same as with the other kernel tunables).
2019-12-08 13:04:25 +02:00
Avi Kivity
0e319e0359 Update seastar submodule
* seastar 166061da3...e440e831c (8):
  > Fail tests on ubsan errors
  > future: make a couple of asserts more strict
  > future: Move make_ready out of line
  > config: Do not allow zero rates
Fixes #5360
  > future: add new state to avoid temporaries in get_available_state().
  > future: avoid temporary future_state on get_available_state().
  > future: inline future::abandoned
  > noncopyable_function: Avoid uninitialized warning on empty types
2019-12-06 18:33:23 +02:00
Piotr Sarna
0718ff5133 Merge 'min/max on collections returns human-readable result' from Juliusz
Previously, scylla used min/max(blob)->blob overload for collections,
tuples and UDTs; effectively making the results being printed as blobs.
This PR adds "dynamically"-typed min()/max() functions for compound types.

These types can be complicated, like map<int,set<tuple<..., and created
in runtime, so functions for them are created on-demand,
similarly to tojson(). The comparison remains unchanged - underneath
this is still byte-by-byte weak lex ordering.

Fixes #5139

* jul-stas/5139-minmax-bad-printing-collections:
  cql_query_tests: Added tests for min/max/count on collections
  cql3: min()/max() for collections/tuples/UDTs do not cast to blobs
2019-12-06 16:40:17 +01:00
Juliusz Stasiewicz
75955beb0b cql_query_tests: Added tests for min/max/count on collections
This tests the new min/max functions for collections and tuples. CFs
in the test suite were named according to the types being tested, e.g.
`cf_map<int,text>', which is not a valid CF name. Therefore, these
names required "escaping" of invalid characters, here: simply
replacing with '_'.
2019-12-06 12:15:49 +01:00
Juliusz Stasiewicz
9efad36fb8 cql3: min()/max() for collections/tuples/UDTs do not cast to blobs
Before:
cqlsh> insert into ks.list_types (id, val) values (1, [3,4,5]);
cqlsh> select max(val) from ks.list_types;

 system.max(val)
------------------------------------------------------------
 0x00000003000000040000000300000004000000040000000400000005

After:
cqlsh> select max(val) from ks.list_types;

 system.max(val)
--------------------
 [3, 4, 5]

This is accomplished similarly to `tojson()`/`fromjson()`: functions
are generated on demand from within `cql3::functions::get()`.
Because collections can have a variety of types, including UDTs
and tuples, it would be impossible to statically define max(T t)->T
for every T. Until now, max(blob)->blob overload was used.

Because `impl_max/min_function_for` is templated with the
input/output type, which can be defined in runtime, we need type-erased
("dynamic") versions of these functors. They work identically, i.e.
they compare byte representations of lhs and rhs with
`bytes::operator<`.

Resolves #5139
2019-12-06 12:14:51 +01:00
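The behaviour described in 9efad36fb8, comparing serialized bytes but returning the value in its own type, can be sketched as follows. The list layout used here (a 4-byte element count, then a 4-byte length and a 4-byte big-endian value per element) is inferred from the blob printed in the "Before" output above; the function names are hypothetical.

```python
import struct


def serialize_int_list(values):
    # Element count, then each element as a 4-byte length plus a 4-byte int.
    out = struct.pack(">i", len(values))
    for v in values:
        out += struct.pack(">i", 4) + struct.pack(">i", v)
    return out


def deserialize_int_list(blob):
    (count,) = struct.unpack_from(">i", blob, 0)
    # Element i has its length at offset 4 + 8*i and its value at 8 + 8*i.
    return [struct.unpack_from(">i", blob, 8 + 8 * i)[0] for i in range(count)]


def max_list(rows):
    # The comparison stays byte-by-byte on the serialized form; only the
    # result is deserialized back into a list for display.
    return deserialize_int_list(max(serialize_int_list(r) for r in rows))
```

Serializing [3, 4, 5] this way reproduces the hex string from the "Before" output, and max_list returns the list itself rather than that blob.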
Avi Kivity
a18a921308 docs: maintainer.md: use command line to merge multi-commit pull requests
If you merge a pull request that contains multiple patches via
the github interface, it will document itself as the committer.

Work around this brain damage by using the command line.
2019-12-06 10:59:46 +01:00
Botond Dénes
7b37a700e1 configure.py: make tests explicitly depend on libseastar_testing.a
So that changes to libseastar_testing.a make all test targets out of
date.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20191205142436.560823-1-bdenes@scylladb.com>
2019-12-05 19:30:34 +02:00
Piotr Sarna
3a46b1bb2b Merge "handle hints on separate connection and scheduling group" from Piotr
Introduce a new verb dedicated for receiving and sending hints: HINT_MUTATION. It is handled on the streaming connection, which is separate from the one used for handling mutations sent by coordinator during a write.

The intent of using a separate connection is to increase fairness while handling hints and user requests - this way, a situation can be avoided in which one type of request saturates the connection, negatively impacting the other one.

Information about new RPC support is propagated through new gossip feature HINTED_HANDOFF_SEPARATE_CONNECTION.

Fixes #4974.

Tests: unit(release)
2019-12-05 17:25:26 +01:00
Calle Wilund
c11874d851 gms::inet_address: Use special ostream formatting to match Java
To make gms::inet_address::to_string() similar in output to origin.
The sole purpose being quick and easy fix of API/JMX ipv6
formatting of endpoints etc, where strings are used as lexical
comparisons instead of textual representation.

A better, but more work, solution is to fix the scylla-jmx
bridge to do explicit parse + re-format of addresses, but there
are many such callpoints.

An even better solution would be to fix nodetool to not make this
mistake of doing lexical comparisons, but then we risk breaking
merge compatibility. But could be an option for a separate
nodeprobe impl.

Message-Id: <20191204135319.1142-1-calle@scylladb.com>
2019-12-05 17:01:26 +02:00
Gleb Natapov
4893bc9139 tracing: split adding prepared query parameters from stopping of a trace
Currently a query_options object is passed to a trace stopping function,
which makes it mandatory to keep it alive until the end of the
query. The reason for that is to add prepared statement parameters to
the trace. All other query options that we want to put in the trace are
copied into trace_state::params_values, so let's copy prepared statement
parameters there too. The trace-enabled case will become a little bit more
expensive, but on the other hand we can drop a continuation that keeps the
query_options object alive from the fast path. It is safe to drop the call
to stop_foreground_prepared() here since the tracing will be stopped
in process_request_one().

Message-Id: <20191205102026.GJ9084@scylladb.com>
2019-12-05 17:00:47 +02:00
Tomasz Grabiec
aa173898d6 Merge "Named semaphores in concurrency reader, segment_manager and region_group" from Juliusz
Selected semaphores' names are now included in exception messages in
case of timeout or when admission queue overflows.

Resolves #5281
2019-12-05 14:19:56 +01:00
Nadav Har'El
5b2f35a21a Merge "Redis: fix the options related to Redis API, fix the DEL and GET command"
Merged pull request https://github.com/scylladb/scylla/pull/5381 by
Peng Jian, fixing multiple small issues with Redis:

* Rename the options related to Redis API, and describe them clearly.
* Rename redis_transport_port to redis_port
* Rename redis_transport_port_ssl to redis_ssl_port
* Rename redis_default_database_count to redis_database_count
* Remove unnecessary option enable_redis_protocol
* Change the default value of the options redis_read_consistency_level and redis_write_consistency_level to LOCAL_QUORUM

* Fix the DEL command: support deleting multiple keys in one command.

* Fix the GET command: return the empty string when the requested key does not exist.

* Fix the redis-test/test_del_non_existent_key: mark xfail.
2019-12-05 11:58:34 +02:00
Avi Kivity
85822c7786 database: fix schema use-after-move in make_multishard_streaming_reader
On aarch64, asan detected a use-after-move. It doesn't happen on x86_64,
likely due to different argument evaluation order.

Fix by evaluating full_slice before moving the schema.

Note: I used "auto&&" and "std::move()" even though full_slice()
returns a reference. I think this is safer in case full_slice()
changes, and works just as well with a reference.

Fixes #5419.
2019-12-05 11:58:34 +02:00
Piotr Sarna
79c3a508f4 table: Reduce read amplification in view update generation
This commit makes sure that single-partition readers for
read-before-write do not have fast-forwarding enabled,
as it may lead to huge read amplification. The observed case was:
1. Creating an index.
  CREATE INDEX index1  ON myks2.standard1 ("C1");
2. Running cassandra-stress in order to generate view updates.
cassandra-stress write no-warmup n=1000000 cl=ONE -schema \
  'replication(factor=2) compaction(strategy=LeveledCompactionStrategy)' \
  keyspace=myks2 -pop seq=4000000..8000000 -rate threads=100 -errors
  skip-read-validation -node 127.0.0.1;

Without disabling fast-forwarding, single-partition readers
were turned into scanning readers in cache, which resulted
in reading 36GB (sic!) on a workload which generates less
than 1GB of view updates. After applying the fix, the number
dropped down to less than 1GB, as expected.

Refs #5409
Fixes #4615
Fixes #5418
2019-12-05 11:58:34 +02:00
Konstantin Osipov
6a5e7c0e22 tests: reduce the number of iterations of dynamic_bitset_test
This test's execution time dominates, by a serious margin, the overall
test execution time in dev/release mode; reducing it
improves the test.py turnaround by over 70%.

Message-Id: <20191204135315.86374-2-kostja@scylladb.com>
2019-12-05 11:58:34 +02:00
Avi Kivity
07427c89a2 gdb: change 'scylla thread' command to access fs_base register directly
Currently, 'scylla thread' uses arch_prctl() to extract the value of
fsbase, used to reference thread local variables. gdb 8 added support
for directly accessing the value as $fs_base, so use that instead. This
works from core dumps as well as live processes, as you don't need to
execute inferior functions.

The patch is required for debugging threads in core dumps, but not
sufficient, as we still need to set $rip and $rsp, and gdb still[1]
doesn't allow this.

[1] https://sourceware.org/bugzilla/show_bug.cgi?id=9370
2019-12-05 11:58:34 +02:00
Piotr Dulikowski
adfa7d7b8d messaging_service: don't move unsigned values in handlers
Performing std::move on integral types is pointless. This commit gets
rid of moves of values of `unsigned` type in rpc handlers.
2019-12-05 00:58:31 +01:00
Piotr Dulikowski
77d2ceaeba storage_proxy: handle hints through separate rpc verb 2019-12-05 00:51:52 +01:00
Piotr Dulikowski
2609065090 storage_proxy: move register_mutation handler to local lambda
This refactor makes it possible to reuse the lambda in following
commits.
2019-12-05 00:51:52 +01:00
Piotr Dulikowski
6198ee2735 hh: introduce HINTED_HANDOFF_SEPARATE_CONNECTION feature
The feature introduced by this commit declares that hints can be sent
using the new dedicated RPC verb. Before using the new verb, nodes need
to know if other nodes in the cluster will be able to handle the new
RPC verb.
2019-12-05 00:51:52 +01:00
Piotr Dulikowski
2e802ca650 hh: add HINT_MUTATION verb
Introduce a new verb dedicated for receiving and sending hints:
HINT_MUTATION. It is handled on the streaming connection, which is
separate from the one used for handling mutations sent by coordinator
during a write.

The intent of using a separate connection is to increase fairness while
handling hints and user requests - this way, a situation can be avoided
in which one type of request saturates the connection, negatively
impacting the other one.
2019-12-05 00:51:49 +01:00
Avi Kivity
fd951a36e3 Merge "Let compaction wait on background deletions" from Benny
"
In several cases in distributed testing (dtest) we trigger compaction using nodetool compact assuming that when it is done, it is indeed really done.
However, the way compaction is currently implemented in scylla, it may leave behind some background tasks to delete the old sstables that were compacted.

This commit changes major compaction (triggered via the ss::force_keyspace_compaction api) so it would wait on the background deletes and will return only when they finish.

Fixes #4909

Tests: unit(dev), nodetool_refresh_with_data_perms_test, test_nodetool_snapshot_during_major_compaction
"
2019-12-04 11:18:41 +02:00
Takuya ASADA
c9d8606786 dist/common/scripts/scylla_ntp_setup: relax RHEL version check
We may be able to use the chrony setup script on future versions of RHEL/CentOS,
so it is better to run chrony setup when the RHEL version is >= 8, not only 8.

Note that Fedora still provides the ntp/ntpdate package, so we run
ntp setup on it for now. (same on debian variants)

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20191203192812.5861-1-syuu@scylladb.com>
2019-12-04 10:59:14 +02:00
Juliusz Stasiewicz
430b2ad19d commitlog+region_group: timeout exceptions with names
`segment_manager' now uses a decorated version of `timed_out_error'
with hardcoded name. On the other hand `region_group' uses named
`on_request_expiry' within its `expiring_fifo'.
2019-12-03 19:07:19 +01:00
Avi Kivity
91d3f2afce docs: maintainers.md: fix typo in git push --force-with-lease
Just one lease, not many.

Reported by Piotr Sarna.
2019-12-03 18:17:46 +01:00
Calle Wilund
56a5e0a251 commitlog_replayer: Ensure applied frozen_mutation is safe during apply
Fixes #5211

In 79935df959 replay apply-call was
changed from one with no continuation to one with. But the frozen
mutation arg was still just lambda local.

Change to use do_with for this case as well.

Message-Id: <20191203162606.1664-1-calle@scylladb.com>
2019-12-03 18:28:01 +02:00
Juliusz Stasiewicz
d043393f52 db+semaphores+tests: mandatory `name' param in reader_concurrency_semaphore
Exception messages contain semaphore's name (provided in ctor).
This affects the queue overflow exception as well as timeout
exception. Also, custom throwing function in ctor was changed
to `prethrow_action', i.e. metrics can still be updated there but
now callers have no control over the type of the exception being
thrown. This affected `restricted_reader_max_queue_length' test.
`reader_concurrency_semaphore'-s docs are updated accordingly.
2019-12-03 15:41:34 +01:00
Amos Kong
e26b396f16 scylla-docker: fix default data_directories in scyllasetup.py (#5399)
Use default data_file_directories if it's not assigned in scylla.yaml

Fixes #5398

Signed-off-by: Amos Kong <amos@scylladb.com>
2019-12-03 13:58:17 +02:00
Rafael Ávila de Espíndola
1cd17887fa build: strip debug when configured with --debuginfo 0
In a build configured with --debuginfo 0 the scylla binary still ends
up with some debug info from the libraries that are statically linked
in.

We should avoid compiling subprojects (including seastar) with debug
info when none is needed, but this at least avoids it showing up in
the binary.

The main motivation for this is that it is confusing to get a binary
with *some* debug info in it.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20191127215843.44992-1-espindola@scylladb.com>
2019-12-03 12:41:04 +02:00
Tomasz Grabiec
0a453e5d30 Merge "Use fragmented buffers for collection de/serialization" from Botond
This series refactors the collection de/serialization code to use
fragmented buffers, avoiding the large allocations and the associated
pains when working with large collections. Currently all operations that
involve collections require deserializing them, executing the operation,
then serializing them again to their internal storage format. The
de/serialization operations happen in linearized buffers, which means
that we have to allocate a buffer large enough to hold the *entire*
collection. This can cause immense pressure on the memory allocator,
which, in the face of memory fragmentation, might be unable to serve the
allocation at all. We've seen this causing all sorts of nasty problems,
including but not limited to: failing compactions, failing memtable
flush, OOM crashes, etc.

Users are strongly discouraged from using large collections, yet they
are still a fact of life and have been haunting us since forever.

The proper solution for these problems would be to come up with an
in-memory format for collections, however that is a major effort, with a
lot of unknowns. This is something we plan on doing at some point but
until it happens we should make life less painful for those with large
collections.

The goal of this series is to avoid the need of allocating these large
buffers. Serialization now happens into a `bytes_ostream` which
automatically fragments the values internally. Deserialization happens
with `utils::linearizing_input_stream` (introduced by this series), which
linearizes only the individual collection cells, but not the entire
collection.
An important goal of this series was to introduce the least amount of
risk, and hence the least amount of code. This series does not try to
make a revolution and completely revamp and optimize the
de/serialization codepaths. These codepaths have their days numbered so
investing a lot of effort into them is in vain. We can apply incremental
optimizations where we deem it necessary.

Fixes: #5341
2019-12-03 10:31:34 +01:00
fastio
01599ffbae Redis API: Support the syntax of deleting multiple keys in one DEL command, fix the return value for GET command.
Support deleting multiple keys in one DEL command.
Returning the number of keys actually deleted is still not supported.
Return an empty string to the client for the GET command when the requested key does not exist.

Fixes: #5334

Signed-off-by: Peng Jian <pengjian.uestc@gmail.com>
2019-12-03 17:27:40 +08:00
fastio
039b83ad3b Redis API: Rename options related to Redis API, describe them clearly, and remove unnecessary one.
Rename option redis_transport_port to redis_port, which the redis transport listens on for clients.
Rename option redis_transport_port_ssl to redis_ssl_port, which the redis TLS transport listens on for clients.
Rename option redis_database_count. Set the redis database count.
Rename option redis_keyspace_opitons to redis_keyspace_replication_strategy_options. Set the replication strategy for redis keyspace.
Remove option enable_redis_protocol, which is unnecessary.

Fixes: #5335

Signed-off-by: Peng Jian <pengjian.uestc@gmail.com>
2019-12-03 17:13:35 +08:00
Nadav Har'El
7b93360c8d Merge: redis: skip processing request of EOF
Merged pull request https://github.com/scylladb/scylla/pull/5393/ by
Amos Kong:
`
When I test redis commands with echo and nc, there is a redundant error at the end.
I checked with strace: currently, if the client reads nothing from stdin, it
shuts down the socket, and the redis server reads nothing (0 bytes) from the socket. But
it still tries to process the empty command and returns an error.

$ echo -n -e '*1\r\n$4\r\nping\r\n' |strace nc localhost 6379
| ...
|    read(0, "*1\r\n$4\r\nping\r\n", 8192)   = 14
|    select(5, [4], [4], [], NULL)           = 1 (out [4])
|>>> sendto(4, "*1\r\n$4\r\nping\r\n", 14, 0, NULL, 0) = 14
|    select(5, [0 4], [], [], NULL)          = 1 (in [0])
|    recvfrom(0, 0x7ffe4d5b6c70, 8192, 0, 0x7ffe4d5b6bf0, 0x7ffe4d5b6bec) = -1 ENOTSOCK (Socket operation on non-socket)
|    read(0, "", 8192)                       = 0
|>>> shutdown(4, SHUT_WR)                    = 0
|    select(5, [4], [], [], NULL)            = 1 (in [4])
|    recvfrom(4, "+PONG\r\n-ERR unknown command ''\r\n", 8192, 0, 0x7ffe4d5b6bf0, [0]) = 32
|    write(1, "+PONG\r\n-ERR unknown command ''\r\n", 32+PONG
|    -ERR unknown command ''
|    ) = 32
|    select(5, [4], [], [], NULL)            = 1 (in [4])
|    recvfrom(4, "", 8192, 0, 0x7ffe4d5b6bf0, [0]) = 0
|    close(1)                                = 0
|    close(4)                                = 0

Current result:
  $ echo -n -e '' |nc localhost 6379
  -ERR unknown command ''
  $ echo -n -e '*1\r\n$4\r\nping\r\n' |nc localhost 6379
  +PONG
  -ERR unknown command ''

Expected:
  $ echo -n -e '' |nc localhost 6379
  $ echo -n -e '*1\r\n$4\r\nping\r\n' |nc localhost 6379
  +PONG
2019-12-03 10:40:20 +02:00
Avi Kivity
83feb9ea77 tools: toolchain: update frozen image
Commit 96009881d8 added diffutils to the dependencies via
Seastar's install-dependencies.sh, after it was inadvertently
dropped in 1164ff5329 (update to Fedora 31; diffutils is no
longer brought in as a side effect of something else).

Regenerate the image to include diffutils.

Ref #5401.
2019-12-03 10:36:55 +02:00
Amos Kong
fb9af2a86b redis-test: add test_raw_cmd.py
This patch adds subtests for EOF processing; they read and write the socket
directly using protocol commands.

We can add more tests in the future; tests that use the Redis module would hide some
protocol errors.

Signed-off-by: Amos Kong <amos@scylladb.com>
2019-12-03 10:47:56 +08:00
Amos Kong
4fa862adf4 redis: skip processing request of EOF
When I test redis commands with echo and nc, there is a redundant error at the end.
I checked with strace: currently, if the client reads nothing from stdin, it
shuts down the socket, and the redis server reads nothing (0 bytes) from the socket. But
it still tries to process the empty command and returns an error.

$ echo -n -e '*1\r\n$4\r\nping\r\n' |strace nc localhost 6379
| ...
|    read(0, "*1\r\n$4\r\nping\r\n", 8192)   = 14
|    select(5, [4], [4], [], NULL)           = 1 (out [4])
|>>> sendto(4, "*1\r\n$4\r\nping\r\n", 14, 0, NULL, 0) = 14
|    select(5, [0 4], [], [], NULL)          = 1 (in [0])
|    recvfrom(0, 0x7ffe4d5b6c70, 8192, 0, 0x7ffe4d5b6bf0, 0x7ffe4d5b6bec) = -1 ENOTSOCK (Socket operation on non-socket)
|    read(0, "", 8192)                       = 0
|>>> shutdown(4, SHUT_WR)                    = 0
|    select(5, [4], [], [], NULL)            = 1 (in [4])
|    recvfrom(4, "+PONG\r\n-ERR unknown command ''\r\n", 8192, 0, 0x7ffe4d5b6bf0, [0]) = 32
|    write(1, "+PONG\r\n-ERR unknown command ''\r\n", 32+PONG
|    -ERR unknown command ''
|    ) = 32
|    select(5, [4], [], [], NULL)            = 1 (in [4])
|    recvfrom(4, "", 8192, 0, 0x7ffe4d5b6bf0, [0]) = 0
|    close(1)                                = 0
|    close(4)                                = 0

Current result:
  $ echo -n -e '' |nc localhost 6379
  -ERR unknown command ''
  $ echo -n -e '*1\r\n$4\r\nping\r\n' |nc localhost 6379
  +PONG
  -ERR unknown command ''

Expected:
  $ echo -n -e '' |nc localhost 6379
  $ echo -n -e '*1\r\n$4\r\nping\r\n' |nc localhost 6379
  +PONG
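
The behaviour described above can be illustrated with a minimal Python sketch. It is not the Scylla server code (which is C++); `handle_stream`, `read_chunk` and `process` are hypothetical names used only to show that a zero-byte read is treated as EOF and never handed to the command processor.

```python
def handle_stream(read_chunk, process):
    """Process requests until the peer closes its side of the connection.

    read_chunk: callable returning the next bytes read from the socket
                (b"" signals EOF, as with a 0-byte recv()).
    process:    callable turning one request into one reply.
    """
    replies = []
    while True:
        data = read_chunk()
        if not data:
            # EOF: nothing was read, so there is no command to process
            # and no error reply should be generated.
            break
        replies.append(process(data))
    return replies
```

With this shape, a client that sends one PING and then shuts down its write side gets exactly one reply, matching the "Expected" output above.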

Signed-off-by: Amos Kong <amos@scylladb.com>
2019-12-03 10:47:56 +08:00
Rafael Ávila de Espíndola
bb114de023 dbuild: Fix confusion about relabeling
podman needs to relabel directories in exactly the same cases docker
does. The difference is that podman cannot relabel /tmp.

The reason it was working before is that in practice anyone using
dbuild has already relabeled any directories that need relabeling,
with the exception of /tmp, since it is recreated on every boot.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20191201235614.10511-2-espindola@scylladb.com>
2019-12-02 18:38:16 +02:00
Rafael Ávila de Espíndola
867cdbda28 dbuild: Use a temporary directory for /tmp
With this we don't have to use --security-opt label=disable.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20191201235614.10511-1-espindola@scylladb.com>
2019-12-02 18:38:14 +02:00
Botond Dénes
1d1f8b0d82 tests: mutation_test: add large collection allocation test
Checking that there are no large allocations when a large collection is
de/serialized.
2019-12-02 17:13:53 +02:00
Avi Kivity
28355af134 docs: add maintainer's handbook (#5396)
This is a list of recipes used by maintainers to maintain
scylla.git.
2019-12-02 15:01:54 +02:00
Calle Wilund
8c6d6254cf cdc: Remove some code from header 2019-12-02 13:00:19 +00:00
Botond Dénes
4c59487502 collection_mutation: don't linearize the buffer on deserialization
Use `utils::linearizing_input_stream` for the deserialization of the
collection. Allows for avoiding the linearization of the entire cell
value, instead only linearizing individual values as they are
deserialized from the buffer.
2019-12-02 10:10:31 +02:00
Botond Dénes
690e9d2b44 utils: introduce linearizing_input_stream
`linearizing_input_stream` allows transparently reading linearized
values from a fragmented buffer. This is done by linearizing on-the-fly
only those read values that happen to be split across multiple
fragments. This reduces the size of the largest allocation from the size
of the entire buffer (when the entire buffer is linearized) to the size
of the largest read value. This is a huge gain when the buffer contains
loads of small objects, and modest gains when the buffer contains few
large objects. But even in the worst case the size of the largest
allocation will be less than or equal to that of the case where the entire
buffer is linearized.
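
As a rough illustration of the idea, here is a Python sketch of a reader over a fragmented buffer. It is a hypothetical analogue, not the C++ `linearizing_input_stream` API: the class name `LinearizingReader` and the `largest_copy` counter are assumptions made for the example. A value that lies inside one fragment is returned as a view without copying; only a value split across fragments is copied into a contiguous buffer of its own size.

```python
class LinearizingReader:
    """Hypothetical sketch: read contiguous values from a list of fragments."""

    def __init__(self, fragments):
        # Empty fragments are dropped so that the current offset always
        # points at a real byte while data remains.
        self._fragments = [bytes(f) for f in fragments if len(f)]
        self._idx = 0          # index of the current fragment
        self._off = 0          # offset inside the current fragment
        self.largest_copy = 0  # size of the largest buffer we had to allocate

    def _advance(self, n):
        self._off += n
        if self._off == len(self._fragments[self._idx]):
            self._idx += 1
            self._off = 0

    def read(self, n):
        if self._idx < len(self._fragments):
            frag = self._fragments[self._idx]
            if len(frag) - self._off >= n:
                # Fast path: the value is contiguous, return a view (no copy).
                view = memoryview(frag)[self._off:self._off + n]
                self._advance(n)
                return view
        # Slow path: the value spans fragments, linearize just this value.
        out = bytearray()
        while len(out) < n:
            if self._idx >= len(self._fragments):
                raise EOFError("not enough data")
            frag = self._fragments[self._idx]
            take = min(n - len(out), len(frag) - self._off)
            out += frag[self._off:self._off + take]
            self._advance(take)
        self.largest_copy = max(self.largest_copy, n)
        return memoryview(bytes(out))
```

The largest allocation is bounded by the largest value read, never by the total size of the fragments.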

This stream is planned to be used as glue code between the fragmented
cell value and the collection deserialization code which expects to be
reading linearized values.
2019-12-02 10:10:31 +02:00
Botond Dénes
065d8d37eb tests: random-utils: get_string(): add overload that takes engine parameter 2019-12-02 10:10:31 +02:00
Botond Dénes
2f9307c973 collection_mutation: use a fragmented buffer for serialization
For the serialization `bytes_ostream` is used.
2019-12-02 10:10:31 +02:00
Botond Dénes
fc5b096f73 imr: value_writer::write_to_destination(): don't dereference chunk iterator eagerly
Currently the loop which writes the data from the fragmented origin to
the destination, moves to the next chunk eagerly after writing the value
of the current chunk, if the current chunk is exhausted.
This presents a problem when we are writing the last piece of data from
the last chunk, as the chunk will be exhausted and we eagerly attempt to
move to the next chunk, which doesn't exist and dereferencing it will
fail. The solution is to not be eager about moving to the next chunk and
only attempt it if we actually have more data to write and hence expect
more chunks.
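
The lazy-advance loop can be sketched in Python as follows. This is an illustration of the technique only, not the IMR code; `copy_chunks` is a hypothetical name. The iterator is advanced only when more bytes are still needed, so the step past the last chunk never happens.

```python
def copy_chunks(chunks, total_size):
    """Copy total_size bytes from an iterable of chunks into one buffer."""
    dest = bytearray()
    it = iter(chunks)
    current = b""
    offset = 0
    remaining = total_size
    while remaining > 0:
        if offset == len(current):
            # Advance lazily: we get here only if more data is still needed,
            # so a next chunk is expected to exist.
            current = next(it)
            offset = 0
        take = min(remaining, len(current) - offset)
        dest += current[offset:offset + take]
        offset += take
        remaining -= take
    return bytes(dest)
```

An eager variant would call `next(it)` right after exhausting a chunk, which fails on the final chunk even though all data has already been written.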
2019-12-02 10:10:31 +02:00
Botond Dénes
875314fc4b bytes_ostream: make it a FragmentRange
The presence of `const_iterator` seems to be a requirement as well
although it is not part of the concept. But perhaps it is just an
assumption made by code using it.
2019-12-02 10:10:31 +02:00
Botond Dénes
4054ba0c45 serialization: accept any CharOutputIterator
Not just bytes::output_iterator. Allow writing into streams other than
just `bytes`. In fact we should be very careful with writing into
`bytes` as they require potentially large contiguous allocations.

The `write()` method is now templatized also on the type of its first
argument, which now accepts any CharOutputIterator. Due to our poor
usage of namespaces this now collides with `write` defined inside
`db/commitlog/commitlog.cc`. Luckily, the latter doesn't really have to
be templatized on the data type it reads from, and de-templatizing it
resolves the clash.
2019-12-02 10:10:31 +02:00
Botond Dénes
07007edab9 bytes_ostream: add output_iterator
To allow it being used for serialization code, which works in terms of
output iterators.
2019-12-02 10:10:31 +02:00
Takuya ASADA
c5a95210fe dist/common/scripts/scylla_setup: list virtio-blk devices correctly on interactive RAID setup
Currently the interactive RAID setup prompt does not list virtio-blk devices for the
following reasons:
 - We fail to match the '-p' option in 'lsblk --help' output because of a misuse of the
   regex function, so list_block_devices() always skips using lsblk output.
 - We don't check for the existence of /dev/vd* when we skip using lsblk.
 - We mistakenly excluded virtio-blk devices from 'lsblk -pnr' output using the '-e'
   option, but we actually need them.

To fix the problem we need to use re.search() instead of re.match() to match the
'-p' option in 'lsblk --help', add '/dev/vd*' to the block device list,
and stop passing the '-e 252' option to lsblk, which excludes virtio-blk.

Additionally, it is better to parse the 'TYPE' field of lsblk output: we should skip
'loop' devices and 'rom' devices since these are not disk devices.
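
The re.match() versus re.search() difference can be shown with a small Python sketch. The help text and the pattern below are made up for illustration (they are not copied from scylla_setup or from real lsblk output): re.match() only tries the pattern at the very start of the string, while re.search() scans the whole text.

```python
import re

# Abbreviated, hypothetical excerpt standing in for 'lsblk --help' output.
HELP = """Usage:
 lsblk [options] [<device> ...]

Options:
 -n, --noheadings     don't print headings
 -p, --paths          print complete device path
"""

# Hypothetical pattern: an option line starting with '-p'.
OPTION_RE = re.compile(r"^\s*-p\b", re.MULTILINE)


def supports_paths_option(help_text):
    """Return True if the help text documents a '-p' option."""
    # search() scans every position; match() would only test position 0,
    # where the text begins with "Usage:" and can never match.
    return OPTION_RE.search(help_text) is not None
```

Here `OPTION_RE.match(HELP)` is `None` although the option is documented, which is the kind of false negative the commit describes.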

Fixes #4066

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20191201160143.219456-1-syuu@scylladb.com>
2019-12-01 18:36:48 +02:00
Takuya ASADA
124da83103 dist/common/scripts: use chrony as NTP server on RHEL8/CentOS8
We need to use chrony as NTP server on RHEL8/CentOS8, since it dropped
ntpd/ntpdate.

Fixes #4571

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20191101174032.29171-1-syuu@scylladb.com>
2019-12-01 18:35:03 +02:00
Nadav Har'El
b82417ba27 Merge "alternator: Implement Expected operators LE, GE, and BETWEEN"
Merged pull request https://github.com/scylladb/scylla/pull/5392 from
Dejan Mircevski.

Refs #5034

The patches:
  alternator: Implement LE operator in Expected
  alternator: Implement GE operator in Expected
  alternator: Make cmp diagnostic a value, not funct
  utils: Add operator<< for big_decimal
  alternator: Implement BETWEEN operator in Expected
2019-12-01 16:11:11 +02:00
Nadav Har'El
8614c30bcf Merge "implement echo command"
Merged pull request https://github.com/scylladb/scylla/pull/5387 from
Amos Kong:

This patch implements the echo command, which returns the string back to the client.

Reference:

    https://redis.io/commands/echo
2019-12-01 10:29:57 +02:00
Amos Kong
49fee4120e redis-test: add test_echo
Signed-off-by: Amos Kong <amos@scylladb.com>
2019-11-30 13:32:00 +08:00
Amos Kong
3e2034f07b redis: implement echo command
This patch implements the echo command, which returns the string back to the client.

Reference:
- https://redis.io/commands/echo

Signed-off-by: Amos Kong <amos@scylladb.com>
2019-11-30 13:30:35 +08:00
Dejan Mircevski
dcb1b360ba alternator: Implement BETWEEN operator in Expected
Enable existing BETWEEN test, and add some more coverage to it.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2019-11-29 16:47:21 -05:00
Dejan Mircevski
c43b286f35 utils: Add operator<< for big_decimal
... and remove an existing duplicate from lua.cc.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2019-11-29 15:32:09 -05:00
Dejan Mircevski
e0d77739cc alternator: Make cmp diagnostic a value, not funct
All check_compare diagnostics are static strings, so there's no need
to call functions to get them.  Instead of a function, make diagnostic
a simple value.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2019-11-29 15:09:05 -05:00
Dejan Mircevski
65cb84150a alternator: Implement GE operator in Expected
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2019-11-29 12:29:08 -05:00
Dejan Mircevski
f201f0eaee alternator: Implement LE operator in Expected
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2019-11-29 11:59:52 -05:00
Avi Kivity
96009881d8 Update seastar submodule
* seastar 8eb6a67a4...166061da3 (3):
  > install-dependencies.sh: add diffutils
  > reactor: replace std::optional (in _network_stack_ready) with compat::optional
  > noncopyable_function: disable -Wuninitialized warning in noncopyable_function_base

Ref #5386.
2019-11-29 12:50:48 +02:00
Tomasz Grabiec
6562c60c86 Merge "test.py: terminate children upon signal" from Kostja
Allows a signal to terminate the outstanding
test tasks, to avoid dangling children.
2019-11-29 12:05:03 +02:00
Pekka Enberg
bb227cf2b4 Merge "Fix default directories in Scylla setup scripts" from Amos
"Fix two problems in scylla_io_setup:

 - Problem 1: paths of default directories are invalid, introduced by
   commit 5ec1915 ("scylla_io_setup: assume default directories under
   /var/lib/scylla").

 - Problem 2: wrong path join, introduced by commit 31ddb21
   ("dist/common/scripts: support nonroot mode on setup scripts").

Fix a problem in scylla_io_setup, scylla_fstrim and scylla_blocktune.py:

  - Fixed default scylla directories when they aren't assigned in
    scylla.yaml"

Fixes #5370

Reviewed-by: Pavel Emelyanov <xemul@scylladb.com>

* 'scylla_io_setup' of git://github.com/amoskong/scylla:
  use parse_scylla_dirs_with_default to get scylla directories
  scylla_io_setup: fix data_file_directories check
  scylla_util: introduce helper to process the default scylla directories
  scylla_util: get workdir by datadir() if it's not assigned in scylla.yaml
  scylla_io_setup: fix path join of default scylla directories
2019-11-29 12:05:03 +02:00
Ultrabug
61f1e6e99c test.py: fix undefined variable 'options' in write_xunit_report() 2019-11-28 19:06:22 +03:00
Ultrabug
5bdc0386c4 test.py: comparison to False should be 'if cond is False:' 2019-11-28 19:06:22 +03:00
Ultrabug
737b1cff5e test.py: use isinstance() for type comparison 2019-11-28 19:06:22 +03:00
Konstantin Osipov
c611325381 test.py: terminate children upon signal
Use asyncio as a more modern way to work with concurrency.
Process signals in an event loop, and terminate all outstanding
tests before exiting.

Breaking change: this commit requires Python 3.7 or
newer to run this script. The patch adds a version
check and a message to enforce it.
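
A minimal sketch of the pattern, assuming Python 3.7 or newer. It is not the test.py implementation: `run_tests` and `install_stop` are hypothetical names, and the sketch cancels asyncio tasks rather than killing child processes. In a real script `install_stop` could register the callback with `loop.add_signal_handler(signal.SIGINT, callback)`.

```python
import asyncio


async def run_tests(test_coros, install_stop):
    """Run tests concurrently; cancel the outstanding ones on a stop request.

    install_stop: callable that receives a zero-argument callback and
                  arranges for it to be invoked when a signal arrives.
    """
    stop = asyncio.Event()  # created inside the running loop on purpose
    install_stop(stop.set)

    tasks = [asyncio.ensure_future(c) for c in test_coros]
    all_done = asyncio.ensure_future(
        asyncio.gather(*tasks, return_exceptions=True))
    stopper = asyncio.ensure_future(stop.wait())

    await asyncio.wait([all_done, stopper],
                       return_when=asyncio.FIRST_COMPLETED)
    if not all_done.done():
        # A stop was requested: terminate everything still running.
        for task in tasks:
            task.cancel()
    results = await all_done  # cancelled tests show up as CancelledError
    stopper.cancel()
    return results
```

Finished tests keep their results, while tests that were still running when the stop request arrived are reported as `asyncio.CancelledError` instances.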
2019-11-28 19:06:22 +03:00
Botond Dénes
cf24f4fe30 imr: move documentation to docs/
Where all the other documentation is, and hence where people would be
looking for it.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20191128144612.378244-1-bdenes@scylladb.com>
2019-11-28 16:47:52 +02:00
Avi Kivity
36dd0140a8 Update seastar submodule
* seastar 5c25de907a...8eb6a67a4b (1):
  > util/backtrace.hh: add missing print.hh include
2019-11-28 16:47:16 +02:00
Benny Halevy
7aef39e400 tracing: one_session_records: keep local tracing ptr
Similar to trace_state keep shared_ptr<tracing> _local_tracing_ptr
in one_session_records when constructed so it can be used
during shutdown.

Fixes #5243

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2019-11-28 15:24:10 +01:00
Gleb Natapov
75499896ab client_state: store _user as optional instead of shared_ptr
_user cannot outlive client_state class instance, so there is no point
in holding it in shared_ptr.

Tested: debug test.py and dtest auth_test.py

Message-Id: <20191128131217.26294-5-gleb@scylladb.com>
2019-11-28 15:48:59 +02:00
Gleb Natapov
1538cea043 cql: modification_statement: store _restrictions as optional instead of shared_ptr
_restrictions can be optional since its lifetime is managed by
modification_statement class explicitly.

Message-Id: <20191128131217.26294-4-gleb@scylladb.com>
2019-11-28 15:48:54 +02:00
Gleb Natapov
ce5d6d5eee storage_service: store thrift server as an optional instead of shared_ptr
Only do_stop_rpc_server uses the shared_ptr to prolong server's
lifetime until stop() completes, but do_with() can be used to achieve the
same.

Message-Id: <20191128131217.26294-3-gleb@scylladb.com>
2019-11-28 15:48:51 +02:00
Gleb Natapov
b9b99431a8 storage_service: store cql server as an optional instead of shared_ptr
Only do_stop_native_transport() uses the shared_ptr to prolong server's
lifetime until stop() completes, but do_with() can be used to achieve the
same.

Message-Id: <20191128131217.26294-2-gleb@scylladb.com>
2019-11-28 15:48:47 +02:00
Avi Kivity
2b7e97514a Update seastar submodule
* seastar 6f0ef32514...5c25de907a (7):
  > shared_future: Fix crash when all returned futures time out
Fixes #5322.
  > future: don't create temporaries on get_value().
  > reactor: lower the default stall threshold to 200ms
  > reactor: Simplify network initialization
  > reactor: Replace most std::function with noncopyable_function
  > futures: Avoid extra moves in SEASTAR_TYPE_ERASE_MORE mode
  > inet_address: Make inet_address == operator ignore scope (again)
2019-11-28 14:48:01 +02:00
Juliusz Stasiewicz
fa12394dfe reader_concurrency_semaphore: cosmetic changes
Added line breaks, replaced unused include, included seastarx.hh
instead of `using namespace seastar`.
2019-11-28 13:39:08 +01:00
Nadav Har'El
fde336a882 Merged "5139 minmax bad printing"
Merged pull request https://github.com/scylladb/scylla/pull/5311 from
Juliusz Stasiewicz:

This is a partial solution to #5139 (only for two types) because of the
above and because collections are much harder to do. They are coming in
a separate PR.
2019-11-28 14:06:43 +02:00
Juliusz Stasiewicz
3b9ebca269 tests/cql_query_test: add test for aggregates on inet+time_type
This tests the max(), min() and count() system functions on
arguments of types `net::inet_address` and `time_native_type`.
2019-11-28 11:20:43 +01:00
Juliusz Stasiewicz
9c23d89531 cql3/functions: add missing min/max/count for inet and time type
References #5139. Aggregate functions, like max(), when invoked
on `inet_address' and `time_native_type' used to choose
max(blob)->blob overload, with casting of argument and result to
bytes. This is because appropriate calls to
`aggregate_fcts::make_XXX_function()' were missing. This commit
adds them. Functioning remains the same but now clients see
user-friendly representations of aggregate result, not binary.

Comparing inet addresses without inet::operator< is done with a
trick: ADL is bypassed by wrapping the name of std::min/max
and providing an overload of the wrapper for the inet type.
2019-11-28 11:18:31 +01:00
Pavel Emelyanov
8532093c61 cql: The cql_server does not need proxy reference
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20191127153842.4098-1-xemul@scylladb.com>
2019-11-28 10:58:46 +01:00
Amos Kong
e2eb754d03 use parse_scylla_dirs_with_default to get scylla directories
Use default data_file_directories/commitlog_directory if it's not assigned
in scylla.yaml

Signed-off-by: Amos Kong <amos@scylladb.com>
2019-11-28 15:48:14 +08:00
Amos Kong
bd265bda4f scylla_io_setup: fix data_file_directories check
Use default data_file_directories if it's not assigned in scylla.yaml

Signed-off-by: Amos Kong <amos@scylladb.com>
2019-11-28 15:47:56 +08:00
Amos Kong
123c791366 scylla_util: introduce helper to process the default scylla directories
Currently we support setting workdir in scylla.yaml, and we use many
hardcoded '/var/lib/scylla' paths in setup scripts.

Some setup scripts get the scylla directories by parsing scylla.yaml. Introduce
parse_scylla_dirs_with_default(), which adds default values if the scylla directories
aren't assigned in scylla.yaml
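
A sketch of what such a defaulting helper could look like. This is not the code of parse_scylla_dirs_with_default(): the function name `dirs_with_defaults`, the dict shape, and the `data` subdirectory name are assumptions made for the example.

```python
import posixpath

DEFAULT_WORKDIR = "/var/lib/scylla"


def dirs_with_defaults(conf):
    """Fill in directory settings that are missing from a parsed scylla.yaml.

    conf: dict of settings (hypothetical shape); the input is not modified.
    """
    result = dict(conf)
    workdir = result.get("workdir") or DEFAULT_WORKDIR
    result["workdir"] = workdir
    if not result.get("data_file_directories"):
        # Assumed default layout: <workdir>/data
        result["data_file_directories"] = [posixpath.join(workdir, "data")]
    if not result.get("commitlog_directory"):
        result["commitlog_directory"] = posixpath.join(workdir, "commitlog")
    return result
```

Values that are explicitly set are kept, and defaults are derived from workdir, so a custom workdir also moves the default data and commitlog locations.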

Signed-off-by: Amos Kong <amos@scylladb.com>
2019-11-28 14:54:32 +08:00
Amos Kong
b75061b4bc scylla_util: get workdir by datadir() if it's not assigned in scylla.yaml
Signed-off-by: Amos Kong <amos@scylladb.com>
2019-11-28 14:38:01 +08:00
Amos Kong
ada0e92b85 scylla_io_setup: fix path join of default scylla directories
Currently we are checking invalid paths for some default scylla directories.
The directories don't exist, so the tuning is always skipped. This is caused by
two problems.

Problem 1: paths of default directories are invalid

Introduced by commit 5ec191536e, we try to tune some scylla default directories
if they exist. But the directory paths we try are wrong.

For example:
- What we check: /var/lib/scylla/commitlog_directory
- Correct one: /var/lib/scylla/commitlog

Problem 2: wrong path join

Introduced by commit 31ddb2145a, default_path might be replaced from
'/var/lib/scylla/' to '/var/lib/scylla'.

Our code tries to check an invalid path that is wrongly joined, e.g.:
'/var/lib/scyllacommitlog'
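
Problem 2 is easy to reproduce in a few lines of Python. The variable names are illustrative, not taken from scylla_io_setup; `posixpath.join` is used so that the result does not depend on the platform the snippet runs on.

```python
import posixpath

default_path = "/var/lib/scylla"  # no trailing slash any more

# Plain string concatenation silently depends on the trailing slash.
broken = default_path + "commitlog"

# A path join inserts the separator only when it is needed.
fixed = posixpath.join(default_path, "commitlog")
also_fixed = posixpath.join("/var/lib/scylla/", "commitlog")
```

Both joined forms produce the same path whether or not the base directory ends with a slash, while the concatenated form only works when it does.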

Signed-off-by: Amos Kong <amos@scylladb.com>
2019-11-28 14:37:58 +08:00
Amos Kong
d4a26f2ad0 scylla_util: get_scylla_dirs: return default data/commitlog directories if they aren't set (#5358)
The default values of data_file_directories and commitlog_directory were
commented out by commit e0f40ed16a. This causes scylla_util.py:get_scylla_dirs() to
fail when checking the values.

This patch changed get_scylla_dirs() to return default data/commitlog
directories if they aren't set.

Fixes #5358 

Reviewed-by: Pavel Emelyanov <xemul@scylladb.com>
Signed-off-by: Amos Kong <amos@scylladb.com>
2019-11-27 13:52:05 +02:00
Nadav Har'El
cb1ed5eab2 alternator-test: test Query's Limit parameter
Add a test, test_query.py::test_query_limit, to verify that the Limit
parameter correctly limits the number of rows returned by the Query.
This was supposed to already work correctly - but we never had a test for
it. As we hoped, the test passes (on both Alternator and DynamoDB).

Another test, test_query.py::test_query_limit_paging, verifies that
paging can be done with any setting of Limit. We already had tests
for paging of the Scan operation, but not for the Query operation.

Refs #5153

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2019-11-27 12:27:26 +01:00
Nadav Har'El
c01ca661a0 alternator-test: Select parameter of Query and Scan
This is a comprehensive test for the "Select" parameter of Query and Scan
operations, but only for the base-table case, not index, so another future
patch should add similar tests in test_gsi.py and test_lsi.py as well.

The main use of the Select parameter is to allow returning just the count
of items, instead of their content, but it also has other esoteric options,
all of which we test here.

The test currently succeeds on AWS DynamoDB, demonstrating that the test
is correct, but fails on Alternator because the "Select" parameter is not
yet supported. So the test is marked xfail.

Refs #5058

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2019-11-27 12:22:33 +01:00
Botond Dénes
9d09f57ba5 scylla-gdb.py: scylla_smp_queues: use lazy initialization
Currently the command tries to read all seastar smp queues in its
initialization code in the constructor. This constructor is run each
time `scylla-gdb.py` is sourced in `gdb` which leads to slowdowns and
sometimes also annoying errors because the sourcing happens in the wrong
context and seastar symbols are not available.
Avoid this by running this initializing code lazily, on the first
invocation.
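
The lazy-initialization pattern can be sketched without gdb. The class below is hypothetical and does not use the gdb Python API; `loader` stands in for the expensive code that reads the seastar smp queues.

```python
class LazyCommand:
    """Hypothetical command object that defers expensive setup."""

    def __init__(self, loader):
        # The constructor runs every time the script is sourced, so it
        # must stay cheap and must not touch symbols that may be missing.
        self._loader = loader
        self._queues = None

    def invoke(self):
        if self._queues is None:
            # First invocation: do the expensive work now, once.
            self._queues = self._loader()
        return self._queues
```

Sourcing the script then costs nothing, and any failure to read the queues is reported only when the command is actually used.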

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20191127095408.112101-1-bdenes@scylladb.com>
2019-11-27 12:04:57 +01:00
Tomasz Grabiec
87b72dad3e Merge "treewide: add missing const qualifiers" from Pavel Solodovnikov
This patchset adds missing "const" function qualifiers throughout
the Scylla code base, which would make code less error-prone.

The changeset incorporates Kostja's work regarding const qualifiers
in the cql code hierarchy along with a follow-up patch addressing the
review comment of the corresponding patch set (the patch subject is
"cql: propagate const property through prepared statement tree.").
2019-11-27 10:56:20 +01:00
Rafael Ávila de Espíndola
91b43f1f06 dbuild: fix podman with selinux enabled
With this change I am able to run tests using docker-podman. The
option also exists in docker.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20191126194101.25221-1-espindola@scylladb.com>
2019-11-26 21:50:56 +02:00
Rafael Ávila de Espíndola
480055d3b5 dbuild: Fix missing docker options
With the recent changes docker was missing a few options. In
particular, it was missing -u.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20191126194347.25699-1-espindola@scylladb.com>
2019-11-26 21:45:31 +02:00
Rafael Ávila de Espíndola
c0a2cd70ff lua: fix test with boost 1.66
The boost 1.67 release notes say

Changed maximum supported year from 10000 to 9999 to resolve various issues

So change the test to use a larger number so that we get an exception
with both boost 1.66 and boost 1.67.

Fixes #5344

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20191126180327.93545-1-espindola@scylladb.com>
2019-11-26 21:17:15 +02:00
Pavel Solodovnikov
55a1d46133 cql: some more missing const qualifiers
There are several virtual functions in public interfaces named "is_*"
that clearly should be marked as "const", so fix that.
2019-11-26 17:57:51 +03:00
Pavel Solodovnikov
412f1f946a cql: remove "mutable" on _opts in select_statement
_opts initialization can be safely done in the constructor, hence no need to make it mutable.
2019-11-26 17:55:10 +03:00
Piotr Sarna
d90dbd6ab0 Merge "support podman as a replacement to docker" from Avi
Docker on Fedora 31 is flaky, and is not supported at all on RHEL 8.
Podman is a drop-in replacement for docker; this series adds support
for using podman in dbuild.

Apart from actually working on Fedora 31 hosts,
podman is nicer in being more secure and not requiring a daemon.

Fixes #5332
2019-11-26 15:17:49 +01:00
Tomasz Grabiec
5c9fe83615 Merge "Sanitize sub-modules shutting down" from Pavel
As suggested in issue #4586, here is a helper that prints a
"shutting down foo" message, then shuts foo down, then
prints the "[it] was successful" one. In between it catches
the exception (if any) and logs a warning about it.

By "then" I mean literally then, not the seastar's then() :)
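
The sequence described above can be sketched in Python. This is an analogue for illustration, not the C++ helper from main.cc: the message wording, the logger name, and the choice to swallow the exception and return a flag are assumptions, since the text does not say whether the real helper rethrows.

```python
import logging

log = logging.getLogger("shutdown_sketch")


def verbose_shutdown(name, stop):
    """Log, run the stop callable, then log the outcome.

    Returns True on success and False if stop() raised.
    """
    log.info("Shutting down %s", name)
    try:
        stop()
    except Exception as exc:
        # Assumed behaviour: warn and carry on with the remaining shutdowns.
        log.warning("Shutting down %s failed: %s", name, exc)
        return False
    log.info("Shutting down %s was successful", name)
    return True
```

Each subsystem then needs only a single call such as `verbose_shutdown("database", db.stop)`, where `db.stop` is a placeholder, instead of repeating the logging around every shutdown step.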

Fixes: #4586
2019-11-26 15:14:22 +02:00
Piotr Sarna
9c5a5a5ac2 treewide: add names to semaphores
By default, semaphore exceptions bring along very little context:
either that a semaphore was broken or that it timed out.
In order to make debugging easier without introducing significant
runtime costs, a notion of named semaphore is added.
A named semaphore is simply a semaphore with statically defined
name, which is present in its errors, bringing valuable context.
A semaphore defined as:

  auto sem = semaphore(0);

will present the following message when it breaks:
"Semaphore broken"
However, a named semaphore:

  auto named_sem = named_semaphore(0, named_semaphore_exception_factory{"io_concurrency_sem"});

will present a message with at least some debugging context:

  "Semaphore broken: io_concurrency_sem"

It's not much, but it would really help in pinpointing bugs
without having to inspect core dumps.

At the same time, it does not incur any costs for normal
semaphore operations (except for its creation), but instead
only uses more CPU in case an error is actually thrown,
which is considered rare and not to be on the hot path.

Refs #4999

Tests: unit(dev), manual: hardcoding a failure in view building code
2019-11-26 15:14:21 +02:00
Avi Kivity
6fbb724140 conf: remove unsupported options from scylla.yaml (#5299)
These unsupported options do nothing except to confuse users who
try to tune them.

Options removed:

hinted_handoff_throttle_in_kb
max_hints_delivery_threads
batchlog_replay_throttle_in_kb
key_cache_size_in_mb
key_cache_save_period
key_cache_keys_to_save
row_cache_size_in_mb
row_cache_save_period
row_cache_keys_to_save
counter_cache_size_in_mb
counter_cache_save_period
counter_cache_keys_to_save
memory_allocator
saved_caches_directory
concurrent_reads
concurrent_writes
concurrent_counter_writes
file_cache_size_in_mb
index_summary_capacity_in_mb
index_summary_resize_interval_in_minutes
trickle_fsync
trickle_fsync_interval_in_kb
internode_authenticator
native_transport_max_threads
native_transport_max_concurrent_connections
native_transport_max_concurrent_connections_per_ip
rpc_server_type
rpc_min_threads
rpc_max_threads
rpc_send_buff_size_in_bytes
rpc_recv_buff_size_in_bytes
internode_send_buff_size_in_bytes
internode_recv_buff_size_in_bytes
thrift_framed_transport_size_in_mb
concurrent_compactors
compaction_throughput_mb_per_sec
sstable_preemptive_open_interval_in_mb
inter_dc_stream_throughput_outbound_megabits_per_sec
cross_node_timeout
streaming_socket_timeout_in_ms
dynamic_snitch_update_interval_in_ms
dynamic_snitch_reset_interval_in_ms
dynamic_snitch_badness_threshold
request_scheduler
request_scheduler_options
throttle_limit
default_weight
weights
request_scheduler_id
2019-11-26 15:14:21 +02:00
Amos Kong
817f34d1a9 ami: support new aws instance types: c5d, m5d, m5ad, r5d, z1d (#5330)
Currently scylla_io_setup is skipped in scylla_setup, because we don't support
those new instance types.

I manually executed scylla_io_setup, and the scylla-server started and worked
well.

Let's apply this patch first, then check if there is some new problem in
ami-test.

Signed-off-by: Amos Kong <amos@scylladb.com>
2019-11-26 15:14:21 +02:00
Konstantin Osipov
90346236ac cql: propagate const property through prepared statement tree.
cql_statement is a class representing a prepared statement in Scylla.
It is used concurrently during execution, so it is important that its
state is not changed by execution.

Add const qualifier to the execution methods family, throughout the
cql hierarchy.

Mark a few places which do mutate prepared statement state during
execution as mutable. While these are not affecting production today,
as code ages, they may become a source of latent bugs and should be
moved out of the prepared state or evaluated at prepare eventually:

cf_property_defs::_compaction_strategy_class
list_permissions_statement::_resource
permission_altering_statement::_resource
property_definitions::_properties
select_statement::_opts
2019-11-26 14:18:17 +03:00
Pavel Solodovnikov
2f442f28af treewide: add const qualifiers throughout the code base 2019-11-26 02:24:49 +03:00
Pavel Emelyanov
50a1ededde main: Remove now unused defer-with-log helper
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2019-11-25 18:47:03 +03:00
Pavel Emelyanov
a0f92d40ee main: Shut down sighup handler with verbose helper
And (!) fix the misprinted variable name.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2019-11-25 18:47:03 +03:00
Pavel Emelyanov
0719369d83 repair: Remove extra logging on shutdown
The shutdown start/finish messages are already printed in verbose_shutdown()

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2019-11-25 18:47:03 +03:00
Pavel Emelyanov
2d64fc3a3e main: Shut down database with verbose_shutdown helper
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2019-11-25 18:47:03 +03:00
Pavel Emelyanov
636c300db5 main: Shut down prometheus with verbose_shutdown()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

---

v2:
- Have stop earlier so that an exception in start/listen does
  not prevent prometheus.stop from being called
2019-11-25 18:47:03 +03:00
Pavel Emelyanov
804b152527 main: Sanitize shutting down callbacks
As suggested in issue #4586, here is a helper that prints a
"shutting down foo" message, then shuts foo down, then
prints "shutting down foo was successful". In between
it catches the exception (if any) and logs a warning.
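
The helper itself is C++ and is not shown in this message. A minimal Python
sketch of the same log/stop/log pattern follows; the name verbose_shutdown is
taken from later commits, while the signature, the exact log strings and the
choice to re-raise after logging are assumptions made for illustration.

```python
import logging

log = logging.getLogger("init")


def verbose_shutdown(what, stop):
    """Log before and after calling stop(); log a warning if it fails."""
    log.info("Shutting down %s", what)
    try:
        stop()
    except Exception as exc:
        # Assumption: the failure is logged and then propagated to the caller.
        log.warning("Shutting down %s failed: %s", what, exc)
        raise
    log.info("Shutting down %s was successful", what)
```

A caller would wrap each subsystem's stop routine, for example
verbose_shutdown("database", db.stop), so every shutdown step leaves a
start/finish pair in the log.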

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2019-11-25 18:45:49 +03:00
Nadav Har'El
4160b3630d Merge "Return preimage from CDC only when it's enabled"
Merged pull request https://github.com/scylladb/scylla/pull/5218
from Piotr Jastrzębski:

Users should be able to decide whether they need preimage or not. There is
already an option for that but it's not respected by the implementation.
This PR adds support for this functionality.

Tests: unit(dev).

Individual patches:
  cdc: Don't take storage_proxy as transformer::pre_image_select param
  cdc::append_log_mutations: use do_with instead of shared_ptr
  cdc::append_log_mutations: fix undefined behavior
  cdc: enable preimage in test_pre_image_logging test
  cdc: Return preimage only when it's requested
  cdc: test both enabled and disabled preimage in test_pre_image_logging
2019-11-25 14:32:17 +02:00
Pavel Emelyanov
f6ac969f1e mm: Stop migration manager
Before stopping the db itself, stop the migration service.
It must be stopped before RPC, but RPC is not stopped yet
itself, so we should be safe here.

Here's the tail of the resulting logs:

INFO  2019-11-20 11:22:35,193 [shard 0] init - shutdown migration manager
INFO  2019-11-20 11:22:35,193 [shard 0] migration_manager - stopping migration service
INFO  2019-11-20 11:22:35,193 [shard 1] migration_manager - stopping migration service
INFO  2019-11-20 11:22:35,193 [shard 0] init - Shutdown database started
INFO  2019-11-20 11:22:35,193 [shard 0] init - Shutdown database finished
INFO  2019-11-20 11:22:35,193 [shard 0] init - stopping prometheus API server
INFO  2019-11-20 11:22:35,193 [shard 0] init - Scylla version 666.development-0.20191120.25820980f shutdown complete.

Also -- stop the mm on drain before the commitlog is stopped.
[Tomasz: mm needs the cl because pulling schema changes from other nodes
involves applying them into the database. So cl/db needs to be
stopped after mm is stopped.]

The drain logs would look like

...
INFO  2019-11-25 11:00:40,562 [shard 0] migration_manager - stopping migration service
INFO  2019-11-25 11:00:40,562 [shard 1] migration_manager - stopping migration service
INFO  2019-11-25 11:00:40,563 [shard 0] storage_service - DRAINED:

and then on stop

...
INFO  2019-11-25 11:00:46,427 [shard 0] init - shutdown migration manager
INFO  2019-11-25 11:00:46,427 [shard 0] init - Shutdown database started
INFO  2019-11-25 11:00:46,427 [shard 0] init - Shutdown database finished
INFO  2019-11-25 11:00:46,427 [shard 0] init - stopping prometheus API server
INFO  2019-11-25 11:00:46,427 [shard 0] init - Scylla version 666.development-0.20191125.3eab6cd54 shutdown complete.

Fixes #5300

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20191125080605.7661-1-xemul@scylladb.com>
2019-11-25 12:59:01 +01:00
Asias He
6ec602ff2c repair: Fix rx_hashes_nr metrics (#5213)
In get_full_row_hashes_with_rpc_stream and
repair_get_row_diff_with_rpc_stream_process_op which were introduced in
the "Repair switch to rpc stream" series, rx_hashes_nr metrics are not
updated correctly.

In the test we have 3 nodes and run repair on node3; we make sure the
following metrics are correct.

assertEqual(node1_metrics['scylla_repair_tx_hashes_nr'] + node2_metrics['scylla_repair_tx_hashes_nr'],
            node3_metrics['scylla_repair_rx_hashes_nr'])
assertEqual(node1_metrics['scylla_repair_rx_hashes_nr'] + node2_metrics['scylla_repair_rx_hashes_nr'],
            node3_metrics['scylla_repair_tx_hashes_nr'])
assertEqual(node1_metrics['scylla_repair_tx_row_nr'] + node2_metrics['scylla_repair_tx_row_nr'],
            node3_metrics['scylla_repair_rx_row_nr'])
assertEqual(node1_metrics['scylla_repair_rx_row_nr'] + node2_metrics['scylla_repair_rx_row_nr'],
            node3_metrics['scylla_repair_tx_row_nr'])
assertEqual(node1_metrics['scylla_repair_tx_row_bytes'] + node2_metrics['scylla_repair_tx_row_bytes'],
            node3_metrics['scylla_repair_rx_row_bytes'])
assertEqual(node1_metrics['scylla_repair_rx_row_bytes'] + node2_metrics['scylla_repair_rx_row_bytes'],
            node3_metrics['scylla_repair_tx_row_bytes'])

Tests: repair_additional_test.py:RepairAdditionalTest.repair_almost_synced_3nodes_test
Fixes: #5339
Backports: 3.2
2019-11-25 13:57:37 +02:00
Piotr Jastrzebski
2999cb5576 cdc: test both enabled and disabled preimage in test_pre_image_logging
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2019-11-25 12:43:39 +01:00
Piotr Jastrzebski
222b94c707 cdc: Return preimage only when it's requested
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2019-11-25 12:43:39 +01:00
Piotr Jastrzebski
c94a5947b7 cdc: enable preimage in test_pre_image_logging test
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2019-11-25 12:43:39 +01:00
Piotr Jastrzebski
595c9f9d32 cdc::append_log_mutations: fix undefined behavior
The code was iterating over a collection that was modified
at the same time. Iterators were used for that and collection
modification can invalidate all iterators.
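
The fix itself is in C++ and is not reproduced here. As a language-neutral
illustration of the general remedy, the Python sketch below separates
iteration from modification; the function name and the list-based data are
hypothetical and are not taken from cdc::append_log_mutations.

```python
def append_derived(items, derive):
    """Append derive(x) for every original x without mutating during iteration."""
    # Build the additions first, while the collection is not being changed.
    pending = [derive(x) for x in items]
    # Modify the collection only after iteration has finished.
    items.extend(pending)
    return items
```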

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2019-11-25 12:43:39 +01:00
Piotr Jastrzebski
f0f44f9c51 cdc::append_log_mutations: use do_with instead of shared_ptr
This will not only save some allocations but also improve
code readability.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2019-11-25 12:43:39 +01:00
Piotr Jastrzebski
b8d9158c21 cdc: Don't take storage_proxy as transformer::pre_image_select param
transformer has access to storage_proxy through its _ctx field.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2019-11-25 12:43:39 +01:00
Nadav Har'El
3eab6cd549 Merged "toolchain: update to Fedora 31"
Merged pull request https://github.com/scylladb/scylla/pull/5310 from
Avi Kivity:

This is a minor update as gcc and boost versions did not change. A notable
update is patchelf 0.10, which adds support for large binaries.

A few minor issues exposed by the update are fixed in preparatory patches.

Patches:
  dist: rpm: correct systemd post-uninstall scriptlet
  build: force xz compression on rpm binary payload
  tools: toolchain: update to Fedora 31
2019-11-24 13:38:45 +02:00
Tomasz Grabiec
e3d025d014 row_cache: Fix abort on bad_alloc during cache update
Since 90d6c0b, cache will abort when trying to detach partition
entries while they're updated. This should never happen. It can happen
though, when the update fails on bad_alloc, because the cleanup guard
invalidates the cache before it releases partition snapshots (held by
"update" coroutine).

Fix by destroying the coroutine first.

Fixes #5327.

Tests:
  - row_cache_test (dev)

Message-Id: <1574360259-10132-1-git-send-email-tgrabiec@scylladb.com>
2019-11-24 12:06:51 +02:00
Rafael Ávila de Espíndola
8599f8205b rpmbuild: don't use dwz
By default rpm uses dwz to merge the debug info from various
binaries. Unfortunately, it looks like addr2line has not been updated
to handle this:

// This works
$ addr2line  -e build/release/scylla 0x1234567

$ dwz -m build/release/common.debug build/release/scylla.debug build/release/iotune.debug

// now this fails
$ addr2line -e build/release/scylla 0x1234567

I think the issue is

https://sourceware.org/bugzilla/show_bug.cgi?id=23652

Fixes #5289

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20191123015734.89331-1-espindola@scylladb.com>
2019-11-24 11:35:29 +02:00
Rafael Ávila de Espíndola
25d5d39b3c reloc: Force using sha1 for build-ids
The default build-id used by lld is xxhash, which is 8 bytes long. rpm
requires build-ids to be at least 16 bytes long
(https://github.com/rpm-software-management/rpm/issues/950). We force
using sha1 for now. That has no impact in gold and bfd since that is
their default. We set it in here instead of configure.py to not slow
down regular builds.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20191123020801.89750-1-espindola@scylladb.com>
2019-11-24 11:35:29 +02:00
Rafael Ávila de Espíndola
b5667b9c31 build: don't compress debug info in executables
By default we were compressing debug info only in release
executables. The idea, if I understand it correctly, is that those are
the ones we ship, so we want a more compact binary.

I don't think that was doing anything useful. The compression is just
gzip, so when we ship a .tar.xz, having the debug info compressed
inside the scylla binary probably reduces the overall compression a
bit.

When building an rpm the situation is amusing. As part of the rpm
build process the debug info is decompressed and extracted to an
external file.

Given that most of the link time goes to compressing debug info, it is
probably a good idea to just skip that.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20191123022825.102837-1-espindola@scylladb.com>
2019-11-24 11:35:29 +02:00
Tomasz Grabiec
d84859475e Merge "Refactor test.py and cleanup resources" from Kostja
Structure the code to be able to introduce futures.
Apply trivial cleanups.
Switch to asyncio and use it to work with processes and
handle signals. Cleanup all processes upon signal.
2019-11-24 11:35:29 +02:00
Tomasz Grabiec
e166fdfa26 Merge "Optimize LWT query phase" from Vladimir Davydov
This patch implements a simple optimization for LWT: it makes PAXOS
prepare phase query locally and return the current value of the modified
key so that a separate query is not necessary. For more details see
patch 6. Patch 1 fixes a bug in next. Patches 2-5 contain trivial
preparatory refactoring.
2019-11-24 11:35:29 +02:00
Pavel Solodovnikov
4879db70a6 system_keyspace: support timeouts in queries to system.paxos table.
Also introduce supplementary `execute_cql_with_timeout` function.

Remove redundant comment for `execute_cql`.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20191121214148.57921-1-pa.solodovnikov@scylladb.com>
2019-11-24 11:35:29 +02:00
Vladimir Davydov
bf5f864d80 paxos: piggyback result query on prepare response
Current LWT implementation uses at least three network round trips:
 - first, execute PAXOS prepare phase
 - second, query the current value of the updated key
 - third, propose the change to participating replicas

(there's also learn phase, but we don't wait for it to complete).

The idea behind the optimization implemented by this patch is simple:
piggyback the current value of the updated key on the prepare response
to eliminate one round trip.

To generate less network traffic, only the replica closest to the
coordinator sends data, while the other participating replicas send
digests, which are used to check data consistency.
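
A minimal sketch of the data-plus-digests idea, assuming the replicas are
already sorted by proximity to the coordinator. The function names, the dict
layout and the use of SHA-256 are illustrative assumptions; the patch's actual
RPC types and digest algorithm are not shown in this message.

```python
import hashlib


def digest(value: bytes) -> bytes:
    # SHA-256 is a stand-in; the real digest algorithm is not stated here.
    return hashlib.sha256(value).digest()


def prepare_replies(replicas_by_proximity):
    """The closest replica returns its data; the others return digests only."""
    closest, *others = replicas_by_proximity
    return closest["value"], [digest(r["value"]) for r in others]


def replies_consistent(data: bytes, digests) -> bool:
    """Check that every digest matches the data sent by the closest replica."""
    expected = digest(data)
    return all(d == expected for d in digests)
```

When the check fails, the coordinator cannot trust the piggybacked value; what
happens next is outside the scope of this sketch.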

Note, this patch changes the API of some RPC calls used by PAXOS, but
this should be okay as long as the feature is in the early development
stage and is marked experimental.

To assess the impact of this optimization on LWT performance, I ran a
simple benchmark that starts a number of concurrent clients each of
which updates its own key (uncontended case) stored in a cluster of
three AWS i3.2xlarge nodes located in the same region (us-west-1) and
measures the aggregate bandwidth and latency. The test uses shard-aware
gocql driver. Here are the results:

                latency 99% (ms)    bandwidth (rq/s)    timeouts (rq/s)
    clients     before  after       before  after       before  after
          1          2      2          626    637            0      0
          5          4      3         2616   2843            0      0
         10          3      3         4493   4767            0      0
         50          7      7        10567  10833            0      0
        100         15     15        12265  12934            0      0
        200         48     30        13593  14317            0      0
        400        185     60        14796  15549            0      0
        600        290     94        14416  15669            0      0
        800        568    118        14077  15820            2      0
       1000        710    118        13088  15830            9      0
       2000       1388    232        13342  15658           85      0
       3000       1110    363        13282  15422          233      0
       4000       1735    454        13387  15385          329      0

That is, this optimization improves max LWT bandwidth by about 15%
and allows running 3-4x more clients while maintaining the same level
of system responsiveness.
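
The message does not say how the summary figure was derived. The following
sketch is one way to read the bandwidth columns of the table above: it
computes the per-row gain and the peak-to-peak gain so the rows can be
compared directly. The numbers are copied from the table; the variable and
function names are illustrative.

```python
# Bandwidth (rq/s) from the table above: clients -> (before, after).
bandwidth = {
    1: (626, 637), 5: (2616, 2843), 10: (4493, 4767), 50: (10567, 10833),
    100: (12265, 12934), 200: (13593, 14317), 400: (14796, 15549),
    600: (14416, 15669), 800: (14077, 15820), 1000: (13088, 15830),
    2000: (13342, 15658), 3000: (13282, 15422), 4000: (13387, 15385),
}


def gain_percent(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


# Gain at each client count, comparing the same row before and after.
per_row = {c: gain_percent(b, a) for c, (b, a) in bandwidth.items()}

# Gain between the best "before" figure and the best "after" figure.
peak_before = max(b for b, _ in bandwidth.values())
peak_after = max(a for _, a in bandwidth.values())
peak_gain = gain_percent(peak_before, peak_after)

for clients, gain in per_row.items():
    print(f"{clients:>5} clients: {gain:5.1f}% more rq/s")
print(f"peak vs. peak: {peak_gain:.1f}%")
```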
2019-11-24 11:35:29 +02:00
Rafael Ávila de Espíndola
6160b9017d commitlog: make sure a file is closed
If allocate or truncate throws, we have to close the file.
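
The commitlog code is C++; the sketch below shows the same
close-on-failure pattern in Python. The function name is hypothetical, and
os.ftruncate stands in for the allocate/truncate step.

```python
import os


def open_preallocated(path, size):
    """Open path and set its size; close the descriptor if sizing fails."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        os.ftruncate(fd, size)  # stands in for the allocate/truncate step
    except BaseException:
        os.close(fd)  # do not leak the descriptor when sizing fails
        raise
    return fd
```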

Fixes #4877

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20191114174810.49004-1-espindola@scylladb.com>
2019-11-24 11:35:29 +02:00
Vladimir Davydov
3d1d4b018f paxos: remove unnecessary move constructor invocations
invoke_on() guarantees that captured objects won't be destroyed until the
future returned by the invoked function is resolved, so there's no need
to move key, token, proposal for calling paxos_state::*_impl helpers.
2019-11-24 11:35:29 +02:00
Rafael Ávila de Espíndola
cfb079b2c9 types: Refactor duplicated value_cast implementation
The two implementations of value_cast were almost identical.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20191120181213.111758-3-espindola@scylladb.com>
2019-11-24 11:35:29 +02:00
Vladimir Davydov
ef2e96c47c storage_proxy: factor out helper to sort endpoints by proximity
We need it for PAXOS.
2019-11-24 11:35:29 +02:00
Nadav Har'El
854e6c8d7b alternator-test: test_health_only_works_for_root_path: remove wrong check
The test_health_only_works_for_root_path test checks that while Alternator's
HTTP server responds to a "GET /" request with success ("health check"), it
should respond to different URLs with failures (page not found).

One of the URLs it tested was "/..", but unfortunately some versions of
Python's HTTP client canonicalize this request to just a "/", causing the
request to unexpectedly succeed - and the test to fail.

So this patch just drops the "/.." check. A few other nonsense URLs are
attempted by the test - e.g., "/abc".

Fixes #5321

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2019-11-24 11:35:29 +02:00
Vladimir Davydov
63d4590336 storage_proxy: move digest_algorithm upper
We need it for PAXOS.

Mark it as static inline while we are at it.
2019-11-24 11:35:29 +02:00
Nadav Har'El
43d3e8adaf alternator: make DescribeTable return table schema
One of the fields still missing in DescribeTable's response (Refs #5026)
was the table's schema - KeySchema and AttributeDefinitions.

This patch adds this missing feature, and enables the previously-xfailing
test test_describe_table_schema.

A complication of this patch is that in a table with secondary indexes,
we need to return not just the base table's schema, but also the indexes'
schema. The existing tests did not cover that feature, so we add here
two more tests in test_gsi.py for that.

One of these secondary-index schema tests, test_gsi_2_describe_table_schema,
still fails, because it outputs a range-key which Scylla added to a view
because of its own implementation needs, but wasn't in the user's
definition of the GSI. I opened a separate issue #5320 for that.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2019-11-24 11:35:29 +02:00
Vladimir Davydov
f5c2a23118 serializer: add reference_wrapper handling
Serialize reference_wrapper<T> as T and make sure is_equivalent<> treats
reference_wrapper<T> wrapped in std::optional<> or std::variant<>, or
std::tuple<> as T.

We need it to avoid copying query::result while serializing
paxos::promise.
2019-11-24 11:35:29 +02:00
Botond Dénes
89f9b89a89 scylla-gdb.py: scylla task_histogram: scan all tasks with -a or -s 0
Currently even if `-a` or `-s 0` is provided, `scylla task_histogram`
will scan a limited number of pages due to a bug in the scan loop's stop
condition, which will trigger a stop once the default sample limit is
reached. Fix the loop by skipping this check when the user wants to scan
all tasks.
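
A minimal sketch of the corrected stop condition. The function name, the
parameters and the default limit are hypothetical and do not come from
scylla-gdb.py; only the idea of skipping the limit check when everything
should be scanned is taken from the description above.

```python
DEFAULT_SAMPLES = 20000  # hypothetical default, not the real value


def scan(pages, samples=DEFAULT_SAMPLES, scan_all=False):
    """Count objects page by page; stop at the limit unless scanning everything."""
    unlimited = scan_all or samples == 0
    seen = 0
    for page in pages:
        seen += len(page)
        # The limit check is skipped entirely when the user asked for all tasks.
        if not unlimited and seen >= samples:
            break
    return seen
```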

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20191121141706.29476-1-bdenes@scylladb.com>
2019-11-24 11:35:29 +02:00
Vladimir Davydov
1452653fbc query_context: fix use after free of timeout_config in execute_cql_with_timeout
timeout_config is used by reference by cql3::query_processor::process(),
see cql3::query_options, so the caller must make sure it doesn't go away.
2019-11-24 11:35:29 +02:00
Avi Kivity
ff7e78330c tools: toolchain: dbuild: work around "podman logs --follow" hang
At least some versions of 'podman logs --follow' hang when the
container eventually exits (also happens with docker on recent
versions). Fortunately, we don't need to use 'podman logs --follow'
and can use the more natural non-detached 'podman run', because
podman does not proxy SIGTERM and instead shuts down the container
when it receives it.

So, to work around the problem, use the same code path in interactive
and non-interactive runs, when podman is in use instead of docker.
2019-11-22 13:59:05 +02:00
Avi Kivity
702834d0e4 tools: dbuild: avoid uid/gid/selinux hacks when using podman
With docker, we went to considerable lengths to ensure that
access to mounted volume was done using the calling user, including
supplementary groups. This avoids root-owned files being left around
after a build, and ensures that access to group-shared files (like
/var/cache/ccache) works as expected.

All of this is unnecessary and broken when using podman. Podman
uses a proxy to access files on behalf of the container, so naturally
all access is done using the calling user's identity. Since it remaps
user and group IDs, assigning the host uid/gid is meaningless. Using
--userns host also breaks, because sudo no longer works.

Fix this by making all the uid/gid/selinux games specific to docker and
ignore them when using podman. To preserve the functionality of tools
that depend on $HOME, set that according to the host setting.
2019-11-22 13:58:29 +02:00
Tomasz Grabiec
9d7f8f18ab database: Avoid OOMing with flush continuations after failed memtable flush
The original fix (10f6b125c8) didn't
take into account that if there was a failed memtable flush (Refs
flush) but is not a flushable memtable because it's not the latest in
the memtable list. If that happens, it means no other memtable is
flushable as well, because otherwise it would be picked due to
evictable_occupancy(). Therefore the right action is to not flush
anything in this case.

Suspected to be observed in #4982. I didn't manage to reproduce after
triggering a failed memtable flush.

Fixes #3717
2019-11-22 12:08:36 +01:00
Tomasz Grabiec
fb28543116 lsa: Introduce operator bool() to occupancy_stats 2019-11-22 12:08:28 +01:00
Tomasz Grabiec
a69fda819c lsa: Expose region_impl::evictable_occupancy in the region class 2019-11-22 12:08:10 +01:00
Avi Kivity
1c181c1b85 tools: dbuild: don't mount duplicate volumes
podman refuses to start with duplicate volumes, which routinely
happen if the toplevel directory is the working directory. Detect
this and avoid the duplicate.
2019-11-22 10:13:30 +02:00
Konstantin Osipov
b8b5834cf1 test.py: simplify message output in run_test() 2019-11-21 23:16:22 +03:00
Konstantin Osipov
90a8f79d7e test.py: use UnitTest class where possible 2019-11-21 23:16:22 +03:00
Konstantin Osipov
8cd8cfc307 test.py: rename harness command line arguments to 'options'
The UnitTest class juggles the name 'args' quite a bit to
construct the command line for a unit test, so let's keep
the harness command line arguments and the unit test command line
arguments apart by consistently calling the harness command line
arguments 'options', and unit test command line arguments 'args'.

Rename usage() to parse_cmd_line().
2019-11-21 23:16:22 +03:00
Konstantin Osipov
e5d624d055 test.py: consolidate argument handling in UnitTest constructor
Create unique UnitTest objects in find_tests() for each found match,
including repeat, to ensure each test has its own unique id.
This will also be used to store execution state in the test.
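
A rough sketch of the idea, not the code from test.py: every match and every
repetition gets its own object with its own id. The attribute names and the
per-object XML file name are assumptions made for illustration.

```python
import itertools


class UnitTest:
    _ids = itertools.count(1)  # shared counter, one id per object

    def __init__(self, name, args):
        self.id = next(UnitTest._ids)
        self.name = name
        self.args = args
        # Hypothetical: a per-object file name, so repeated runs of the
        # same test do not write to the same report file.
        self.xml_name = f"{name}.{self.id}.xunit.xml"


def find_tests(matches, repeat):
    """Create a separate UnitTest for every match and every repetition."""
    return [UnitTest(name, args) for name, args in matches for _ in range(repeat)]
```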
2019-11-21 23:16:22 +03:00
Konstantin Osipov
dd60673cef test.py: move --collectd to standard args 2019-11-21 23:16:22 +03:00
Konstantin Osipov
fe12f73d7f test.py: introduce class UnitTest 2019-11-21 23:16:22 +03:00
Konstantin Osipov
bbcdee37f7 test.py: add add_test_list() to find_tests() 2019-11-21 23:16:22 +03:00
Konstantin Osipov
4723afa09c test.py: add long tests with add_test() 2019-11-21 23:16:22 +03:00
Konstantin Osipov
13f1e2abc6 test.py: store the non-default seastar arguments along with definition 2019-11-21 23:16:22 +03:00
Konstantin Osipov
72ef11eb79 test.py: introduce add_test() to find_tests()
To avoid code duplication, and to build upon later.
2019-11-21 23:16:22 +03:00
Konstantin Osipov
b50b24a8a7 test.py: avoid an unnecessary loop in find_tests() 2019-11-21 23:16:22 +03:00
Konstantin Osipov
a5103d0092 test.py: move args.repeat processing to find_tests()
It somewhat stands in the way of using asyncio.

This patch also implements a more comprehensive
fix for #5303, since we not only have --repeat, but
run some tests in different configurations, in which
case xml output is also overwritten.
2019-11-21 23:16:22 +03:00
Konstantin Osipov
0f0a49b811 test.py: introduce print_summary() and write_xunit_report()
(One more moving of the code around).
2019-11-21 23:16:22 +03:00
Konstantin Osipov
22166771ef test.py: rename test_to_run tests_to_run 2019-11-21 23:16:22 +03:00
Konstantin Osipov
1d94d9827e test.py: introduce run_all_tests() 2019-11-21 23:16:22 +03:00
Konstantin Osipov
29087e1349 test.py: move out run_test() routine
(Trivial code refactoring.)
2019-11-21 23:16:22 +03:00
Konstantin Osipov
79506fc5ab test.py: introduce find_tests()
Trivial code refactoring.
2019-11-21 23:16:22 +03:00
Konstantin Osipov
a44a1c4124 test.py: remove print_status_succint
(Trivial code cleanup.)
2019-11-21 23:16:22 +03:00
Konstantin Osipov
b9605c1d37 test.py: move mode list evaluation to usage() 2019-11-21 23:16:22 +03:00
Konstantin Osipov
0c4df5a548 test.py: add usage() 2019-11-21 23:16:22 +03:00
Pavel Emelyanov
e0f40ed16a cli: Add the --workdir|-W option
When starting scylla daemon as non-root the initialization fails
because standard /var/lib/scylla is not accessible by regular users.
Making the default dir accessible for user is not very convenient
either, as it will cause conflicts if two or more instances of scylla
are in use.

This problem can be resolved by specifying --commitlog-directory,
--data-file-directories, etc on start, but it's too much typing. I
propose to revive Nadav's --home option that allows moving all the
directories under the same prefix in one go.

Unlike Nadav's approach the --workdir option doesn't do any tricky
manipulations with existing directories. Instead, as Pekka suggested,
the individual directories are placed under the workdir if and only
if the respective option is NOT provided. Otherwise the directory
configuration is taken as is regardless of whether it's an absolute or
relative path.
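
The rule above can be sketched as follows. The option keys and subdirectory
names are hypothetical placeholders, not the names used by Scylla's
configuration code.

```python
import os

# Hypothetical option keys and subdirectory names, for illustration only.
SUBDIRS = {
    "commitlog_directory": "commitlog",
    "data_file_directory": "data",
    "hints_directory": "hints",
}


def resolve_directories(workdir, provided):
    """Place a directory under workdir only if its own option was not provided."""
    resolved = {}
    for option, subdir in SUBDIRS.items():
        if option in provided:
            # Taken as is, whether the path is absolute or relative.
            resolved[option] = provided[option]
        else:
            resolved[option] = os.path.join(workdir, subdir)
    return resolved
```

With this rule, an explicitly provided relative path is never rewritten to sit
under the work directory.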

The value substitution is done early on start. Avi suggested that
this is unsafe wrt HUP config re-read and proper paths must be
resolved on the fly, but this patch doesn't address that yet, here's
why.

First of all, the respective options are MustRestart now and the
substitution is done before HUP handler is installed.

Next, commitlog and data_file values are copied on start, so marking
the options as LiveUpdate won't have any effect.

Finally, the existing named_value::operator() returns a reference,
so returning a calculated (and thus temporary) value is not possible
(from my current understanding, correct me if I'm wrong). Thus if we
want the *_directory() to return a calculated value, all of their callers
must be patched to call something different (e.g. *_directory.get() ?)
which will lead to more confusion and errors.

Changes v3:
 - the option is --workdir back again
 - the existing *directory are only affected if unset
 - default config doesn't have any of these set
 - added the short -W alias

Changes v2:
 - the option is --home now
 - all other paths are changed to be relative

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20191119130059.18066-1-xemul@scylladb.com>
2019-11-21 15:07:39 +02:00
Rafael Ávila de Espíndola
5417c5356b types: Move get_castas_fctn to cql3
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20191120181213.111758-9-espindola@scylladb.com>
2019-11-21 12:08:50 +02:00
Rafael Ávila de Espíndola
f06d6df4df types: Simplify casts to string
These now just use the to_string member functions, which makes it
possible to move the code to another file.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20191120181213.111758-8-espindola@scylladb.com>
2019-11-21 12:08:50 +02:00
Rafael Ávila de Espíndola
786b1ec364 types: Move json code to its own file
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20191120181213.111758-7-espindola@scylladb.com>
2019-11-21 12:08:49 +02:00
Rafael Ávila de Espíndola
af8e207491 types: Avoid using deserialize_value in json code
This makes it independent of internal functions and makes it possible
to move it to another file.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20191120181213.111758-6-espindola@scylladb.com>
2019-11-21 12:08:49 +02:00
Rafael Ávila de Espíndola
ed65e2c848 types: Move cql3_kind to the cql3 directory
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20191120181213.111758-5-espindola@scylladb.com>
2019-11-21 12:08:47 +02:00
Rafael Ávila de Espíndola
bd560e5520 types: Fix dynamic types of some data_value objects
I found these mismatched types while converting some member functions
to standalone functions, since they have to use the public API that
has more type checks.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20191120181213.111758-4-espindola@scylladb.com>
2019-11-21 12:08:46 +02:00
Rafael Ávila de Espíndola
0d953d8a35 types: Add a test for value_cast
We had no tests on when value_cast throws or when it moves the value.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20191120181213.111758-2-espindola@scylladb.com>
2019-11-21 12:08:45 +02:00
Konstantin Osipov
002ff51053 lua: make sure the latest master builds on Debian/Ubuntu
Use pkg-config to search for Lua dependencies rather
than hard-code include and link paths.

Avoid using boost internals, not present in earlier
versions of boost.

Reviewed-by: Rafael Avila de Espindola <espindola@scylladb.com>
Message-Id: <20191120170005.49649-1-kostja@scylladb.com>
2019-11-21 07:57:12 +02:00
Pavel Solodovnikov
d910899d61 configure.py: support multi-threaded linking via gold
Use `-Wl,--threads` flag to enable multi-threaded linking when
using `ld.gold` linker.

Additional compilation test is required because it depends on whether
or not the `gold` linker has been compiled with `--enable-threads` option.

This patch introduces a substantial improvement to the link times of
`scylla` binary in release and debug modes (around 30 percent).

Local setup reports the following numbers with release build for
linking only build/release/scylla:

Single-threaded mode:
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:09.30
Multi-threaded mode:
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:51.57

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20191120163922.21462-1-pa.solodovnikov@scylladb.com>
2019-11-20 19:28:00 +02:00
Nadav Har'El
89d6d668cb Merge "Redis API in Scylla"
Merged patch series from Peng Jian, adding optionally-enabled Redis API
support to Scylla. This feature is experimental, and partial - the extent
of this support is detailed in docs/redis/redis.md.

Patches:
   Document: add docs/redis/redis.md
   redis: Redis API in Scylla
   Redis API: graft redis module to Scylla
   redis-test: add test cases for Redis API
2019-11-20 16:59:13 +02:00
Piotr Sarna
086e744f8f scripts/find-maintainer: refresh maintainers list
This commit attempts to bring the maintainers list up to date
to the best of my knowledge, because it got really stale over time.

Message-Id: <eab6d3f481712907eb83e91ed2b8dbfa0872155f.1574261533.git.sarna@scylladb.com>
2019-11-20 16:56:31 +02:00
Glauber Costa
73aff1fc95 api: export system uptime via REST
This will be useful for tools like nodetool that want to query the uptime
of the system.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190619110850.14206-1-glauber@scylladb.com>
2019-11-20 16:44:11 +02:00
Tomasz Grabiec
9a686ac551 Merge "scylla-gdb: active sstables: support k_l/mc sstable readers" from Benny
Fixes #5277
2019-11-19 23:49:39 +01:00
Avi Kivity
1164ff5329 tools: toolchain: update to Fedora 31
This is a minor update as gcc and boost versions do not change.

glibc-langpack-en no longer gets pulled in by default. As it is required
by some locale use somewhere, it is added to the explicit dependencies.
2019-11-20 00:08:30 +02:00
Avi Kivity
301c835cbf build: force xz compression on rpm binary payload
Fedora 31 switched the default compression to zstd, which isn't readable
by some older rpm distributions (CentOS 7 in particular). Tell it to use
the older xz compression instead, so packages produced on Fedora 31 can
be installed on older distributions.
2019-11-20 00:08:24 +02:00
Avi Kivity
3ebd68ef8a dist: rpm: correct systemd post-uninstall scriptlet
The post-uninstall scriptlet requires a parameter, but older versions
of rpm survived without it. Fedora 31's rpm is more strict, so supply
this parameter.
2019-11-20 00:03:49 +02:00
Peng Jian
e6adddd8ef redis-test: add test cases for Redis API
Signed-off-by: Peng Jian <pengjian.uestc@gmail.com>
Signed-off-by: Amos Kong <amos@scylladb.com>
2019-11-20 04:56:16 +08:00
Peng Jian
f2801feb66 Redis API: graft redis module to Scylla
In this document, the detailed design and implementation of Redis API in
Scylla is provided.

v2: build: work around ragel 7 generated code bug (suggested by Avi)
    Ragel 7 incorrectly emits some unused variables that don't compile.
    As a workaround, sed them away.

Signed-off-by: Peng Jian <pengjian.uestc@gmail.com>
Signed-off-by: Amos Kong <amos@scylladb.com>
2019-11-20 04:55:58 +08:00
Peng Jian
0737d9e84d redis: Redis API in Scylla
Scylla has many advantages and useful features. If Redis is built on top of
Scylla, it gets these features automatically. Scylla has achieved great progress
in cluster master management, data persistence, failover and replication.

The benefits to users are ease of use and development in their production
environment, and taking advantage of Scylla.

Using Ragel to parse the Redis request, the server obtains the command name
and the parameters from the request, invokes Scylla's internal API to
read and write the data, then replies to the client.

Signed-off-by: Peng Jian <pengjian.uestc@gmail.com>
2019-11-20 04:55:56 +08:00
Peng Jian
708a42c284 Document: add docs/redis/redis.md
In this document, the detailed design and implementation of Redis API in
Scylla is provided.

Signed-off-by: Peng Jian <pengjian.uestc@gmail.com>
2019-11-20 04:46:33 +08:00
Nadav Har'El
9b9609c65b merge: row_marker: correct row expiry condition
Merged patch set by Piotr Dulikowski:

This change corrects the condition under which a row was considered expired
by its TTL.

The logic that decides when a row becomes expired was inconsistent with the
logic that decides if a single cell is expired. A single cell becomes expired
when expiry_timestamp <= now, while a row became expired when
expiry_timestamp < now (notice the strict inequality). For rows inserted
with TTL, this caused non-key cells to expire (change their values to null)
one second before the row disappeared. Now, row expiry logic uses non-strict
inequality.

Fixes #4263,
Fixes #5290.

Tests:

    unit(dev)
    python test described in issue #5290
2019-11-19 18:14:15 +02:00
Amnon Heiman
9df10e2d4b scylla_util.py: Add optional timeout to out function
It is useful to have an option to limit the execution time of a shell
script.

This patch adds an optional timeout parameter. If the parameter is
provided, the command returns with a failure once the duration has passed.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2019-11-19 17:30:28 +02:00
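The optional timeout described in commit 9df10e2d4b can be sketched with the standard library alone. This is a minimal illustration, not the actual `scylla_util.py` code: the function signature below is an assumption, and it relies only on the documented `timeout` argument of `subprocess.check_output`.

```python
import subprocess
import sys


def out(cmd, shell=True, timeout=None):
    """Run cmd and return its stripped standard output as text.

    timeout is a number of seconds; None (the default) waits forever.
    When the limit passes, subprocess kills the child process and
    raises subprocess.TimeoutExpired, so the caller sees a failure.
    """
    raw = subprocess.check_output(cmd, shell=shell, timeout=timeout)
    return raw.strip().decode("utf-8")


def out_or_none(cmd, shell=True, timeout=None):
    """Hypothetical helper: return None instead of raising on timeout."""
    try:
        return out(cmd, shell=shell, timeout=timeout)
    except subprocess.TimeoutExpired:
        return None
```

A caller that does not pass `timeout` keeps the old behaviour, which is the point of making the parameter optional.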
Nadav Har'El
b38c3f1288 Merge "Add separate counters for accesses to system tables"
Merged patch series from Juliusz Stasiewicz:

Welcome to my first PR to Scylla!
The task was intended as a warm-up ("noob") exercise; its description is
here: #4182. Sorry, I also couldn't help it and did some scouting: edited
descriptions of some metrics and shortened a few annoyingly long LoC.
2019-11-19 15:21:56 +02:00
Piotr Dulikowski
9be842d3d8 row_marker: tests for row expiration 2019-11-19 13:45:30 +01:00
Tomasz Grabiec
5e4abd75cc main: Abort on EBADF and ENOTSOCK by default
Those are typically symptoms of use-after-free or memory corruption in
the program. It's better to catch such errors sooner rather than later.

That situation is also dangerous: if the invalid access happens to land
on a valid descriptor other than the one intended for the operation,
the operation may be performed on the wrong file and result in
corruption.

Message-Id: <1565206788-31254-1-git-send-email-tgrabiec@scylladb.com>
2019-11-19 13:07:33 +02:00
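The policy in commit 5e4abd75cc (treat EBADF and ENOTSOCK as fatal rather than as ordinary I/O errors) can be illustrated as follows. This is a sketch of the idea only, not Scylla's C++ implementation; `is_fatal_descriptor_error` and `guarded` are hypothetical names, and the `abort` parameter exists just so the behaviour can be exercised without killing the process.

```python
import errno
import os

# Error codes that usually mean a descriptor was used after it was
# closed, or that the descriptor value itself was corrupted.
FATAL_ERRNOS = frozenset({errno.EBADF, errno.ENOTSOCK})


def is_fatal_descriptor_error(exc):
    """Return True if exc is an OSError carrying EBADF or ENOTSOCK."""
    return isinstance(exc, OSError) and exc.errno in FATAL_ERRNOS


def guarded(operation, abort=os.abort):
    """Run operation(); abort the process on a fatal descriptor error.

    Any other OSError is re-raised unchanged for normal handling.
    """
    try:
        return operation()
    except OSError as exc:
        if is_fatal_descriptor_error(exc):
            abort()  # os.abort() does not return; a test stub may
        raise
```

Aborting immediately keeps the failure close to its cause, whereas continuing risks a later operation hitting an unrelated file that happens to reuse the descriptor number.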
Piotr Dulikowski
589313a110 row_marker: correct expiration condition
This change corrects the condition under which a row was considered
expired by its TTL.

The logic that decides when a row becomes expired was inconsistent with
the logic that decides if a single cell is expired. A single cell
becomes expired when `expiry_timestamp <= now`, while a row became
expired when `expiry_timestamp < now` (notice the strict inequality).
For rows inserted with TTL, this caused non-key cells to expire (change
their values to null) one second before the row disappeared. Now, row
expiry logic uses non-strict inequality.

Fixes: #4263, #5290.

Tests:
- unit(dev)
- python test described in issue #5290
2019-11-19 11:46:59 +01:00
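The off-by-one described in commit 589313a110 comes down to two comparison operators. The sketch below restates the conditions quoted in the message; the function names are illustrative and do not correspond to identifiers in the Scylla source.

```python
def cell_is_expired(expiry_timestamp, now):
    """Cell rule (unchanged by the fix): expired when expiry <= now."""
    return expiry_timestamp <= now


def row_is_expired_before_fix(expiry_timestamp, now):
    """Old row rule: strict inequality, so the row outlives its cells."""
    return expiry_timestamp < now


def row_is_expired_after_fix(expiry_timestamp, now):
    """New row rule: non-strict inequality, matching the cell rule."""
    return expiry_timestamp <= now
```

At the instant `now == expiry_timestamp` the old row rule still reports the row as live while its cells already read as null, which is the one-second window the commit closes.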
Pekka Enberg
505f2c1008 test.py: Append test repeat cycle to output XML filename
Currently, we overwrite the same XML output file for each test repeat
cycle. This can cause invalid XML to be generated if the XML contents
don't match exactly for every iteration.

Fix the problem by appending the test repeat cycle in the XML filename
as follows:

  $ ./test.py --repeat 3 --name vint_serialization_test --mode dev --jenkins jenkins_test

  $ ls -1 *.xml
  jenkins_test.release.vint_serialization_test.0.boost.xml
  jenkins_test.release.vint_serialization_test.1.boost.xml
  jenkins_test.release.vint_serialization_test.2.boost.xml


Fixes #5303.

Message-Id: <20191119092048.16419-1-penberg@scylladb.com>
2019-11-19 11:30:47 +02:00
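The naming scheme from commit 505f2c1008 can be reproduced with a small helper. This is a sketch inferred from the file names listed in the commit message; the function name and parameter names are assumptions, not the actual `test.py` code.

```python
def xml_output_name(jenkins_prefix, mode, test_name, repeat_cycle):
    """Build a per-repeat-cycle XML file name so that runs never collide."""
    return f"{jenkins_prefix}.{mode}.{test_name}.{repeat_cycle}.boost.xml"


def xml_output_names(jenkins_prefix, mode, test_name, repeat):
    """Return one file name for each of the `repeat` cycles, in order."""
    return [xml_output_name(jenkins_prefix, mode, test_name, cycle)
            for cycle in range(repeat)]
```

Because the cycle number is part of the name, each iteration writes its own file instead of overwriting the previous one.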
Rafael Ávila de Espíndola
750adee6e3 lua: fix build with boost 1.67 and older vs fmt
It is not completely clear why the fmt base code fails with boost
1.67, but it is easy to avoid.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20191118210540.129603-1-espindola@scylladb.com>
2019-11-19 11:14:00 +02:00
Tomasz Grabiec
ff567649fa Merge "gossip: Limit number of pending gossip ACK and ACK2 messages" from Asias
In a cross-dc large cluster, the receiver node of the gossip SYN message
might be slow to send the gossip ACK message. The ack messages can be
large if the payload of the application state is big, e.g.,
CACHE_HITRATES with a lot of tables. As a result, unbounded ACK
messages can consume an unbounded amount of memory, which eventually
causes OOM.

To fix, this patch queues the SYN message and handles it later if the
previous ACK message is still being sent. However, we only store the
latest SYN message. Since the latest SYN message from a peer has the
latest information, it is safe to drop the previous SYN message and
keep only the latest one. After this patch, there can be at most 1
pending SYN message and 1 pending ACK message per peer node.
2019-11-18 10:52:38 +01:00
Benny Halevy
f9e93bba38 sstables: compaction: move cleanup parameter to compaction_descriptor
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20191117165806.3234-1-bhalevy@scylladb.com>
2019-11-18 10:52:20 +01:00
Avi Kivity
1fe062aed4 Merge "Add basic UDF support" from Rafael
"

This patch series adds only UDF support, UDA will be in the next patch series.

With this, all CQL types are mapped to Lua. Right now we set up a new
Lua state and copy the values for each argument and return value. This
will be optimized once profiled.

We require --experimental to enable UDF in case there is some change
to the table format.
"

* 'espindola/udf-only-v4' of https://github.com/espindola/scylla: (65 commits)
  Lua: Document the conversions between Lua and CQL
  Lua: Implement decimal subtraction
  Lua: Implement decimal addition
  Lua: Implement support for returning decimal
  Lua: Implement decimal to string conversion
  Lua: Implement decimal to floating point conversion
  Lua: Implement support for decimal arguments
  Lua: Implement support for returning varint
  Lua: Implement support for returning duration
  Lua: Implement support for duration arguments
  Lua: Implement support for returning inet
  Lua: Implement support for inet arguments
  Lua: Implement support for returning time
  Lua: Implement support for time arguments
  Lua: Implement support for returning timeuuid
  Lua: Implement support for returning uuid
  Lua: Implement support for uuid and timeuuid arguments
  Lua: Implement support for returning date
  Lua: Implement support for date arguments
  Lua: Implement support for returning timestamp
  ...
2019-11-17 16:38:19 +02:00
Konstantin Osipov
48f3ca0fcb test.py: use the configured build modes from ninja mode_list
Add mode_list rule to ninja build and use it by default when searching
for tests in test.py.

Now it is no longer necessary to explicitly specify the test mode when
invoking test.py.

(cherry picked from commit a211ff30c7f2de12166d8f6f10d259207b462d4b)
2019-11-17 13:42:10 +01:00
Nadav Har'El
2fb2eb27a2 sstables: allow non-traditional characters in table name
The goal of this patch is to fix issue #5280, a rather serious Alternator
bug, where Scylla fails to restart when an Alternator table has secondary
indexes (LSI or GSI).

Traditionally, Cassandra allows table names to contain only alphanumeric
characters and underscores. However, most of our internal implementation
doesn't actually have this restriction. So Alternator uses the characters
':' and '!' in the table names to mark global and local secondary indexes,
respectively. And this actually works. Or almost...

This patch fixes a problem of listing, during boot, the sstables stored
for tables with such non-traditional names. The sstable listing code
needlessly assumes that the *directory* name, i.e., the CF names, matches
the "\w+" regular expression. When an sstable is found in a directory not
matching such regular expression, the boot fails. But there is no real
reason to require such a strict regular expression. So this patch relaxes
this requirement, and allows Scylla to boot with Alternator's GSI and LSI
tables and their names which include the ":" and "!" characters, and in
fact any other name allowed as a directory name.

Fixes #5280.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20191114153811.17386-1-nyh@scylladb.com>
2019-11-17 14:27:47 +02:00
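The effect of the `\w+` restriction described in commit 2fb2eb27a2 is easy to demonstrate. The sketch below is illustrative only: the two patterns are simplified stand-ins rather than the expressions used in the sstable listing code, and the table names are made up.

```python
import re

# Simplified stand-in for the old check: word characters only.
STRICT_NAME = re.compile(r"\w+", re.ASCII)

# Simplified stand-in for a relaxed check: anything that can be a
# single directory name, i.e. no path separator.
RELAXED_NAME = re.compile(r"[^/]+")


def accepted(pattern, name):
    """Return True if the whole name matches the given pattern."""
    return pattern.fullmatch(name) is not None


# Hypothetical Alternator-style names using ':' and '!' as markers.
EXAMPLE_NAMES = ["orders", "orders:by_date", "orders!:by_customer"]
```

With the strict pattern only the plain name passes, so a directory belonging to an index table would have been rejected during boot; the relaxed pattern accepts all three.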
Shlomi Livne
3e873812a4 Document backport queue and procedure (#5282)
This document adds information about how fixes are tracked for
backporting into releases and what procedure is followed to
backport those fixes.

Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
2019-11-17 01:45:24 -08:00
Benny Halevy
c215ad79a9 scylla-gdb: resolve: add startswith parameter
Allow filtering the resolved addresses by a startswith string.

The common use case is for resolving vtable ptrs, when resolving
the output of `find_vptrs` that may be too long for the host
(running gdb) memory size. In this case the number of vtable
ptrs is considerably smaller than the total number of objects
returned by find_ptrs (e.g. 462 vs. 69625 in an OOM core I
examined from scylla --smp=2 --memory=1024M)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2019-11-17 11:40:54 +02:00
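The prefix filter from commit c215ad79a9 can be pictured as a lazy filter over resolved (address, symbol) pairs. This is a sketch under assumptions: `filter_resolved` and the pair layout are hypothetical and do not mirror the real `scylla-gdb.py` command, which runs inside gdb.

```python
def filter_resolved(resolved, startswith=None):
    """Yield only the (address, name) pairs whose name has the prefix.

    resolved is any iterable of (address, name) pairs, where name may
    be None for an address that did not resolve. With startswith=None
    every pair is passed through. Being a generator, this never holds
    the full unfiltered result in memory.
    """
    for address, name in resolved:
        if startswith is None:
            yield address, name
        elif name is not None and name.startswith(startswith):
            yield address, name
```

Filtering while iterating is what keeps memory use proportional to the matches (hundreds of vtable pointers) rather than to every object found (tens of thousands).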
Benny Halevy
2f688dcf08 scylla-gdb.py: find_single_sstable_readers: fix support for sstable_mutation_reader
provide template arguments for k_l and m readers.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2019-11-17 11:02:05 +02:00
Kamil Braun
a67e887dea sstables: fix sstable file I/O CQL tracing when reading multiple files (#5285)
CQL tracing would only report file I/O involving one sstable, even if
multiple sstables were read from during the query.

Steps to reproduce:

1. create a table with NullCompactionStrategy
2. insert row, flush memtables
3. insert row, flush memtables
4. restart Scylla
5. tracing on
6. select * from table

The trace would only report DMA reads from one of the two sstables.

Kudos to @denesb for catching this.

Related issue: #4908
2019-11-17 00:38:37 -08:00
Tomasz Grabiec
a384d0af76 Merge "A set of cleanups over main() code" from Pavel E.
There are ... signs of massive start/stop code rework in the
main() function. While fixing the sub-module interdependencies
during start/stop I've polished these signs too, so here are the
simplest ones.
2019-11-15 15:25:18 +01:00
Pavel Emelyanov
1dc490c81c tracing: Move register_tracing_keyspace_backend forward decl into proper header
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2019-11-14 19:59:03 +03:00
Pavel Emelyanov
7e81df71ba main: Shorten developer_mode() evaluation
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2019-11-14 19:59:03 +03:00
Pavel Emelyanov
1bd68d87fc main: Do not carry pctx all over the code
v2:
- do not use struct initialization extension

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2019-11-14 19:59:03 +03:00
Pavel Emelyanov
655b6d0d1e main: Hide start_thrift
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2019-11-14 19:59:03 +03:00
Pavel Emelyanov
26f2b2ce5e main,db: Kill some unused .hh includes
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2019-11-14 19:59:03 +03:00
Pavel Emelyanov
f5b345604f main: Factor out get_conf_sub
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2019-11-14 19:59:03 +03:00
Pavel Emelyanov
924d52573d main: Remove unused return_value variable (and capture)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2019-11-14 19:59:03 +03:00
Juliusz Stasiewicz
1cfa458409 metrics: separate counters for `system' KS accesses
Resolves #4182. Metrics for system tables are accumulated separately,
depending on the origin of the query (DB internals vs clients).
2019-11-14 13:14:39 +01:00
Juliusz Stasiewicz
b1e4d222ed cql3: cosmetics - improved description of metrics 2019-11-14 10:35:42 +01:00
Rafael Ávila de Espíndola
10bcbaf348 Lua: Document the conversions between Lua and CQL
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
6ffddeae5e Lua: Implement decimal subtraction
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
aba8e531d1 Lua: Implement decimal addition
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
bb84eabbb3 Lua: Implement support for returning decimal
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
bc17312a86 Lua: Implement decimal to string conversion
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
e83d5bf375 Lua: Implement decimal to floating point conversion
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
b568bf4f54 Lua: Implement support for decimal arguments
This is just the minimum to pass a value to Lua. Right now you can't
actually do anything with it.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
6c3f050eb4 Lua: Implement support for returning varint
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
dc377abd68 Lua: Implement support for returning duration
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
c3f021d2e4 Lua: Implement support for duration arguments
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
9208b2f498 Lua: Implement support for returning inet
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
64be94ab01 Lua: Implement support for inet arguments
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
faf029d472 Lua: Implement support for returning time
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
772f2a4982 Lua: Implement support for time arguments
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
484f498534 Lua: Implement support for returning timeuuid
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
9c2daf6554 Lua: Implement support for returning uuid
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
ae1a1a4085 Lua: Implement support for uuid and timeuuid arguments
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
f8aeed5beb Lua: Implement support for returning date
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
384effa54b Lua: Implement support for date arguments
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
63bc960152 Lua: Implement support for returning timestamp
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
ee95756f62 Lua: Implement support for timestamp arguments
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
1c6d5507b4 Lua: Implement support for returning counter
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
0d9d53b5da Lua: Implement support for counter arguments
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
74c4e58b6b Lua: Add a test for nested types.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
b226511ce8 Lua: Implement support for returning maps
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
5c8d1a797f Lua: Implement support for map arguments
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
b5b15ce4e6 Lua: Implement support for returning set
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
cf7ba441e4 Lua: Implement support for set arguments
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
02f076be43 Lua: Implement support for returning udt
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
92c8e94d9a Lua: Implement support for udt arguments
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
a7c3f6f297 Lua: Implement support for returning list
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
688736f5ff Lua: Implement support for returning tuple
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
ab5708a711 Lua: Implement support for list and tuple arguments
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
534f29172c Lua: Implement support for returning boolean
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
b03c580493 Lua: Implement support for boolean arguments
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
dcfe397eb6 Lua: Implement support for returning floating point
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
cf4b7ab39a Lua: Implement support for returning blob
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
3d22433cd4 Lua: Implement support for blob arguments
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
dd754fcf01 Lua: Implement support for returning ascii
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
affb1f8efd Lua: Implement support for returning text
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
f8ed347ee7 Lua: Implement support for string arguments
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
0e4f047113 Lua: Implement a visitor for return values
This adds support for all integer types. Followup commits will
implement the missing types.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
34b770e2fb Lua: Push varint as decimal
This makes it substantially simpler to support both varint and
decimal, which will be implemented in a followup patch.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
9b3cab8865 Lua: Implement support for varint to integer conversion
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
5a40264d97 Lua: Implement support for varint arguments
Right now it is not possible to do anything with the value.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
3230b8bd86 Lua: Implement support for floating point arguments
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
9ad2cc2850 Lua: Implement a visitor for arguments
With this we support all simple integer types. Followup patches will
implement the missing types.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
ee1d87a600 Lua: Plug in the interpreter
This adds a wrapper around the Lua interpreter so that function
executions are interruptible and return futures.

With this patch it is possible to write and use simple UDFs that take
and return integer values.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
bc3bba1064 Lua: Add lua.cc and lua.hh skeleton files
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
7015e219ca Lua: Link with liblua
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
61200ebb04 Lua: Add config options
This patch just adds the config options that we will expose for the
lua runtime.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
d9337152f3 Use threads when executing user functions
This adds a requires_thread predicate to functions and propagates that
up until we get to code that already returns futures.

We can then use the predicate to decide if we need to use
seastar::async.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
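The predicate propagation described in commit d9337152f3 can be sketched on a toy expression tree. Everything here is hypothetical (class names, the `needs_thread` flag, the two runner callbacks); the real change is in C++ and uses `seastar::async`, which this sketch only stands in for.

```python
class Literal:
    """A constant; evaluating it never needs a thread."""

    def requires_thread(self):
        return False


class FunctionCall:
    """A call node; needs_thread marks functions such as user-defined ones."""

    def __init__(self, name, args, needs_thread=False):
        self.name = name
        self.args = list(args)
        self.needs_thread = needs_thread

    def requires_thread(self):
        # The predicate propagates upward: a call needs a thread if the
        # function itself does or if any argument expression does.
        return self.needs_thread or any(a.requires_thread() for a in self.args)


def evaluate(expr, run_inline, run_in_thread):
    """Pick the cheap inline path unless the expression needs a thread."""
    runner = run_in_thread if expr.requires_thread() else run_inline
    return runner(expr)
```

The top-level caller consults the predicate once and pays for the thread only when some function in the tree actually requires it.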
Rafael Ávila de Espíndola
52b48b415c Test that schema digests with UDFs don't change
This refactors test_schema_digest_does_not_change to also test a
schema with user defined functions and user defined aggregates.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
fc72a64c67 Add schema propagation and storage for UDF
With this it is possible to create user defined functions and
aggregates and they are saved to disk and the schema change is
propagated.

It is just not possible to call them yet.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
ce6304d920 UDF: Add a feature and config option to track if udf is enabled
It can only be enabled with --experimental.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:40:47 -08:00
Rafael Ávila de Espíndola
dd17dfcbef Reject "OR REPLACE ... IF NOT EXISTS" in the grammar
The parser now rejects having both OR REPLACE and IF NOT EXISTS in the
same statement.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:19:52 -08:00
Rafael Ávila de Espíndola
e7e3dab4aa Convert UDF parsing code to c++
For now this just constructs the corresponding c++ classes.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:19:52 -08:00
Rafael Ávila de Espíndola
5c45f3b573 Update UDF syntax
This updates UDF syntax to the current specification.

In particular, this removes DETERMINISTIC and adds "CALLED ON NULL
INPUT" and "RETURNS NULL ON NULL INPUT".

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:19:52 -08:00
Rafael Ávila de Espíndola
c75cd5989c transport: Add support for FUNCTION and AGGREGATE to schema_change
While at it, modernize the code a bit and add a test.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:19:52 -08:00
Rafael Ávila de Espíndola
dac3cf5059 Clear functions between cql_test_env runs
At some point we should make the function list non static, but this
allows us to write tests for now.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:19:52 -08:00
Rafael Ávila de Espíndola
de1a970b93 cql: convert functions to add, remove and replace functions
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:19:52 -08:00
Rafael Ávila de Espíndola
33f9d196f9 Add iterator version of functions::find
This avoids allocating a std::vector and is more flexible since the
iterator can be passed to erase.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:19:52 -08:00
Rafael Ávila de Espíndola
7f9dadee5c Implement functions::type_equals.
Since the types are uniqued we can just use ==.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:19:52 -08:00
Rafael Ávila de Espíndola
5cef5a1b38 types: Add a friend visitor over data_value
This is a simple wrapper that allows code that is not in the types
hierarchy to visit a data_value.

Will be used by UDF.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:19:52 -08:00
Rafael Ávila de Espíndola
9bf9a84e4d types: Move the data_value visitor to a header
It will be used by the UDF implementation.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:19:52 -08:00
Asias He
f32ae00510 gossip: Limit number of pending gossip ACK2 messages
Similar to "gossip: Limit number of pending gossip ACK messages", limit
the number of pending gossip ACK2 messages in gossiper::handle_ack_msg.

Fixes #5210
2019-10-25 12:44:28 +08:00
Asias He
15148182ab gossip: Limit number of pending gossip ACK messages
In a cross-dc large cluster, the receiver node of the gossip SYN message
might be slow to send the gossip ACK message. The ack messages can be
large if the payload of the application state is big, e.g.,
CACHE_HITRATES with a lot of tables. As a result, unbounded ACK
messages can consume an unbounded amount of memory, which eventually
causes OOM.

To fix, this patch queues the SYN message and handles it later if the
previous ACK message is still being sent. However, we only store the
latest SYN message. Since the latest SYN message from a peer has the
latest information, it is safe to drop the previous SYN message and
keep only the latest one. After this patch, there can be at most 1
pending SYN message and 1 pending ACK message per peer node.

Fixes #5210
2019-10-25 12:44:28 +08:00
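The bounding scheme described in commit 15148182ab (at most one ACK in flight and one pending SYN per peer, keeping only the newest SYN) can be sketched as a small state machine. This is an illustration, not the gossiper code: `SynLimiter`, `on_syn`, `on_ack_sent` and the `send_ack` callback are hypothetical names.

```python
class _PeerState:
    def __init__(self):
        self.ack_in_flight = False  # an ACK to this peer is being sent
        self.pending_syn = None     # newest SYN received meanwhile


class SynLimiter:
    """Keep at most one in-flight ACK and one pending SYN per peer."""

    def __init__(self, send_ack):
        # send_ack(peer, syn) starts sending the ACK that answers syn.
        self._send_ack = send_ack
        self._peers = {}

    def on_syn(self, peer, syn):
        """Handle a SYN; return True if an ACK was started right away."""
        state = self._peers.setdefault(peer, _PeerState())
        if state.ack_in_flight:
            # Overwrite any older pending SYN: the newest one carries
            # the newest information, so the older one can be dropped.
            state.pending_syn = syn
            return False
        state.ack_in_flight = True
        self._send_ack(peer, syn)
        return True

    def on_ack_sent(self, peer):
        """Call when the in-flight ACK finished; serve the pending SYN."""
        state = self._peers[peer]
        state.ack_in_flight = False
        if state.pending_syn is not None:
            syn, state.pending_syn = state.pending_syn, None
            state.ack_in_flight = True
            self._send_ack(peer, syn)
```

However many SYN messages arrive while an ACK is being sent, the memory held per peer stays constant, which is what removes the OOM risk described above.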
Benny Halevy
7827e3f11d tests: test_large_data: do not stop database
Now that compaction returns only after the compacted sstables are
deleted, we no longer need to stop the database to force waiting
for deletes (that were previously done asynchronously).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2019-09-02 12:15:38 +03:00
Benny Halevy
19b67d82c9 table::on_compaction_completion: fix indentation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2019-09-02 12:15:38 +03:00
Benny Halevy
8dd6e13468 table::on_compaction_completion: wait for background deletes
Don't let background deletes accumulate uncontrollably.

Fixes #4909

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2019-09-02 12:15:38 +03:00
Benny Halevy
da6645dc2c table: refresh_snapshot before deleting any sstables
The row cache must not hold references to any sstable we're
about to delete.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2019-09-02 12:15:29 +03:00
4964 changed files with 68715 additions and 30666 deletions


@@ -1,3 +1,4 @@
.git
build
seastar/build
testlog

2
.gitignore vendored

@@ -22,3 +22,5 @@ resources
.pytest_cache
/expressions.tokens
tags
testlog/*
test/*/*.reject

3
.gitmodules vendored

@@ -15,3 +15,6 @@
[submodule "zstd"]
path = zstd
url = ../zstd
[submodule "abseil"]
path = abseil
url = ../abseil-cpp


@@ -5,13 +5,25 @@
cmake_minimum_required(VERSION 3.7)
project(scylla)
if(NOT CMAKE_BUILD_TYPE AND NOT CMAKE_CONFIGURATION_TYPES)
message(STATUS "Setting build type to 'Release' as none was specified.")
set(CMAKE_BUILD_TYPE "Release" CACHE
STRING "Choose the type of build." FORCE)
# Set the possible values of build type for cmake-gui
set_property(CACHE CMAKE_BUILD_TYPE PROPERTY STRINGS
"Debug" "Release" "Dev" "Sanitize")
endif()
if(CMAKE_BUILD_TYPE)
string(TOLOWER "${CMAKE_BUILD_TYPE}" BUILD_TYPE)
else()
set(BUILD_TYPE "release")
endif()
if (NOT DEFINED FOR_IDE AND NOT DEFINED ENV{FOR_IDE} AND NOT DEFINED ENV{CLION_IDE})
message(FATAL_ERROR "This CMakeLists.txt file is only valid for use in IDEs, please define FOR_IDE to acknowledge this.")
endif()
# Default value. A more accurate list is populated through `pkg-config` below if `seastar.pc` is available.
set(SEASTAR_INCLUDE_DIRS "seastar")
# These paths are always available, since they're included in the repository. Additional DPDK headers are placed while
# Seastar is built, and are captured in `SEASTAR_INCLUDE_DIRS` through parsing the Seastar pkg-config file (below).
set(SEASTAR_DPDK_INCLUDE_DIRS
@@ -22,9 +34,14 @@ set(SEASTAR_DPDK_INCLUDE_DIRS
find_package(PkgConfig REQUIRED)
set(ENV{PKG_CONFIG_PATH} "${CMAKE_SOURCE_DIR}/seastar/build/release:$ENV{PKG_CONFIG_PATH}")
set(ENV{PKG_CONFIG_PATH} "${CMAKE_SOURCE_DIR}/build/${BUILD_TYPE}/seastar:$ENV{PKG_CONFIG_PATH}")
pkg_check_modules(SEASTAR seastar)
if(NOT SEASTAR_INCLUDE_DIRS)
# Default value. A more accurate list is populated through `pkg-config` below if `seastar.pc` is available.
set(SEASTAR_INCLUDE_DIRS "seastar/include")
endif()
find_package(Boost COMPONENTS filesystem program_options system thread)
##
@@ -70,7 +87,7 @@ scan_scylla_source_directories(
seastar/json
seastar/net
seastar/rpc
seastar/tests
seastar/testing
seastar/util)
scan_scylla_source_directories(
@@ -97,7 +114,7 @@ scan_scylla_source_directories(
service
sstables
streaming
tests
test
thrift
tracing
transport
@@ -106,7 +123,7 @@ scan_scylla_source_directories(
scan_scylla_source_directories(
VAR SCYLLA_GEN_SOURCE_FILES
RECURSIVE
PATHS build/release/gen)
PATHS build/${BUILD_TYPE}/gen)
set(SCYLLA_SOURCE_FILES
${SCYLLA_ROOT_SOURCE_FILES}
@@ -139,4 +156,4 @@ target_include_directories(scylla PUBLIC
${Boost_INCLUDE_DIRS}
xxhash
libdeflate
build/release/gen)
build/${BUILD_TYPE}/gen)


@@ -141,7 +141,7 @@ In v3:
"Tests: unit ({mode}), dtest ({smp})"
```
The usual is "Tests: unit (release)", although running debug tests is encouraged.
The usual is "Tests: unit (dev)", although running debug tests is encouraged.
5. When answering review comments, prefer inline quotes as they make it easier to track the conversation across multiple e-mails.


@@ -5,8 +5,6 @@ F: Filename, directory, or pattern for the subsystem
---
AUTH
M: Paweł Dziepak <pdziepak@scylladb.com>
M: Duarte Nunes <duarte@scylladb.com>
R: Calle Wilund <calle@scylladb.com>
R: Vlad Zolotarov <vladz@scylladb.com>
R: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
@@ -14,22 +12,17 @@ F: auth/*
CACHE
M: Tomasz Grabiec <tgrabiec@scylladb.com>
M: Paweł Dziepak <pdziepak@scylladb.com>
R: Piotr Jastrzebski <piotr@scylladb.com>
F: row_cache*
F: *mutation*
F: tests/mvcc*
COMMITLOG / BATCHLOG
M: Paweł Dziepak <pdziepak@scylladb.com>
M: Duarte Nunes <duarte@scylladb.com>
R: Calle Wilund <calle@scylladb.com>
F: db/commitlog/*
F: db/batch*
COORDINATOR
M: Paweł Dziepak <pdziepak@scylladb.com>
M: Duarte Nunes <duarte@scylladb.com>
R: Gleb Natapov <gleb@scylladb.com>
F: service/storage_proxy*
@@ -49,12 +42,10 @@ M: Pekka Enberg <penberg@scylladb.com>
F: cql3/*
COUNTERS
M: Paweł Dziepak <pdziepak@scylladb.com>
F: counters*
F: tests/counter_test*
GOSSIP
M: Duarte Nunes <duarte@scylladb.com>
M: Tomasz Grabiec <tgrabiec@scylladb.com>
R: Asias He <asias@scylladb.com>
F: gms/*
@@ -65,14 +56,11 @@ F: dist/docker/*
LSA
M: Tomasz Grabiec <tgrabiec@scylladb.com>
M: Paweł Dziepak <pdziepak@scylladb.com>
F: utils/logalloc*
MATERIALIZED VIEWS
M: Duarte Nunes <duarte@scylladb.com>
M: Pekka Enberg <penberg@scylladb.com>
R: Nadav Har'El <nyh@scylladb.com>
R: Duarte Nunes <duarte@scylladb.com>
M: Nadav Har'El <nyh@scylladb.com>
F: db/view/*
F: cql3/statements/*view*
@@ -82,14 +70,12 @@ F: dist/*
REPAIR
M: Tomasz Grabiec <tgrabiec@scylladb.com>
M: Duarte Nunes <duarte@scylladb.com>
R: Asias He <asias@scylladb.com>
R: Nadav Har'El <nyh@scylladb.com>
F: repair/*
SCHEMA MANAGEMENT
M: Tomasz Grabiec <tgrabiec@scylladb.com>
M: Duarte Nunes <duarte@scylladb.com>
M: Pekka Enberg <penberg@scylladb.com>
F: db/schema_tables*
F: db/legacy_schema_migrator*
@@ -98,15 +84,13 @@ F: schema*
SECONDARY INDEXES
M: Pekka Enberg <penberg@scylladb.com>
M: Duarte Nunes <duarte@scylladb.com>
R: Nadav Har'El <nyh@scylladb.com>
M: Nadav Har'El <nyh@scylladb.com>
R: Pekka Enberg <penberg@scylladb.com>
F: db/index/*
F: cql3/statements/*index*
SSTABLES
M: Tomasz Grabiec <tgrabiec@scylladb.com>
M: Duarte Nunes <duarte@scylladb.com>
R: Raphael S. Carvalho <raphaelsc@scylladb.com>
R: Glauber Costa <glauber@scylladb.com>
R: Nadav Har'El <nyh@scylladb.com>
@@ -114,18 +98,17 @@ F: sstables/*
STREAMING
M: Tomasz Grabiec <tgrabiec@scylladb.com>
M: Duarte Nunes <duarte@scylladb.com>
R: Asias He <asias@scylladb.com>
F: streaming/*
F: service/storage_service.*
THRIFT TRANSPORT LAYER
M: Duarte Nunes <duarte@scylladb.com>
F: thrift/*
ALTERNATOR
M: Nadav Har'El <nyh@scylladb.com>
F: alternator/*
F: alternator-test/*
THE REST
M: Avi Kivity <avi@scylladb.com>
M: Paweł Dziepak <pdziepak@scylladb.com>
M: Duarte Nunes <duarte@scylladb.com>
M: Tomasz Grabiec <tgrabiec@scylladb.com>
M: Nadav Har'El <nyh@scylladb.com>
F: *


@@ -27,10 +27,10 @@ Please see [HACKING.md](HACKING.md) for detailed information on building and dev
```
* run Scylla with one CPU and ./tmp as data directory
* run Scylla with one CPU and ./tmp as work directory
```
./build/release/scylla --datadir tmp --commitlog-directory tmp --smp 1
./build/release/scylla --workdir tmp --smp 1
```
* For more run options:
@@ -38,6 +38,10 @@ Please see [HACKING.md](HACKING.md) for detailed information on building and dev
./build/release/scylla --help
```
## Testing
See [test.py manual](docs/testing.md).
## Scylla APIs and compatibility
By default, Scylla is compatible with Apache Cassandra and its APIs - CQL and
Thrift. There is also experimental support for the API of Amazon DynamoDB,
@@ -56,31 +60,12 @@ both.
Seastar documentation can be found [here](http://docs.seastar.io/master/index.html).
User documentation can be found [here](https://docs.scylladb.com/).
## Building Fedora RPM
## Training
As a pre-requisite, you need to install [Mock](https://fedoraproject.org/wiki/Mock) on your machine:
```
# Install mock:
sudo yum install mock
# Add user to the "mock" group:
usermod -a -G mock $USER && newgrp mock
```
Then, to build an RPM, run:
```
./dist/redhat/build_rpm.sh
```
The built RPM is stored in ``/var/lib/mock/<configuration>/result`` directory.
For example, on Fedora 21 mock reports the following:
```
INFO: Done(scylla-server-0.00-1.fc21.src.rpm) Config(default) 20 minutes 7 seconds
INFO: Results and/or logs in: /var/lib/mock/fedora-21-x86_64/result
```
Training material and online courses can be found at [Scylla University](https://university.scylladb.com/).
The courses are free, self-paced and include hands-on examples. They cover a variety of topics including Scylla data modeling,
administration, architecture, basic NoSQL concepts, using drivers for application development, Scylla setup, failover, compactions,
multi-datacenters and how Scylla integrates with third-party applications.
## Building Fedora-based Docker image


@@ -1,7 +1,7 @@
#!/bin/sh
PRODUCT=scylla
VERSION=3.2.5
VERSION=4.0.11
if test -f version
then
@@ -19,6 +19,14 @@ else
SCYLLA_RELEASE=$SCYLLA_BUILD.$DATE.$GIT_COMMIT
fi
if [ -f build/SCYLLA-RELEASE-FILE ]; then
RELEASE_FILE=$(cat build/SCYLLA-RELEASE-FILE)
GIT_COMMIT_FILE=$(cat build/SCYLLA-RELEASE-FILE |cut -d . -f 3)
if [ "$GIT_COMMIT" = "$GIT_COMMIT_FILE" ]; then
exit 0
fi
fi
echo "$SCYLLA_VERSION-$SCYLLA_RELEASE"
mkdir -p build
echo "$SCYLLA_VERSION" > build/SCYLLA-VERSION-FILE
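
The early-exit check in this hunk compares the current commit with the third dot-separated field of the stored release string. A minimal Python sketch of that field extraction follows, assuming the release string has the `BUILD.DATE.COMMIT` shape assembled a few lines earlier in the script. The function name and the sample value are illustrative, not part of the script.

```python
def commit_from_release(release: str) -> str:
    """Return the third dot-separated field, similar to `cut -d . -f 3`.

    Simplification: returns an empty string when fewer than three
    fields are present.
    """
    fields = release.split(".")
    return fields[2] if len(fields) > 2 else ""


# Hypothetical release string in BUILD.DATE.COMMIT form.
sample_release = "0.20201026.4ae9a56466"
stored_commit = commit_from_release(sample_release)

# The script skips regenerating the files when the stored commit
# matches the current one.
current_commit = "4ae9a56466"
up_to_date = (stored_commit == current_commit)
print(stored_commit, up_to_date)
```

If the release string ever contained extra dots before the commit field, the third field would no longer be the commit, so the comparison depends on that format staying stable.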

abseil Submodule

Submodule abseil added at 2069dc796a


@@ -1,40 +0,0 @@
# Copyright 2019 ScyllaDB
#
# This file is part of Scylla.
#
# Scylla is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# Scylla is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU Affero General Public License
# along with Scylla. If not, see <http://www.gnu.org/licenses/>.
# Tests for the ConditionExpression parameter
import pytest
from botocore.exceptions import ClientError
from util import random_string
# Test that ConditionExpression works as expected
@pytest.mark.xfail(reason="ConditionExpression not yet implemented")
def test_update_condition_expression(test_table_s):
p = random_string()
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET b = :val1',
ExpressionAttributeValues={':val1': 4})
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET b = :val1',
ConditionExpression='b = :oldval',
ExpressionAttributeValues={':val1': 6, ':oldval': 4})
with pytest.raises(ClientError, match='ConditionalCheckFailedException.*'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET b = :val1',
ConditionExpression='b = :oldval',
ExpressionAttributeValues={':val1': 8, ':oldval': 4})
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item'] == {'p': p, 'b': 6}
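
The behavior this deleted test expects from `ConditionExpression` can be sketched with a toy in-memory model. This is only an illustration of the compare-then-write semantics, assuming a plain dict as the store. The names `conditional_set` and `ConditionalCheckFailed` are hypothetical and are neither Alternator nor boto3 APIs.

```python
class ConditionalCheckFailed(Exception):
    """Raised when the stored value does not match the expected one."""


def conditional_set(store, key, attr, new_value, expected_old):
    # Read the current item without creating it as a side effect.
    item = store.get(key, {})
    if item.get(attr) != expected_old:
        raise ConditionalCheckFailed(f"{attr} != {expected_old!r}")
    # Condition holds: apply the update.
    store.setdefault(key, {})[attr] = new_value


store = {"p1": {"b": 4}}
conditional_set(store, "p1", "b", 6, expected_old=4)  # succeeds, b becomes 6

try:
    conditional_set(store, "p1", "b", 8, expected_old=4)  # b is 6, so this fails
    failed = False
except ConditionalCheckFailed:
    failed = True

print(store["p1"], failed)
```

As in the test, the failed update leaves the item unchanged, so the final value of `b` is the one written by the last successful conditional update.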


@@ -1,402 +0,0 @@
# Copyright 2019 ScyllaDB
#
# This file is part of Scylla.
#
# Scylla is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# Scylla is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU Affero General Public License
# along with Scylla. If not, see <http://www.gnu.org/licenses/>.
# Tests for the CRUD item operations: PutItem, GetItem, UpdateItem, DeleteItem
import pytest
from botocore.exceptions import ClientError
from decimal import Decimal
from util import random_string, random_bytes
# Basic test for creating a new item with a random name, and reading it back
# with strong consistency.
# Only the string type is used for keys and attributes. None of the various
# optional PutItem features (Expected, ReturnValues, ReturnConsumedCapacity,
# ReturnItemCollectionMetrics, ConditionalOperator, ConditionExpression,
# ExpressionAttributeNames, ExpressionAttributeValues) are used, and
# for GetItem strong consistency is requested as well as all attributes,
# but no other optional features (AttributesToGet, ReturnConsumedCapacity,
# ProjectionExpression, ExpressionAttributeNames)
def test_basic_string_put_and_get(test_table):
p = random_string()
c = random_string()
val = random_string()
val2 = random_string()
test_table.put_item(Item={'p': p, 'c': c, 'attribute': val, 'another': val2})
item = test_table.get_item(Key={'p': p, 'c': c}, ConsistentRead=True)['Item']
assert item['p'] == p
assert item['c'] == c
assert item['attribute'] == val
assert item['another'] == val2
# Similar to test_basic_string_put_and_get, just uses UpdateItem instead of
# PutItem. Because the item does not yet exist, it should work the same.
def test_basic_string_update_and_get(test_table):
p = random_string()
c = random_string()
val = random_string()
val2 = random_string()
test_table.update_item(Key={'p': p, 'c': c}, AttributeUpdates={'attribute': {'Value': val, 'Action': 'PUT'}, 'another': {'Value': val2, 'Action': 'PUT'}})
item = test_table.get_item(Key={'p': p, 'c': c}, ConsistentRead=True)['Item']
assert item['p'] == p
assert item['c'] == c
assert item['attribute'] == val
assert item['another'] == val2
# Test put_item and get_item of various types for the *attributes*,
# including both scalars as well as nested documents, lists and sets.
# The full list of types tested here:
# number, boolean, bytes, null, list, map, string set, number set,
# binary set.
# The keys are still strings.
# Note that only top-level attributes are written and read in this test -
# this test does not attempt to modify *nested* attributes.
# See https://boto3.amazonaws.com/v1/documentation/api/latest/reference/customizations/dynamodb.html
# on how to pass these various types to Boto3's put_item().
def test_put_and_get_attribute_types(test_table):
key = {'p': random_string(), 'c': random_string()}
test_items = [
Decimal("12.345"),
42,
True,
False,
b'xyz',
None,
['hello', 'world', 42],
{'hello': 'world', 'life': 42},
{'hello': {'test': 'hi', 'hello': True, 'list': [1, 2, 'hi']}},
set(['hello', 'world', 'hi']),
set([1, 42, Decimal("3.14")]),
set([b'xyz', b'hi']),
]
item = { str(i) : test_items[i] for i in range(len(test_items)) }
item.update(key)
test_table.put_item(Item=item)
got_item = test_table.get_item(Key=key, ConsistentRead=True)['Item']
assert item == got_item
# The test_empty_* tests below verify support for empty items, with no
# attributes except the key. This is a difficult case for Scylla, because
# for an empty row to exist, Scylla needs to add a "CQL row marker".
# There are several ways to create empty items - via PutItem, UpdateItem
# and deleting attributes from non-empty items, and we need to check them
# all, in several test_empty_* tests:
def test_empty_put(test_table):
p = random_string()
c = random_string()
test_table.put_item(Item={'p': p, 'c': c})
item = test_table.get_item(Key={'p': p, 'c': c}, ConsistentRead=True)['Item']
assert item == {'p': p, 'c': c}
def test_empty_put_delete(test_table):
p = random_string()
c = random_string()
test_table.put_item(Item={'p': p, 'c': c, 'hello': 'world'})
test_table.update_item(Key={'p': p, 'c': c}, AttributeUpdates={'hello': {'Action': 'DELETE'}})
item = test_table.get_item(Key={'p': p, 'c': c}, ConsistentRead=True)['Item']
assert item == {'p': p, 'c': c}
def test_empty_update(test_table):
p = random_string()
c = random_string()
test_table.update_item(Key={'p': p, 'c': c}, AttributeUpdates={})
item = test_table.get_item(Key={'p': p, 'c': c}, ConsistentRead=True)['Item']
assert item == {'p': p, 'c': c}
def test_empty_update_delete(test_table):
p = random_string()
c = random_string()
test_table.update_item(Key={'p': p, 'c': c}, AttributeUpdates={'hello': {'Value': 'world', 'Action': 'PUT'}})
test_table.update_item(Key={'p': p, 'c': c}, AttributeUpdates={'hello': {'Action': 'DELETE'}})
item = test_table.get_item(Key={'p': p, 'c': c}, ConsistentRead=True)['Item']
assert item == {'p': p, 'c': c}
# Test error handling of UpdateItem passed a bad "Action" field.
def test_update_bad_action(test_table):
p = random_string()
c = random_string()
val = random_string()
with pytest.raises(ClientError, match='ValidationException'):
test_table.update_item(Key={'p': p, 'c': c}, AttributeUpdates={'attribute': {'Value': val, 'Action': 'NONEXISTENT'}})
# A more elaborate UpdateItem test, updating different attributes at different
# times. Includes PUT and DELETE operations.
def test_basic_string_more_update(test_table):
p = random_string()
c = random_string()
val1 = random_string()
val2 = random_string()
val3 = random_string()
val4 = random_string()
test_table.update_item(Key={'p': p, 'c': c}, AttributeUpdates={'a3': {'Value': val1, 'Action': 'PUT'}})
test_table.update_item(Key={'p': p, 'c': c}, AttributeUpdates={'a1': {'Value': val1, 'Action': 'PUT'}})
test_table.update_item(Key={'p': p, 'c': c}, AttributeUpdates={'a2': {'Value': val2, 'Action': 'PUT'}})
test_table.update_item(Key={'p': p, 'c': c}, AttributeUpdates={'a1': {'Value': val3, 'Action': 'PUT'}})
test_table.update_item(Key={'p': p, 'c': c}, AttributeUpdates={'a3': {'Action': 'DELETE'}})
item = test_table.get_item(Key={'p': p, 'c': c}, ConsistentRead=True)['Item']
assert item['p'] == p
assert item['c'] == c
assert item['a1'] == val3
assert item['a2'] == val2
assert not 'a3' in item
# Test that item operations on a nonexistent table name fail with the correct
# error code.
def test_item_operations_nonexistent_table(dynamodb):
with pytest.raises(ClientError, match='ResourceNotFoundException'):
dynamodb.meta.client.put_item(TableName='non_existent_table',
Item={'a':{'S':'b'}})
# Fetching a nonexistent item. According to the DynamoDB doc, "If there is no
# matching item, GetItem does not return any data and there will be no Item
# element in the response."
def test_get_item_missing_item(test_table):
p = random_string()
c = random_string()
assert not "Item" in test_table.get_item(Key={'p': p, 'c': c}, ConsistentRead=True)
# Test that if we have a table with string hash and sort keys, we can't read
# or write items with other key types to it.
def test_put_item_wrong_key_type(test_table):
b = random_bytes()
s = random_string()
n = Decimal("3.14")
# Should succeed (correct key types)
test_table.put_item(Item={'p': s, 'c': s})
assert test_table.get_item(Key={'p': s, 'c': s}, ConsistentRead=True)['Item'] == {'p': s, 'c': s}
# Should fail (incorrect hash key types)
with pytest.raises(ClientError, match='ValidationException'):
test_table.put_item(Item={'p': b, 'c': s})
with pytest.raises(ClientError, match='ValidationException'):
test_table.put_item(Item={'p': n, 'c': s})
# Should fail (incorrect sort key types)
with pytest.raises(ClientError, match='ValidationException'):
test_table.put_item(Item={'p': s, 'c': b})
with pytest.raises(ClientError, match='ValidationException'):
test_table.put_item(Item={'p': s, 'c': n})
# Should fail (missing hash key)
with pytest.raises(ClientError, match='ValidationException'):
test_table.put_item(Item={'c': s})
# Should fail (missing sort key)
with pytest.raises(ClientError, match='ValidationException'):
test_table.put_item(Item={'p': s})
def test_update_item_wrong_key_type(test_table, test_table_s):
b = random_bytes()
s = random_string()
n = Decimal("3.14")
# Should succeed (correct key types)
test_table.update_item(Key={'p': s, 'c': s}, AttributeUpdates={})
assert test_table.get_item(Key={'p': s, 'c': s}, ConsistentRead=True)['Item'] == {'p': s, 'c': s}
# Should fail (incorrect hash key types)
with pytest.raises(ClientError, match='ValidationException'):
test_table.update_item(Key={'p': b, 'c': s}, AttributeUpdates={})
with pytest.raises(ClientError, match='ValidationException'):
test_table.update_item(Key={'p': n, 'c': s}, AttributeUpdates={})
# Should fail (incorrect sort key types)
with pytest.raises(ClientError, match='ValidationException'):
test_table.update_item(Key={'p': s, 'c': b}, AttributeUpdates={})
with pytest.raises(ClientError, match='ValidationException'):
test_table.update_item(Key={'p': s, 'c': n}, AttributeUpdates={})
# Should fail (missing hash key)
with pytest.raises(ClientError, match='ValidationException'):
test_table.update_item(Key={'c': s}, AttributeUpdates={})
# Should fail (missing sort key)
with pytest.raises(ClientError, match='ValidationException'):
test_table.update_item(Key={'p': s}, AttributeUpdates={})
# Should fail (spurious key columns)
with pytest.raises(ClientError, match='ValidationException'):
test_table.get_item(Key={'p': s, 'c': s, 'spurious': s})
with pytest.raises(ClientError, match='ValidationException'):
test_table_s.get_item(Key={'p': s, 'c': s})
def test_get_item_wrong_key_type(test_table, test_table_s):
b = random_bytes()
s = random_string()
n = Decimal("3.14")
# Should succeed (correct key types) but have empty result
assert not "Item" in test_table.get_item(Key={'p': s, 'c': s}, ConsistentRead=True)
# Should fail (incorrect hash key types)
with pytest.raises(ClientError, match='ValidationException'):
test_table.get_item(Key={'p': b, 'c': s})
with pytest.raises(ClientError, match='ValidationException'):
test_table.get_item(Key={'p': n, 'c': s})
# Should fail (incorrect sort key types)
with pytest.raises(ClientError, match='ValidationException'):
test_table.get_item(Key={'p': s, 'c': b})
with pytest.raises(ClientError, match='ValidationException'):
test_table.get_item(Key={'p': s, 'c': n})
# Should fail (missing hash key)
with pytest.raises(ClientError, match='ValidationException'):
test_table.get_item(Key={'c': s})
# Should fail (missing sort key)
with pytest.raises(ClientError, match='ValidationException'):
test_table.get_item(Key={'p': s})
# Should fail (spurious key columns)
with pytest.raises(ClientError, match='ValidationException'):
test_table.get_item(Key={'p': s, 'c': s, 'spurious': s})
with pytest.raises(ClientError, match='ValidationException'):
test_table_s.get_item(Key={'p': s, 'c': s})
def test_delete_item_wrong_key_type(test_table, test_table_s):
b = random_bytes()
s = random_string()
n = Decimal("3.14")
# Should succeed (correct key types)
test_table.delete_item(Key={'p': s, 'c': s})
# Should fail (incorrect hash key types)
with pytest.raises(ClientError, match='ValidationException'):
test_table.delete_item(Key={'p': b, 'c': s})
with pytest.raises(ClientError, match='ValidationException'):
test_table.delete_item(Key={'p': n, 'c': s})
# Should fail (incorrect sort key types)
with pytest.raises(ClientError, match='ValidationException'):
test_table.delete_item(Key={'p': s, 'c': b})
with pytest.raises(ClientError, match='ValidationException'):
test_table.delete_item(Key={'p': s, 'c': n})
# Should fail (missing hash key)
with pytest.raises(ClientError, match='ValidationException'):
test_table.delete_item(Key={'c': s})
# Should fail (missing sort key)
with pytest.raises(ClientError, match='ValidationException'):
test_table.delete_item(Key={'p': s})
# Should fail (spurious key columns)
with pytest.raises(ClientError, match='ValidationException'):
test_table.delete_item(Key={'p': s, 'c': s, 'spurious': s})
with pytest.raises(ClientError, match='ValidationException'):
test_table_s.delete_item(Key={'p': s, 'c': s})
# Most of the tests here arbitrarily used a table with both hash and sort keys
# (both strings). Let's check that a table with *only* a hash key works ok
# too, for PutItem, GetItem, and UpdateItem.
def test_only_hash_key(test_table_s):
s = random_string()
test_table_s.put_item(Item={'p': s, 'hello': 'world'})
assert test_table_s.get_item(Key={'p': s}, ConsistentRead=True)['Item'] == {'p': s, 'hello': 'world'}
test_table_s.update_item(Key={'p': s}, AttributeUpdates={'hi': {'Value': 'there', 'Action': 'PUT'}})
assert test_table_s.get_item(Key={'p': s}, ConsistentRead=True)['Item'] == {'p': s, 'hello': 'world', 'hi': 'there'}
# Tests for item operations in tables with non-string hash or sort keys.
# These tests focus only on the type of the key - everything else is as
# simple as we can (string attributes, no special options for GetItem
# and PutItem). These tests also focus on individual items only, and
# not on the sort order of sort keys - this should be verified in
# test_query.py, for example.
def test_bytes_hash_key(test_table_b):
# Bytes values are passed using base64 encoding, which has weird cases
# depending on len%3 and len%4. So let's try various lengths.
for len in range(10,18):
p = random_bytes(len)
val = random_string()
test_table_b.put_item(Item={'p': p, 'attribute': val})
assert test_table_b.get_item(Key={'p': p}, ConsistentRead=True)['Item'] == {'p': p, 'attribute': val}
def test_bytes_sort_key(test_table_sb):
p = random_string()
c = random_bytes()
val = random_string()
test_table_sb.put_item(Item={'p': p, 'c': c, 'attribute': val})
assert test_table_sb.get_item(Key={'p': p, 'c': c}, ConsistentRead=True)['Item'] == {'p': p, 'c': c, 'attribute': val}
# Tests for using a large binary blob as hash key, sort key, or attribute.
# DynamoDB strictly limits the size of the binary hash key to 2048 bytes,
# and binary sort key to 1024 bytes, and refuses anything larger. The total
# size of an item is limited to 400KB, which also limits the size of the
# largest attributes. For more details on these limits, see
# https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Limits.html
# Alternator currently does *not* have these limitations, and can accept much
# larger keys and attributes, but what we do in the following tests is to verify
# that items up to DynamoDB's maximum sizes also work well in Alternator.
def test_large_blob_hash_key(test_table_b):
b = random_bytes(2048)
test_table_b.put_item(Item={'p': b})
assert test_table_b.get_item(Key={'p': b}, ConsistentRead=True)['Item'] == {'p': b}
def test_large_blob_sort_key(test_table_sb):
s = random_string()
b = random_bytes(1024)
test_table_sb.put_item(Item={'p': s, 'c': b})
assert test_table_sb.get_item(Key={'p': s, 'c': b}, ConsistentRead=True)['Item'] == {'p': s, 'c': b}
def test_large_blob_attribute(test_table):
p = random_string()
c = random_string()
b = random_bytes(409500) # a bit less than 400KB
test_table.put_item(Item={'p': p, 'c': c, 'attribute': b })
assert test_table.get_item(Key={'p': p, 'c': c}, ConsistentRead=True)['Item'] == {'p': p, 'c': c, 'attribute': b}
# Checks that a single UpdateItem request is not allowed to use both
# old-style AttributeUpdates and new-style UpdateExpression.
def test_update_item_two_update_methods(test_table_s):
p = random_string()
with pytest.raises(ClientError, match='ValidationException'):
test_table_s.update_item(Key={'p': p},
AttributeUpdates={'a': {'Value': 3, 'Action': 'PUT'}},
UpdateExpression='SET b = :val1',
ExpressionAttributeValues={':val1': 4})
# Verify that having neither AttributeUpdates nor UpdateExpression is
# allowed, and results in creation of an empty item.
def test_update_item_no_update_method(test_table_s):
p = random_string()
assert not "Item" in test_table_s.get_item(Key={'p': p}, ConsistentRead=True)
test_table_s.update_item(Key={'p': p})
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item'] == {'p': p}
# Test GetItem with the AttributesToGet parameter. Result should include the
# selected attributes only - if one wants the key attributes as well, one
# needs to select them explicitly. When no key attributes are selected,
# some items may have *none* of the selected attributes. Those items are
# returned too, as empty items - they are not outright missing.
def test_getitem_attributes_to_get(dynamodb, test_table):
p = random_string()
c = random_string()
item = {'p': p, 'c': c, 'a': 'hello', 'b': 'hi'}
test_table.put_item(Item=item)
for wanted in [ ['a'], # only non-key attribute
['c', 'a'], # a key attribute (sort key) and non-key
['p', 'c'], # entire key
['nonexistent'] # Our item doesn't have this
]:
got_item = test_table.get_item(Key={'p': p, 'c': c}, AttributesToGet=wanted, ConsistentRead=True)['Item']
expected_item = {k: item[k] for k in wanted if k in item}
assert expected_item == got_item
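
The expected result that `test_getitem_attributes_to_get` computes is a plain projection of the item onto the requested attribute names. A minimal sketch of that projection, using a hypothetical helper name `project` and made-up key values:

```python
def project(item, wanted):
    """Keep only the requested attributes that the item actually has."""
    return {k: item[k] for k in wanted if k in item}


item = {'p': 'x', 'c': 'y', 'a': 'hello', 'b': 'hi'}

only_a = project(item, ['a'])              # non-key attribute only
key_and_a = project(item, ['c', 'a'])      # sort key plus a non-key attribute
missing = project(item, ['nonexistent'])   # item lacks this attribute

print(only_a, key_and_a, missing)
```

The last case yields an empty dict rather than no result at all, which matches the comment above: such items are returned as empty items, not omitted.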
# Basic test for DeleteItem, with hash key only
def test_delete_item_hash(test_table_s):
p = random_string()
test_table_s.put_item(Item={'p': p})
assert 'Item' in test_table_s.get_item(Key={'p': p}, ConsistentRead=True)
test_table_s.delete_item(Key={'p': p})
assert not 'Item' in test_table_s.get_item(Key={'p': p}, ConsistentRead=True)
# Basic test for DeleteItem, with hash and sort key
def test_delete_item_sort(test_table):
p = random_string()
c = random_string()
key = {'p': p, 'c': c}
test_table.put_item(Item=key)
assert 'Item' in test_table.get_item(Key=key, ConsistentRead=True)
test_table.delete_item(Key=key)
assert not 'Item' in test_table.get_item(Key=key, ConsistentRead=True)
# Test that PutItem completely replaces an existing item. It shouldn't merge
# it with a previously existing value, as UpdateItem does!
# We test for a table with just hash key, and for a table with both hash and
# sort keys.
def test_put_item_replace(test_table_s, test_table):
p = random_string()
test_table_s.put_item(Item={'p': p, 'a': 'hi'})
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item'] == {'p': p, 'a': 'hi'}
test_table_s.put_item(Item={'p': p, 'b': 'hello'})
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item'] == {'p': p, 'b': 'hello'}
c = random_string()
test_table.put_item(Item={'p': p, 'c': c, 'a': 'hi'})
assert test_table.get_item(Key={'p': p, 'c': c}, ConsistentRead=True)['Item'] == {'p': p, 'c': c, 'a': 'hi'}
test_table.put_item(Item={'p': p, 'c': c, 'b': 'hello'})
assert test_table.get_item(Key={'p': p, 'c': c}, ConsistentRead=True)['Item'] == {'p': p, 'c': c, 'b': 'hello'}
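
The comment in `test_bytes_hash_key` says base64 has different cases depending on the input length, which is why that test loops over lengths 10 through 17. The sketch below uses only the standard `base64` module to show how the padding varies over that same range. The helper name `padding_chars` is illustrative.

```python
import base64


def padding_chars(n: int) -> int:
    """Count the '=' padding characters when encoding n zero bytes."""
    encoded = base64.b64encode(bytes(n))
    return len(encoded) - len(encoded.rstrip(b"="))


# Same range of lengths as the test loop.
paddings = [padding_chars(n) for n in range(10, 18)]
for n, pad in zip(range(10, 18), paddings):
    print(n, n % 3, pad)
```

Every residue of the length modulo 3 appears at least twice in this range, so all three padding cases (two, one, and zero `=` characters) are exercised.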


@@ -1,358 +0,0 @@
# -*- coding: utf-8 -*-
# Copyright 2019 ScyllaDB
#
# This file is part of Scylla.
#
# Scylla is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# Scylla is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU Affero General Public License
# along with Scylla. If not, see <http://www.gnu.org/licenses/>.
# Tests for the Query operation
import random
import pytest
from botocore.exceptions import ClientError
from decimal import Decimal
from util import random_string, random_bytes, full_query, multiset
from boto3.dynamodb.conditions import Key, Attr
# Test that querying works fine with the stock paginator
def test_query_basic_restrictions(dynamodb, filled_test_table):
test_table, items = filled_test_table
paginator = dynamodb.meta.client.get_paginator('query')
# EQ
got_items = []
for page in paginator.paginate(TableName=test_table.name, KeyConditions={
'p' : {'AttributeValueList': ['long'], 'ComparisonOperator': 'EQ'}
}):
got_items += page['Items']
print(got_items)
assert multiset([item for item in items if item['p'] == 'long']) == multiset(got_items)
# LT
got_items = []
for page in paginator.paginate(TableName=test_table.name, KeyConditions={
'p' : {'AttributeValueList': ['long'], 'ComparisonOperator': 'EQ'},
'c' : {'AttributeValueList': ['12'], 'ComparisonOperator': 'LT'}
}):
got_items += page['Items']
print(got_items)
assert multiset([item for item in items if item['p'] == 'long' and item['c'] < '12']) == multiset(got_items)
# LE
got_items = []
for page in paginator.paginate(TableName=test_table.name, KeyConditions={
'p' : {'AttributeValueList': ['long'], 'ComparisonOperator': 'EQ'},
'c' : {'AttributeValueList': ['14'], 'ComparisonOperator': 'LE'}
}):
got_items += page['Items']
print(got_items)
assert multiset([item for item in items if item['p'] == 'long' and item['c'] <= '14']) == multiset(got_items)
# GT
got_items = []
for page in paginator.paginate(TableName=test_table.name, KeyConditions={
'p' : {'AttributeValueList': ['long'], 'ComparisonOperator': 'EQ'},
'c' : {'AttributeValueList': ['15'], 'ComparisonOperator': 'GT'}
}):
got_items += page['Items']
print(got_items)
assert multiset([item for item in items if item['p'] == 'long' and item['c'] > '15']) == multiset(got_items)
# GE
got_items = []
for page in paginator.paginate(TableName=test_table.name, KeyConditions={
'p' : {'AttributeValueList': ['long'], 'ComparisonOperator': 'EQ'},
'c' : {'AttributeValueList': ['14'], 'ComparisonOperator': 'GE'}
}):
got_items += page['Items']
print(got_items)
assert multiset([item for item in items if item['p'] == 'long' and item['c'] >= '14']) == multiset(got_items)
# BETWEEN
got_items = []
for page in paginator.paginate(TableName=test_table.name, KeyConditions={
'p' : {'AttributeValueList': ['long'], 'ComparisonOperator': 'EQ'},
'c' : {'AttributeValueList': ['155', '164'], 'ComparisonOperator': 'BETWEEN'}
}):
got_items += page['Items']
print(got_items)
assert multiset([item for item in items if item['p'] == 'long' and item['c'] >= '155' and item['c'] <= '164']) == multiset(got_items)
# BEGINS_WITH
got_items = []
for page in paginator.paginate(TableName=test_table.name, KeyConditions={
'p' : {'AttributeValueList': ['long'], 'ComparisonOperator': 'EQ'},
'c' : {'AttributeValueList': ['11'], 'ComparisonOperator': 'BEGINS_WITH'}
}):
print([item for item in items if item['p'] == 'long' and item['c'].startswith('11')])
got_items += page['Items']
print(got_items)
assert multiset([item for item in items if item['p'] == 'long' and item['c'].startswith('11')]) == multiset(got_items)
# Test that KeyConditionExpression parameter is supported
@pytest.mark.xfail(reason="KeyConditionExpression not supported yet")
def test_query_key_condition_expression(dynamodb, filled_test_table):
test_table, items = filled_test_table
paginator = dynamodb.meta.client.get_paginator('query')
got_items = []
for page in paginator.paginate(TableName=test_table.name, KeyConditionExpression=Key("p").eq("long") & Key("c").lt("12")):
got_items += page['Items']
print(got_items)
assert multiset([item for item in items if item['p'] == 'long' and item['c'] < '12']) == multiset(got_items)
def test_begins_with(dynamodb, test_table):
paginator = dynamodb.meta.client.get_paginator('query')
items = [{'p': 'unorthodox_chars', 'c': sort_key, 'str': 'a'} for sort_key in [u'ÿÿÿ', u'cÿbÿ', u'cÿbÿÿabg'] ]
with test_table.batch_writer() as batch:
for item in items:
batch.put_item(item)
# TODO(sarna): Once bytes type is supported, the \xFF character should be tested
got_items = []
for page in paginator.paginate(TableName=test_table.name, KeyConditions={
'p' : {'AttributeValueList': ['unorthodox_chars'], 'ComparisonOperator': 'EQ'},
'c' : {'AttributeValueList': [u'ÿÿ'], 'ComparisonOperator': 'BEGINS_WITH'}
}):
got_items += page['Items']
print(got_items)
assert sorted([d['c'] for d in got_items]) == sorted([d['c'] for d in items if d['c'].startswith(u'ÿÿ')])
got_items = []
for page in paginator.paginate(TableName=test_table.name, KeyConditions={
'p' : {'AttributeValueList': ['unorthodox_chars'], 'ComparisonOperator': 'EQ'},
'c' : {'AttributeValueList': [u'cÿbÿ'], 'ComparisonOperator': 'BEGINS_WITH'}
}):
got_items += page['Items']
print(got_items)
assert sorted([d['c'] for d in got_items]) == sorted([d['c'] for d in items if d['c'].startswith(u'cÿbÿ')])
def test_begins_with_wrong_type(dynamodb, test_table_sn):
paginator = dynamodb.meta.client.get_paginator('query')
with pytest.raises(ClientError, match='ValidationException'):
for page in paginator.paginate(TableName=test_table_sn.name, KeyConditions={
'p' : {'AttributeValueList': ['unorthodox_chars'], 'ComparisonOperator': 'EQ'},
'c' : {'AttributeValueList': [17], 'ComparisonOperator': 'BEGINS_WITH'}
}):
pass
# Items returned by Query should be sorted by the sort key. The following
# tests verify that this is indeed the case, for the three allowed key types:
# strings, binary, and numbers. These tests exercise not just the Query
# operation but also, inherently, the sort-key ordering itself.
def test_query_sort_order_string(test_table):
# Insert a lot of random items in one new partition:
# str(i) has a non-obvious sort order (e.g., "100" comes before "2") so is a nice test.
p = random_string()
items = [{'p': p, 'c': str(i)} for i in range(128)]
with test_table.batch_writer() as batch:
for item in items:
batch.put_item(item)
got_items = full_query(test_table, KeyConditions={'p': {'AttributeValueList': [p], 'ComparisonOperator': 'EQ'}})
assert len(items) == len(got_items)
# Extract just the sort key ("c") from the items
sort_keys = [x['c'] for x in items]
got_sort_keys = [x['c'] for x in got_items]
# Verify that got_sort_keys are already sorted (in string order)
assert sorted(got_sort_keys) == got_sort_keys
# Verify that got_sort_keys are a sorted version of the expected sort_keys
assert sorted(sort_keys) == got_sort_keys
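The comment in the test above notes that `str(i)` has a non-obvious sort order. A minimal standalone sketch of that point, assuming only that string sort keys compare lexicographically; it does not talk to DynamoDB or Alternator, and the variable names are illustrative:

```python
# Sketch: string sort keys order lexicographically, not numerically.
keys = [str(i) for i in range(128)]

# The order the test expects Query to return for a string sort key.
lexicographic = sorted(keys)
# The order one would get if the same values were compared as numbers.
numeric = sorted(keys, key=int)

# "100" sorts before "2" as a string, because "1" < "2" character-wise.
assert lexicographic.index("100") < lexicographic.index("2")
assert numeric.index("100") > numeric.index("2")
assert lexicographic != numeric
```

This is why comparing `sorted(sort_keys)` against the returned keys is a meaningful check: a backend that sorted these keys numerically would fail it.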
def test_query_sort_order_bytes(test_table_sb):
# Insert a lot of random items in one new partition:
# We arbitrarily use random_bytes with a random length.
p = random_string()
items = [{'p': p, 'c': random_bytes(10)} for i in range(128)]
with test_table_sb.batch_writer() as batch:
for item in items:
batch.put_item(item)
got_items = full_query(test_table_sb, KeyConditions={'p': {'AttributeValueList': [p], 'ComparisonOperator': 'EQ'}})
assert len(items) == len(got_items)
sort_keys = [x['c'] for x in items]
got_sort_keys = [x['c'] for x in got_items]
# Boto3's "Binary" objects are sorted as if bytes are signed integers.
# This isn't the order that DynamoDB itself uses (byte 0 should be first,
# not byte -128). Sorting the byte array ".value" works.
assert sorted(got_sort_keys, key=lambda x: x.value) == got_sort_keys
assert sorted(sort_keys) == got_sort_keys
def test_query_sort_order_number(test_table_sn):
# This is a list of numbers, sorted in correct order, and each suitable
# for accurate representation by Alternator's number type.
numbers = [
Decimal("-2e10"),
Decimal("-7.1e2"),
Decimal("-4.1"),
Decimal("-0.1"),
Decimal("-1e-5"),
Decimal("0"),
Decimal("2e-5"),
Decimal("0.15"),
Decimal("1"),
Decimal("1.00000000000000000000000001"),
Decimal("3.14159"),
Decimal("3.1415926535897932384626433832795028841"),
Decimal("31.4"),
Decimal("1.4e10"),
]
# Insert these numbers, in random order, into one partition:
p = random_string()
items = [{'p': p, 'c': num} for num in random.sample(numbers, len(numbers))]
with test_table_sn.batch_writer() as batch:
for item in items:
batch.put_item(item)
# Finally, verify that we get back exactly the same numbers (with identical
# precision), and in their original sorted order.
got_items = full_query(test_table_sn, KeyConditions={'p': {'AttributeValueList': [p], 'ComparisonOperator': 'EQ'}})
got_sort_keys = [x['c'] for x in got_items]
assert got_sort_keys == numbers
def test_query_filtering_attributes_equality(filled_test_table):
test_table, items = filled_test_table
query_filter = {
"attribute" : {
"AttributeValueList" : [ "xxxx" ],
"ComparisonOperator": "EQ"
}
}
got_items = full_query(test_table, KeyConditions={'p': {'AttributeValueList': ['long'], 'ComparisonOperator': 'EQ'}}, QueryFilter=query_filter)
print(got_items)
assert multiset([item for item in items if item['p'] == 'long' and item['attribute'] == 'xxxx']) == multiset(got_items)
query_filter = {
"attribute" : {
"AttributeValueList" : [ "xxxx" ],
"ComparisonOperator": "EQ"
},
"another" : {
"AttributeValueList" : [ "yy" ],
"ComparisonOperator": "EQ"
}
}
got_items = full_query(test_table, KeyConditions={'p': {'AttributeValueList': ['long'], 'ComparisonOperator': 'EQ'}}, QueryFilter=query_filter)
print(got_items)
assert multiset([item for item in items if item['p'] == 'long' and item['attribute'] == 'xxxx' and item['another'] == 'yy']) == multiset(got_items)
# Test that FilterExpression works as expected
@pytest.mark.xfail(reason="FilterExpression not supported yet")
def test_query_filter_expression(filled_test_table):
test_table, items = filled_test_table
got_items = full_query(test_table, KeyConditions={'p': {'AttributeValueList': ['long'], 'ComparisonOperator': 'EQ'}}, FilterExpression=Attr("attribute").eq("xxxx"))
print(got_items)
assert multiset([item for item in items if item['p'] == 'long' and item['attribute'] == 'xxxx']) == multiset(got_items)
got_items = full_query(test_table, KeyConditions={'p': {'AttributeValueList': ['long'], 'ComparisonOperator': 'EQ'}}, FilterExpression=Attr("attribute").eq("xxxx") & Attr("another").eq("yy"))
print(got_items)
assert multiset([item for item in items if item['p'] == 'long' and item['attribute'] == 'xxxx' and item['another'] == 'yy']) == multiset(got_items)
# QueryFilter can only contain non-key attributes, in order to be compatible
# with DynamoDB
def test_query_filtering_key_equality(filled_test_table):
test_table, items = filled_test_table
with pytest.raises(ClientError, match='ValidationException'):
query_filter = {
"c" : {
"AttributeValueList" : [ "5" ],
"ComparisonOperator": "EQ"
}
}
got_items = full_query(test_table, KeyConditions={'p': {'AttributeValueList': ['long'], 'ComparisonOperator': 'EQ'}}, QueryFilter=query_filter)
print(got_items)
with pytest.raises(ClientError, match='ValidationException'):
query_filter = {
"attribute" : {
"AttributeValueList" : [ "x" ],
"ComparisonOperator": "EQ"
},
"p" : {
"AttributeValueList" : [ "5" ],
"ComparisonOperator": "EQ"
}
}
got_items = full_query(test_table, KeyConditions={'p': {'AttributeValueList': ['long'], 'ComparisonOperator': 'EQ'}}, QueryFilter=query_filter)
print(got_items)
# Test Query with the AttributesToGet parameter. Result should include the
# selected attributes only - if one wants the key attributes as well, one
# needs to select them explicitly. When no key attributes are selected,
# some items may have *none* of the selected attributes. Those items are
# returned too, as empty items - they are not outright missing.
def test_query_attributes_to_get(dynamodb, test_table):
p = random_string()
items = [{'p': p, 'c': str(i), 'a': str(i*10), 'b': str(i*100) } for i in range(10)]
with test_table.batch_writer() as batch:
for item in items:
batch.put_item(item)
for wanted in [ ['a'], # only non-key attributes
['c', 'a'], # a key attribute (sort key) and non-key
['p', 'c'], # entire key
['nonexistent'] # none of the items have this attribute!
]:
got_items = full_query(test_table, KeyConditions={'p': {'AttributeValueList': [p], 'ComparisonOperator': 'EQ'}}, AttributesToGet=wanted)
expected_items = [{k: x[k] for k in wanted if k in x} for x in items]
assert multiset(expected_items) == multiset(got_items)
# Test which keys we can Query by in a table with both a hash key and a
# sort key: we can Query by the hash key alone, or by a combination of the
# hash and sort keys, but *cannot* Query by just the sort key, and obviously
# not by any non-key column.
def test_query_which_key(test_table):
p = random_string()
c = random_string()
p2 = random_string()
c2 = random_string()
item1 = {'p': p, 'c': c}
item2 = {'p': p, 'c': c2}
item3 = {'p': p2, 'c': c}
for i in [item1, item2, item3]:
test_table.put_item(Item=i)
# Query by hash key only:
got_items = full_query(test_table, KeyConditions={'p': {'AttributeValueList': [p], 'ComparisonOperator': 'EQ'}})
expected_items = [item1, item2]
assert multiset(expected_items) == multiset(got_items)
# Query by hash key *and* sort key (this is basically a GetItem):
got_items = full_query(test_table, KeyConditions={
'p': {'AttributeValueList': [p], 'ComparisonOperator': 'EQ'},
'c': {'AttributeValueList': [c], 'ComparisonOperator': 'EQ'}
})
expected_items = [item1]
assert multiset(expected_items) == multiset(got_items)
# Query by sort key alone is not allowed. DynamoDB reports:
# "Query condition missed key schema element: p".
with pytest.raises(ClientError, match='ValidationException'):
full_query(test_table, KeyConditions={
'c': {'AttributeValueList': [c], 'ComparisonOperator': 'EQ'}
})
# Query by a non-key isn't allowed, for the same reason - that the
# actual hash key (p) is missing in the query:
with pytest.raises(ClientError, match='ValidationException'):
full_query(test_table, KeyConditions={
'z': {'AttributeValueList': [c], 'ComparisonOperator': 'EQ'}
})
# If we try both p and a non-key we get a complaint that the sort
# key is missing: "Query condition missed key schema element: c"
with pytest.raises(ClientError, match='ValidationException'):
full_query(test_table, KeyConditions={
'p': {'AttributeValueList': [p], 'ComparisonOperator': 'EQ'},
'z': {'AttributeValueList': [c], 'ComparisonOperator': 'EQ'}
})
# If we try p, c and another key, we get an error that
# "Conditions can be of length 1 or 2 only".
with pytest.raises(ClientError, match='ValidationException'):
full_query(test_table, KeyConditions={
'p': {'AttributeValueList': [p], 'ComparisonOperator': 'EQ'},
'c': {'AttributeValueList': [c], 'ComparisonOperator': 'EQ'},
'z': {'AttributeValueList': [c], 'ComparisonOperator': 'EQ'}
})

@@ -1,191 +0,0 @@
# Copyright 2019 ScyllaDB
#
# This file is part of Scylla.
#
# Scylla is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# Scylla is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU Affero General Public License
# along with Scylla. If not, see <http://www.gnu.org/licenses/>.
# Tests for the Scan operation
import pytest
from botocore.exceptions import ClientError
from util import random_string, full_scan, multiset
from boto3.dynamodb.conditions import Attr
# Test that scanning works fine with/without pagination
def test_scan_basic(filled_test_table):
test_table, items = filled_test_table
for limit in [None,1,2,4,33,50,100,9007,16*1024*1024]:
pos = None
got_items = []
while True:
if limit:
response = test_table.scan(Limit=limit, ExclusiveStartKey=pos) if pos else test_table.scan(Limit=limit)
assert len(response['Items']) <= limit
else:
response = test_table.scan(ExclusiveStartKey=pos) if pos else test_table.scan()
pos = response.get('LastEvaluatedKey', None)
got_items += response['Items']
if not pos:
break
assert len(items) == len(got_items)
assert multiset(items) == multiset(got_items)
def test_scan_with_paginator(dynamodb, filled_test_table):
test_table, items = filled_test_table
paginator = dynamodb.meta.client.get_paginator('scan')
got_items = []
for page in paginator.paginate(TableName=test_table.name):
got_items += page['Items']
assert len(items) == len(got_items)
assert multiset(items) == multiset(got_items)
for page_size in [1, 17, 1234]:
got_items = []
for page in paginator.paginate(TableName=test_table.name, PaginationConfig={'PageSize': page_size}):
got_items += page['Items']
assert len(items) == len(got_items)
assert multiset(items) == multiset(got_items)
# Although partitions are scanned in seemingly-random order, inside a
# partition items must be returned by Scan sorted in sort-key order.
# This test verifies this, for string sort key. We'll need separate
# tests for the other sort-key types (number and binary)
def test_scan_sort_order_string(filled_test_table):
test_table, items = filled_test_table
got_items = full_scan(test_table)
assert len(items) == len(got_items)
# Extract just the sort key ("c") from the partition "long"
items_long = [x['c'] for x in items if x['p'] == 'long']
got_items_long = [x['c'] for x in got_items if x['p'] == 'long']
# Verify that got_items_long are already sorted (in string order)
assert sorted(got_items_long) == got_items_long
# Verify that got_items_long are a sorted version of the expected items_long
assert sorted(items_long) == got_items_long
# Test Scan with the AttributesToGet parameter. Result should include the
# selected attributes only - if one wants the key attributes as well, one
# needs to select them explicitly. When no key attributes are selected,
# some items may have *none* of the selected attributes. Those items are
# returned too, as empty items - they are not outright missing.
def test_scan_attributes_to_get(dynamodb, filled_test_table):
table, items = filled_test_table
for wanted in [ ['another'], # only non-key attributes (one item doesn't have it!)
['c', 'another'], # a key attribute (sort key) and non-key
['p', 'c'], # entire key
['nonexistent'] # none of the items have this attribute!
]:
print(wanted)
got_items = full_scan(table, AttributesToGet=wanted)
expected_items = [{k: x[k] for k in wanted if k in x} for x in items]
assert multiset(expected_items) == multiset(got_items)
def test_scan_with_attribute_equality_filtering(dynamodb, filled_test_table):
table, items = filled_test_table
scan_filter = {
"attribute" : {
"AttributeValueList" : [ "xxxxx" ],
"ComparisonOperator": "EQ"
}
}
got_items = full_scan(table, ScanFilter=scan_filter)
expected_items = [item for item in items if "attribute" in item.keys() and item["attribute"] == "xxxxx" ]
assert multiset(expected_items) == multiset(got_items)
scan_filter = {
"another" : {
"AttributeValueList" : [ "y" ],
"ComparisonOperator": "EQ"
},
"attribute" : {
"AttributeValueList" : [ "xxxxx" ],
"ComparisonOperator": "EQ"
}
}
got_items = full_scan(table, ScanFilter=scan_filter)
expected_items = [item for item in items if "attribute" in item.keys() and "another" in item.keys() and item["attribute"] == "xxxxx" and item["another"] == "y" ]
assert multiset(expected_items) == multiset(got_items)
# Test that FilterExpression works as expected
@pytest.mark.xfail(reason="FilterExpression not supported yet")
def test_scan_filter_expression(filled_test_table):
test_table, items = filled_test_table
got_items = full_scan(test_table, FilterExpression=Attr("attribute").eq("xxxx"))
print(got_items)
assert multiset([item for item in items if 'attribute' in item.keys() and item['attribute'] == 'xxxx']) == multiset(got_items)
got_items = full_scan(test_table, FilterExpression=Attr("attribute").eq("xxxx") & Attr("another").eq("yy"))
print(got_items)
assert multiset([item for item in items if 'attribute' in item.keys() and 'another' in item.keys() and item['attribute'] == 'xxxx' and item['another'] == 'yy']) == multiset(got_items)
def test_scan_with_key_equality_filtering(dynamodb, filled_test_table):
table, items = filled_test_table
scan_filter_p = {
"p" : {
"AttributeValueList" : [ "7" ],
"ComparisonOperator": "EQ"
}
}
scan_filter_c = {
"c" : {
"AttributeValueList" : [ "9" ],
"ComparisonOperator": "EQ"
}
}
scan_filter_p_and_attribute = {
"p" : {
"AttributeValueList" : [ "7" ],
"ComparisonOperator": "EQ"
},
"attribute" : {
"AttributeValueList" : [ "x"*7 ],
"ComparisonOperator": "EQ"
}
}
scan_filter_c_and_another = {
"c" : {
"AttributeValueList" : [ "9" ],
"ComparisonOperator": "EQ"
},
"another" : {
"AttributeValueList" : [ "y"*16 ],
"ComparisonOperator": "EQ"
}
}
# Filtering on the hash key
got_items = full_scan(table, ScanFilter=scan_filter_p)
expected_items = [item for item in items if "p" in item.keys() and item["p"] == "7" ]
assert multiset(expected_items) == multiset(got_items)
# Filtering on the sort key
got_items = full_scan(table, ScanFilter=scan_filter_c)
expected_items = [item for item in items if "c" in item.keys() and item["c"] == "9"]
assert multiset(expected_items) == multiset(got_items)
# Filtering on the hash key and an attribute
got_items = full_scan(table, ScanFilter=scan_filter_p_and_attribute)
expected_items = [item for item in items if "p" in item.keys() and "attribute" in item.keys() and item["p"] == "7" and item["attribute"] == "x"*7]
assert multiset(expected_items) == multiset(got_items)
# Filtering on the sort key and an attribute
got_items = full_scan(table, ScanFilter=scan_filter_c_and_another)
expected_items = [item for item in items if "c" in item.keys() and "another" in item.keys() and item["c"] == "9" and item["another"] == "y"*16]
assert multiset(expected_items) == multiset(got_items)

@@ -1,121 +0,0 @@
# Copyright 2019 ScyllaDB
#
# This file is part of Scylla.
#
# Scylla is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# Scylla is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU Affero General Public License
# along with Scylla. If not, see <http://www.gnu.org/licenses/>.
# Various utility functions which are useful for multiple tests
import string
import random
import collections
import time
def random_string(length=10, chars=string.ascii_uppercase + string.digits):
return ''.join(random.choice(chars) for x in range(length))
def random_bytes(length=10):
return bytearray(random.getrandbits(8) for _ in range(length))
# Utility functions for scan and query into an array of items:
# TODO: make full_scan and full_query pass ConsistentRead=True by default, as
# they are not useful for tests without it!
def full_scan(table, **kwargs):
response = table.scan(**kwargs)
items = response['Items']
while 'LastEvaluatedKey' in response:
response = table.scan(ExclusiveStartKey=response['LastEvaluatedKey'], **kwargs)
items.extend(response['Items'])
return items
# Utility function for fetching the entire results of a query into an array of items
def full_query(table, **kwargs):
response = table.query(**kwargs)
items = response['Items']
while 'LastEvaluatedKey' in response:
response = table.query(ExclusiveStartKey=response['LastEvaluatedKey'], **kwargs)
items.extend(response['Items'])
return items
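The paging loop used by `full_scan` and `full_query` can be tried without a database. The sketch below is only an illustration: `FakeTable` is a hypothetical stand-in that is not part of boto3, and it uses an integer offset as `LastEvaluatedKey`, whereas a real response carries a dict of key attributes.

```python
# Hypothetical stand-in for a boto3 Table. It only mimics the paging contract:
# each response has 'Items', plus 'LastEvaluatedKey' when more pages remain.
class FakeTable:
    def __init__(self, items, page_size):
        self.items = items
        self.page_size = page_size

    def query(self, ExclusiveStartKey=None, **kwargs):
        # Assumption for this sketch: the key is simply the next offset.
        start = 0 if ExclusiveStartKey is None else ExclusiveStartKey
        end = start + self.page_size
        response = {'Items': self.items[start:end]}
        if end < len(self.items):
            response['LastEvaluatedKey'] = end
        return response

# Same loop as the helper above: keep querying until no LastEvaluatedKey.
def full_query(table, **kwargs):
    response = table.query(**kwargs)
    items = response['Items']
    while 'LastEvaluatedKey' in response:
        response = table.query(ExclusiveStartKey=response['LastEvaluatedKey'], **kwargs)
        items.extend(response['Items'])
    return items

table = FakeTable(list(range(10)), page_size=3)
all_items = full_query(table)
assert all_items == list(range(10))
```

The loop terminates only because the last page omits `LastEvaluatedKey`, which is the same contract the tests rely on.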
# To compare two lists of items (each is a dict) without regard for order,
# "==" is not good enough because it will fail if the order is different.
# The following function, multiset() converts the list into a multiset
# (set with duplicates) where order doesn't matter, so the multisets can
# be compared.
def freeze(item):
if isinstance(item, dict):
return frozenset((key, freeze(value)) for key, value in item.items())
elif isinstance(item, list):
return tuple(freeze(value) for value in item)
return item
def multiset(items):
return collections.Counter([freeze(item) for item in items])
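A short usage sketch of the two helpers above, with `freeze()` and `multiset()` repeated so the block runs on its own. The sample items are made up for illustration:

```python
import collections

def freeze(item):
    # Dicts become frozensets of (key, frozen value) pairs, so they are hashable.
    if isinstance(item, dict):
        return frozenset((key, freeze(value)) for key, value in item.items())
    elif isinstance(item, list):
        return tuple(freeze(value) for value in item)
    return item

def multiset(items):
    # Counter keeps duplicate counts but ignores the order of the list.
    return collections.Counter([freeze(item) for item in items])

# Illustrative data: same items, different order, one item duplicated.
a = [{'p': 'x', 'c': '1'}, {'p': 'x', 'c': '2'}, {'p': 'x', 'c': '2'}]
b = [{'c': '2', 'p': 'x'}, {'p': 'x', 'c': '1'}, {'c': '2', 'p': 'x'}]

assert a != b                          # plain list comparison is order-sensitive
assert multiset(a) == multiset(b)      # multiset comparison is not
assert multiset(a) != multiset(b[:2])  # but duplicate counts still matter
```

Note that nested lists are frozen into tuples, so their internal order remains significant; only the order of the top-level list of items is ignored.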
test_table_prefix = 'alternator_test_'
def test_table_name():
current_ms = int(round(time.time() * 1000))
# In the off chance that test_table_name() is called twice in the same millisecond...
if test_table_name.last_ms >= current_ms:
current_ms = test_table_name.last_ms + 1
test_table_name.last_ms = current_ms
return test_table_prefix + str(current_ms)
test_table_name.last_ms = 0
def create_test_table(dynamodb, **kwargs):
name = test_table_name()
print("fixture creating new table {}".format(name))
table = dynamodb.create_table(TableName=name,
BillingMode='PAY_PER_REQUEST', **kwargs)
waiter = table.meta.client.get_waiter('table_exists')
# Recheck every second instead of the default, lower, frequency. This can
# save a few seconds on AWS with its very slow table creation, but can save
# even more on tests on Scylla with its faster table creation turnaround.
waiter.config.delay = 1
waiter.config.max_attempts = 200
waiter.wait(TableName=name)
return table
# DynamoDB's ListTables request returns up to a single page of table names
# (e.g., up to 100) and it is up to the caller to call it again and again
# to get the next page. This is a utility function which calls it repeatedly
# as much as necessary to get the entire list.
# We deliberately return a list and not a set, because we want the caller
# to be able to recognize bugs in ListTables that cause the same table
# to be returned twice.
def list_tables(dynamodb, limit=100):
ret = []
pos = None
while True:
if pos:
page = dynamodb.meta.client.list_tables(Limit=limit, ExclusiveStartTableName=pos)
else:
page = dynamodb.meta.client.list_tables(Limit=limit)
results = page.get('TableNames', None)
assert(results)
ret = ret + results
newpos = page.get('LastEvaluatedTableName', None)
if not newpos:
break
# It doesn't make sense for Dynamo to tell us we need more pages, but
# not send anything in *this* page!
assert len(results) > 0
assert newpos != pos
# Note that we only checked that we got back tables, not that we got
# any new tables not already in ret. So a buggy implementation might
# still cause an endless loop getting the same tables again and again.
pos = newpos
return ret

@@ -66,8 +66,9 @@ static std::string format_time_point(db_clock::time_point tp) {
time_t time_point_repr = db_clock::to_time_t(tp);
std::string time_point_str;
time_point_str.resize(17);
::tm time_buf;
// strftime prints the terminating null character as well
std::strftime(time_point_str.data(), time_point_str.size(), "%Y%m%dT%H%M%SZ", std::gmtime(&time_point_repr));
std::strftime(time_point_str.data(), time_point_str.size(), "%Y%m%dT%H%M%SZ", ::gmtime_r(&time_point_repr, &time_buf));
time_point_str.resize(16);
return time_point_str;
}
@@ -129,7 +130,7 @@ future<std::string> get_key_from_roles(cql3::query_processor& qp, std::string us
auto cl = auth::password_authenticator::consistency_for_user(username);
auto timeout = auth::internal_distributed_timeout_config();
return qp.process(query, cl, timeout, {sstring(username)}, true).then_wrapped([username = std::move(username)] (future<::shared_ptr<cql3::untyped_result_set>> f) {
return qp.execute_internal(query, cl, timeout, {sstring(username)}, true).then_wrapped([username = std::move(username)] (future<::shared_ptr<cql3::untyped_result_set>> f) {
auto res = f.get0();
auto salted_hash = std::optional<sstring>();
if (res->empty()) {

@@ -29,6 +29,12 @@
#include "rjson.hh"
#include "serialization.hh"
#include "base64.hh"
#include <stdexcept>
#include <boost/algorithm/cxx11/all_of.hpp>
#include <boost/algorithm/cxx11/any_of.hpp>
#include "utils/overloaded_functor.hh"
#include "expressions_eval.hh"
namespace alternator {
@@ -47,7 +53,9 @@ comparison_operator_type get_comparison_operator(const rjson::value& comparison_
{"NOT_NULL", comparison_operator_type::NOT_NULL},
{"BETWEEN", comparison_operator_type::BETWEEN},
{"BEGINS_WITH", comparison_operator_type::BEGINS_WITH},
}; //TODO: CONTAINS
{"CONTAINS", comparison_operator_type::CONTAINS},
{"NOT_CONTAINS", comparison_operator_type::NOT_CONTAINS},
};
if (!comparison_operator.IsString()) {
throw api_error("ValidationException", format("Invalid comparison operator definition {}", rjson::print(comparison_operator)));
}
@@ -68,7 +76,7 @@ static ::shared_ptr<cql3::restrictions::single_column_restriction::contains> mak
}
static ::shared_ptr<cql3::restrictions::single_column_restriction::EQ> make_key_eq_restriction(const column_definition& cdef, const rjson::value& value) {
bytes raw_value = get_key_from_typed_value(value, cdef, type_to_string(cdef.type));
bytes raw_value = get_key_from_typed_value(value, cdef);
auto restriction_value = ::make_shared<cql3::constants::value>(cql3::raw_value::make_value(std::move(raw_value)));
return make_shared<cql3::restrictions::single_column_restriction::EQ>(cdef, std::move(restriction_value));
}
@@ -143,9 +151,44 @@ static void verify_operand_count(const rjson::value* array, const size_check& ex
}
}
struct rjson_engaged_ptr_comp {
bool operator()(const rjson::value* p1, const rjson::value* p2) const {
return rjson::single_value_comp()(*p1, *p2);
}
};
// It's not enough to compare underlying JSON objects when comparing sets,
// as internally they're stored in an array, and the order of elements is
// not important in set equality. See issue #5021
static bool check_EQ_for_sets(const rjson::value& set1, const rjson::value& set2) {
if (set1.Size() != set2.Size()) {
return false;
}
std::set<const rjson::value*, rjson_engaged_ptr_comp> set1_raw;
for (auto it = set1.Begin(); it != set1.End(); ++it) {
set1_raw.insert(&*it);
}
for (const auto& a : set2.GetArray()) {
if (set1_raw.count(&a) == 0) {
return false;
}
}
return true;
}
// Check if two JSON-encoded values match with the EQ relation
static bool check_EQ(const rjson::value* v1, const rjson::value& v2) {
return v1 && *v1 == v2;
if (!v1) {
return false;
}
if (v1->IsObject() && v1->MemberCount() == 1 && v2.IsObject() && v2.MemberCount() == 1) {
auto it1 = v1->MemberBegin();
auto it2 = v2.MemberBegin();
if ((it1->name == "SS" && it2->name == "SS") || (it1->name == "NS" && it2->name == "NS") || (it1->name == "BS" && it2->name == "BS")) {
return check_EQ_for_sets(it1->value, it2->value);
}
}
return *v1 == v2;
}
// Check if two JSON-encoded values match with the NE relation
@@ -174,9 +217,66 @@ static bool check_BEGINS_WITH(const rjson::value* v1, const rjson::value& v2) {
if (it1->name != it2->name) {
return false;
}
std::string_view val1(it1->value.GetString(), it1->value.GetStringLength());
std::string_view val2(it2->value.GetString(), it2->value.GetStringLength());
return val1.substr(0, val2.size()) == val2;
if (it2->name == "S") {
std::string_view val1(it1->value.GetString(), it1->value.GetStringLength());
std::string_view val2(it2->value.GetString(), it2->value.GetStringLength());
return val1.substr(0, val2.size()) == val2;
} else /* it2->name == "B" */ {
// TODO (optimization): Check the begins_with condition directly on
// the base64-encoded string, without making a decoded copy.
bytes val1 = base64_decode(it1->value);
bytes val2 = base64_decode(it2->value);
return val1.substr(0, val2.size()) == val2;
}
}
static bool is_set_of(const rjson::value& type1, const rjson::value& type2) {
return (type2 == "S" && type1 == "SS") || (type2 == "N" && type1 == "NS") || (type2 == "B" && type1 == "BS");
}
// Check if two JSON-encoded values match with the CONTAINS relation
bool check_CONTAINS(const rjson::value* v1, const rjson::value& v2) {
if (!v1) {
return false;
}
const auto& kv1 = *v1->MemberBegin();
const auto& kv2 = *v2.MemberBegin();
if (kv2.name != "S" && kv2.name != "N" && kv2.name != "B") {
throw api_error("ValidationException",
format("CONTAINS operator requires a single AttributeValue of type String, Number, or Binary, "
"got {} instead", kv2.name));
}
if (kv1.name == "S" && kv2.name == "S") {
return rjson::to_string_view(kv1.value).find(rjson::to_string_view(kv2.value)) != std::string_view::npos;
} else if (kv1.name == "B" && kv2.name == "B") {
return base64_decode(kv1.value).find(base64_decode(kv2.value)) != bytes::npos;
} else if (is_set_of(kv1.name, kv2.name)) {
for (auto i = kv1.value.Begin(); i != kv1.value.End(); ++i) {
if (*i == kv2.value) {
return true;
}
}
} else if (kv1.name == "L") {
for (auto i = kv1.value.Begin(); i != kv1.value.End(); ++i) {
if (!i->IsObject() || i->MemberCount() != 1) {
clogger.error("check_CONTAINS received a list whose element is malformed");
return false;
}
const auto& el = *i->MemberBegin();
if (el.name == kv2.name && el.value == kv2.value) {
return true;
}
}
}
return false;
}
// Check if two JSON-encoded values match with the NOT_CONTAINS relation
static bool check_NOT_CONTAINS(const rjson::value* v1, const rjson::value& v2) {
if (!v1) {
return false;
}
return !check_CONTAINS(v1, v2);
}
// Check if a JSON-encoded value equals any element of an array, which must have at least one element.
@@ -207,6 +307,19 @@ static bool check_IN(const rjson::value* val, const rjson::value& array) {
return have_match;
}
// Another variant of check_IN, this one for ConditionExpression. It needs to
// check whether the first element in the given vector is equal to any of the
// others.
static bool check_IN(const std::vector<rjson::value>& array) {
const rjson::value* first = &array[0];
for (unsigned i = 1; i < array.size(); i++) {
if (check_EQ(first, array[i])) {
return true;
}
}
return false;
}
static bool check_NULL(const rjson::value* val) {
return val == nullptr;
}
@@ -221,13 +334,13 @@ bool check_compare(const rjson::value* v1, const rjson::value& v2, const Compara
if (!v2.IsObject() || v2.MemberCount() != 1) {
throw api_error("ValidationException",
format("{} requires a single AttributeValue of type String, Number, or Binary",
cmp.diagnostic()));
cmp.diagnostic));
}
const auto& kv2 = *v2.MemberBegin();
if (kv2.name != "S" && kv2.name != "N" && kv2.name != "B") {
throw api_error("ValidationException",
format("{} requires a single AttributeValue of type String, Number, or Binary",
cmp.diagnostic()));
cmp.diagnostic));
}
if (!v1 || !v1->IsObject() || v1->MemberCount() != 1) {
return false;
@@ -237,7 +350,7 @@ bool check_compare(const rjson::value* v1, const rjson::value& v2, const Compara
return false;
}
if (kv1.name == "N") {
return cmp(unwrap_number(*v1, cmp.diagnostic()), unwrap_number(v2, cmp.diagnostic()));
return cmp(unwrap_number(*v1, cmp.diagnostic), unwrap_number(v2, cmp.diagnostic));
}
if (kv1.name == "S") {
return cmp(std::string_view(kv1.value.GetString(), kv1.value.GetStringLength()),
@@ -252,15 +365,84 @@ bool check_compare(const rjson::value* v1, const rjson::value& v2, const Compara
struct cmp_lt {
template <typename T> bool operator()(const T& lhs, const T& rhs) const { return lhs < rhs; }
const char* diagnostic() const { return "LT operator"; }
// We cannot use the normal comparison operators like "<" on the bytes
// type, because they treat individual bytes as signed but we need to
// compare them as *unsigned*. So we need a specialization for bytes.
bool operator()(const bytes& lhs, const bytes& rhs) const { return compare_unsigned(lhs, rhs) < 0; }
static constexpr const char* diagnostic = "LT operator";
};
struct cmp_le {
template <typename T> bool operator()(const T& lhs, const T& rhs) const { return lhs <= rhs; }
bool operator()(const bytes& lhs, const bytes& rhs) const { return compare_unsigned(lhs, rhs) <= 0; }
static constexpr const char* diagnostic = "LE operator";
};
struct cmp_ge {
template <typename T> bool operator()(const T& lhs, const T& rhs) const { return lhs >= rhs; }
bool operator()(const bytes& lhs, const bytes& rhs) const { return compare_unsigned(lhs, rhs) >= 0; }
static constexpr const char* diagnostic = "GE operator";
};
struct cmp_gt {
// bytes only has <
template <typename T> bool operator()(const T& lhs, const T& rhs) const { return rhs < lhs; }
const char* diagnostic() const { return "GT operator"; }
template <typename T> bool operator()(const T& lhs, const T& rhs) const { return lhs > rhs; }
bool operator()(const bytes& lhs, const bytes& rhs) const { return compare_unsigned(lhs, rhs) > 0; }
static constexpr const char* diagnostic = "GT operator";
};
// True if v is between lb and ub, inclusive. Throws if lb > ub.
template <typename T>
bool check_BETWEEN(const T& v, const T& lb, const T& ub) {
if (cmp_lt()(ub, lb)) {
throw api_error("ValidationException",
format("BETWEEN operator requires lower_bound <= upper_bound, but {} > {}", lb, ub));
}
return cmp_ge()(v, lb) && cmp_le()(v, ub);
}
static bool check_BETWEEN(const rjson::value* v, const rjson::value& lb, const rjson::value& ub) {
if (!v) {
return false;
}
if (!v->IsObject() || v->MemberCount() != 1) {
throw api_error("ValidationException", format("BETWEEN operator encountered malformed AttributeValue: {}", *v));
}
if (!lb.IsObject() || lb.MemberCount() != 1) {
throw api_error("ValidationException", format("BETWEEN operator encountered malformed AttributeValue: {}", lb));
}
if (!ub.IsObject() || ub.MemberCount() != 1) {
throw api_error("ValidationException", format("BETWEEN operator encountered malformed AttributeValue: {}", ub));
}
const auto& kv_v = *v->MemberBegin();
const auto& kv_lb = *lb.MemberBegin();
const auto& kv_ub = *ub.MemberBegin();
if (kv_lb.name != kv_ub.name) {
throw api_error(
"ValidationException",
format("BETWEEN operator requires the same type for lower and upper bound; instead got {} and {}",
kv_lb.name, kv_ub.name));
}
if (kv_v.name != kv_lb.name) { // Cannot compare different types, so v is NOT between lb and ub.
return false;
}
if (kv_v.name == "N") {
const char* diag = "BETWEEN operator";
return check_BETWEEN(unwrap_number(*v, diag), unwrap_number(lb, diag), unwrap_number(ub, diag));
}
if (kv_v.name == "S") {
return check_BETWEEN(std::string_view(kv_v.value.GetString(), kv_v.value.GetStringLength()),
std::string_view(kv_lb.value.GetString(), kv_lb.value.GetStringLength()),
std::string_view(kv_ub.value.GetString(), kv_ub.value.GetStringLength()));
}
if (kv_v.name == "B") {
return check_BETWEEN(base64_decode(kv_v.value), base64_decode(kv_lb.value), base64_decode(kv_ub.value));
}
throw api_error("ValidationException",
format("BETWEEN operator requires AttributeValueList elements to be of type String, Number, or Binary; instead got {}",
kv_lb.name));
}
// Verify one Expect condition on one attribute (whose content is "got")
// for the verify_expected() below.
// This function returns true or false depending on whether the condition
@@ -306,9 +488,15 @@ static bool verify_expected_one(const rjson::value& condition, const rjson::valu
case comparison_operator_type::LT:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_compare(got, (*attribute_value_list)[0], cmp_lt{});
case comparison_operator_type::LE:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_compare(got, (*attribute_value_list)[0], cmp_le{});
case comparison_operator_type::GT:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_compare(got, (*attribute_value_list)[0], cmp_gt{});
case comparison_operator_type::GE:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_compare(got, (*attribute_value_list)[0], cmp_ge{});
case comparison_operator_type::BEGINS_WITH:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_BEGINS_WITH(got, (*attribute_value_list)[0]);
@@ -321,23 +509,29 @@ static bool verify_expected_one(const rjson::value& condition, const rjson::valu
case comparison_operator_type::NOT_NULL:
verify_operand_count(attribute_value_list, empty(), *comparison_operator);
return check_NOT_NULL(got);
default:
// FIXME: implement all the missing types, so there will be no default here.
throw api_error("ValidationException", format("ComparisonOperator {} is not yet supported", *comparison_operator));
case comparison_operator_type::BETWEEN:
verify_operand_count(attribute_value_list, exact_size(2), *comparison_operator);
return check_BETWEEN(got, (*attribute_value_list)[0], (*attribute_value_list)[1]);
case comparison_operator_type::CONTAINS:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_CONTAINS(got, (*attribute_value_list)[0]);
case comparison_operator_type::NOT_CONTAINS:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_NOT_CONTAINS(got, (*attribute_value_list)[0]);
}
throw std::logic_error(format("Internal error: corrupted operator enum: {}", int(op)));
}
}
// Verify that the existing values of the item (previous_item) match the
// Check if the existing values of the item (previous_item) match the
// conditions given by the Expected and ConditionalOperator parameters
// (if they exist) in the request (an UpdateItem, PutItem or DeleteItem).
// This function will throw a ConditionalCheckFailedException API error
// if the values do not match the condition, or ValidationException if there
// This function can throw a ValidationException API error if there
// are errors in the format of the condition itself.
void verify_expected(const rjson::value& req, const std::unique_ptr<rjson::value>& previous_item) {
bool verify_expected(const rjson::value& req, const std::unique_ptr<rjson::value>& previous_item) {
const rjson::value* expected = rjson::find(req, "Expected");
if (!expected) {
return;
return true;
}
if (!expected->IsObject()) {
throw api_error("ValidationException", "'Expected' parameter, if given, must be an object");
@@ -366,22 +560,123 @@ void verify_expected(const rjson::value& req, const std::unique_ptr<rjson::value
for (auto it = expected->MemberBegin(); it != expected->MemberEnd(); ++it) {
const rjson::value* got = nullptr;
if (previous_item && previous_item->IsObject() && previous_item->HasMember("Item")) {
got = rjson::find((*previous_item)["Item"], rjson::string_ref_type(it->name.GetString()));
got = rjson::find((*previous_item)["Item"], rjson::to_string_view(it->name));
}
bool success = verify_expected_one(it->value, got);
if (success && !require_all) {
// When !require_all, one success is enough!
return;
return true;
} else if (!success && require_all) {
// When require_all, one failure is enough!
throw api_error("ConditionalCheckFailedException", "Failed condition.");
return false;
}
}
// If we got here and require_all, none of the checks failed, so succeed.
// If we got here and !require_all, all of the checks failed, so fail.
if (!require_all) {
throw api_error("ConditionalCheckFailedException", "None of ORed Expect conditions were successful.");
return require_all;
}
bool calculate_primitive_condition(const parsed::primitive_condition& cond,
std::unordered_set<std::string>& used_attribute_values,
std::unordered_set<std::string>& used_attribute_names,
const rjson::value& req,
schema_ptr schema,
const std::unique_ptr<rjson::value>& previous_item) {
std::vector<rjson::value> calculated_values;
calculated_values.reserve(cond._values.size());
for (const parsed::value& v : cond._values) {
calculated_values.push_back(calculate_value(v,
cond._op == parsed::primitive_condition::type::VALUE ?
calculate_value_caller::ConditionExpressionAlone :
calculate_value_caller::ConditionExpression,
rjson::find(req, "ExpressionAttributeValues"),
used_attribute_names, used_attribute_values,
req, schema, previous_item));
}
switch (cond._op) {
case parsed::primitive_condition::type::BETWEEN:
if (calculated_values.size() != 3) {
// Shouldn't happen unless we have a bug in the parser
throw std::logic_error(format("Wrong number of values {} in BETWEEN primitive_condition", cond._values.size()));
}
return check_BETWEEN(&calculated_values[0], calculated_values[1], calculated_values[2]);
case parsed::primitive_condition::type::IN:
return check_IN(calculated_values);
case parsed::primitive_condition::type::VALUE:
if (calculated_values.size() != 1) {
// Shouldn't happen unless we have a bug in the parser
throw std::logic_error(format("Unexpected number of values {} in primitive_condition", cond._values.size()));
}
// Unwrap the boolean wrapped as the value (if it is a boolean)
if (calculated_values[0].IsObject() && calculated_values[0].MemberCount() == 1) {
auto it = calculated_values[0].MemberBegin();
if (it->name == "BOOL" && it->value.IsBool()) {
return it->value.GetBool();
}
}
throw api_error("ValidationException",
format("ConditionExpression: condition results in a non-boolean value: {}",
calculated_values[0]));
default:
// All the rest of the operators have exactly two parameters (and unless
// we have a bug in the parser, that's what we have in the parsed object):
if (calculated_values.size() != 2) {
throw std::logic_error(format("Wrong number of values {} in primitive_condition object", cond._values.size()));
}
}
switch (cond._op) {
case parsed::primitive_condition::type::EQ:
return check_EQ(&calculated_values[0], calculated_values[1]);
case parsed::primitive_condition::type::NE:
return check_NE(&calculated_values[0], calculated_values[1]);
case parsed::primitive_condition::type::GT:
return check_compare(&calculated_values[0], calculated_values[1], cmp_gt{});
case parsed::primitive_condition::type::GE:
return check_compare(&calculated_values[0], calculated_values[1], cmp_ge{});
case parsed::primitive_condition::type::LT:
return check_compare(&calculated_values[0], calculated_values[1], cmp_lt{});
case parsed::primitive_condition::type::LE:
return check_compare(&calculated_values[0], calculated_values[1], cmp_le{});
default:
// Shouldn't happen unless we have a bug in the parser
throw std::logic_error(format("Unknown type {} in primitive_condition object", (int)(cond._op)));
}
}
// Check if the existing values of the item (previous_item) match the
// conditions given by the given parsed ConditionExpression.
bool verify_condition_expression(
const parsed::condition_expression& condition_expression,
std::unordered_set<std::string>& used_attribute_values,
std::unordered_set<std::string>& used_attribute_names,
const rjson::value& req,
schema_ptr schema,
const std::unique_ptr<rjson::value>& previous_item) {
if (condition_expression.empty()) {
return true;
}
bool ret = std::visit(overloaded_functor {
[&] (const parsed::primitive_condition& cond) -> bool {
return calculate_primitive_condition(cond, used_attribute_values,
used_attribute_names, req, schema, previous_item);
},
[&] (const parsed::condition_expression::condition_list& list) -> bool {
auto verify_condition = [&] (const parsed::condition_expression& e) {
return verify_condition_expression(e, used_attribute_values,
used_attribute_names, req, schema, previous_item);
};
switch (list.op) {
case '&':
return boost::algorithm::all_of(list.conditions, verify_condition);
case '|':
return boost::algorithm::any_of(list.conditions, verify_condition);
default:
// Shouldn't happen unless we have a bug in the parser
throw std::logic_error("bad operator in condition_list");
}
}
}, condition_expression._expression);
return condition_expression._negated ? !ret : ret;
}
}
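The evaluation order in verify_condition_expression() can be sketched in isolation. The following is a minimal sketch, not the code from this diff: `cond_node`, `leaf()`, `combine()`, `negate()` and `evaluate()` are hypothetical names, and a leaf holds an already-computed boolean instead of a parsed::primitive_condition. It assumes C++17.

```cpp
#include <algorithm>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical, flattened stand-in for parsed::condition_expression.
// A leaf carries an already-evaluated boolean; an inner node carries
// '&' or '|' and a list of children.
struct cond_node {
    bool negated = false;
    bool is_leaf = false;
    bool leaf_value = false;            // meaningful only when is_leaf
    char op = '|';                      // meaningful only when !is_leaf
    std::vector<cond_node> children;    // meaningful only when !is_leaf
};

inline cond_node leaf(bool v) {
    cond_node n;
    n.is_leaf = true;
    n.leaf_value = v;
    return n;
}

inline cond_node combine(char op, std::vector<cond_node> children) {
    cond_node n;
    n.op = op;
    n.children = std::move(children);
    return n;
}

inline cond_node negate(cond_node n) {
    n.negated = !n.negated;
    return n;
}

// Same shape as verify_condition_expression(): an empty expression is
// true, '&' requires all children, '|' requires at least one child, and
// the negation flag is applied last.
inline bool evaluate(const cond_node& n) {
    if (!n.is_leaf && n.children.empty()) {
        return true;
    }
    bool ret;
    if (n.is_leaf) {
        ret = n.leaf_value;
    } else {
        auto eval_child = [] (const cond_node& c) { return evaluate(c); };
        switch (n.op) {
        case '&':
            ret = std::all_of(n.children.begin(), n.children.end(), eval_child);
            break;
        case '|':
            ret = std::any_of(n.children.begin(), n.children.end(), eval_child);
            break;
        default:
            throw std::logic_error("bad operator in condition list");
        }
    }
    return n.negated ? !ret : ret;
}
```

Both std::all_of and std::any_of stop at the first child that decides the result, so later children are not evaluated in that case; boost::algorithm::all_of and any_of in the diff are expected to behave the same way.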


@@ -37,13 +37,13 @@
namespace alternator {
enum class comparison_operator_type {
EQ, NE, LE, LT, GE, GT, IN, BETWEEN, CONTAINS, IS_NULL, NOT_NULL, BEGINS_WITH
EQ, NE, LE, LT, GE, GT, IN, BETWEEN, CONTAINS, NOT_CONTAINS, IS_NULL, NOT_NULL, BEGINS_WITH
};
comparison_operator_type get_comparison_operator(const rjson::value& comparison_operator);
::shared_ptr<cql3::restrictions::statement_restrictions> get_filtering_restrictions(schema_ptr schema, const column_definition& attrs_col, const rjson::value& query_filter);
void verify_expected(const rjson::value& req, const std::unique_ptr<rjson::value>& previous_item);
bool verify_expected(const rjson::value& req, const std::unique_ptr<rjson::value>& previous_item);
}

File diff suppressed because it is too large


@@ -25,47 +25,58 @@
#include <seastar/http/httpd.hh>
#include "seastarx.hh"
#include <seastar/json/json_elements.hh>
#include <seastar/core/sharded.hh>
#include "service/storage_proxy.hh"
#include "service/migration_manager.hh"
#include "service/client_state.hh"
#include "alternator/error.hh"
#include "stats.hh"
#include "rjson.hh"
namespace alternator {
class executor {
class executor : public peering_sharded_service<executor> {
service::storage_proxy& _proxy;
service::migration_manager& _mm;
// An smp_service_group to be used for limiting the concurrency when
// forwarding Alternator request between shards - if necessary for LWT.
smp_service_group _ssg;
public:
using client_state = service::client_state;
using request_return_type = std::variant<json::json_return_type, api_error>;
stats _stats;
static constexpr auto ATTRS_COLUMN_NAME = ":attrs";
static constexpr auto KEYSPACE_NAME = "alternator";
static constexpr auto KEYSPACE_NAME_PREFIX = "alternator_";
executor(service::storage_proxy& proxy, service::migration_manager& mm) : _proxy(proxy), _mm(mm) {}
executor(service::storage_proxy& proxy, service::migration_manager& mm, smp_service_group ssg)
: _proxy(proxy), _mm(mm), _ssg(ssg) {}
future<json::json_return_type> create_table(client_state& client_state, std::string content);
future<json::json_return_type> describe_table(client_state& client_state, std::string content);
future<json::json_return_type> delete_table(client_state& client_state, std::string content);
future<json::json_return_type> put_item(client_state& client_state, std::string content);
future<json::json_return_type> get_item(client_state& client_state, std::string content);
future<json::json_return_type> delete_item(client_state& client_state, std::string content);
future<json::json_return_type> update_item(client_state& client_state, std::string content);
future<json::json_return_type> list_tables(client_state& client_state, std::string content);
future<json::json_return_type> scan(client_state& client_state, std::string content);
future<json::json_return_type> describe_endpoints(client_state& client_state, std::string content, std::string host_header);
future<json::json_return_type> batch_write_item(client_state& client_state, std::string content);
future<json::json_return_type> batch_get_item(client_state& client_state, std::string content);
future<json::json_return_type> query(client_state& client_state, std::string content);
future<request_return_type> create_table(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
future<request_return_type> describe_table(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
future<request_return_type> delete_table(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
future<request_return_type> put_item(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
future<request_return_type> get_item(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
future<request_return_type> delete_item(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
future<request_return_type> update_item(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
future<request_return_type> list_tables(client_state& client_state, service_permit permit, rjson::value request);
future<request_return_type> scan(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
future<request_return_type> describe_endpoints(client_state& client_state, service_permit permit, rjson::value request, std::string host_header);
future<request_return_type> batch_write_item(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
future<request_return_type> batch_get_item(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
future<request_return_type> query(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
future<request_return_type> tag_resource(client_state& client_state, service_permit permit, rjson::value request);
future<request_return_type> untag_resource(client_state& client_state, service_permit permit, rjson::value request);
future<request_return_type> list_tags_of_resource(client_state& client_state, service_permit permit, rjson::value request);
future<> start();
future<> stop() { return make_ready_future<>(); }
future<> maybe_create_keyspace();
future<> create_keyspace(std::string_view keyspace_name);
static void maybe_trace_query(client_state& client_state, sstring_view op, sstring_view query);
static tracing::trace_state_ptr maybe_trace_query(client_state& client_state, sstring_view op, sstring_view query);
};
}


@@ -22,6 +22,7 @@
#include "expressions.hh"
#include "alternator/expressionsLexer.hpp"
#include "alternator/expressionsParser.hpp"
#include "utils/overloaded_functor.hh"
#include <seastarx.hh>
@@ -65,13 +66,19 @@ parse_projection_expression(std::string query) {
}
}
template<class... Ts> struct overloaded : Ts... { using Ts::operator()...; };
template<class... Ts> overloaded(Ts...) -> overloaded<Ts...>;
parsed::condition_expression
parse_condition_expression(std::string query) {
try {
return do_with_parser(query, std::mem_fn(&expressionsParser::condition_expression));
} catch (...) {
throw expressions_syntax_error(format("Failed parsing ConditionExpression '{}': {}", query, std::current_exception()));
}
}
namespace parsed {
void update_expression::add(update_expression::action a) {
std::visit(overloaded {
std::visit(overloaded_functor {
[&] (action::set&) { seen_set = true; },
[&] (action::remove&) { seen_remove = true; },
[&] (action::add&) { seen_add = true; },
@@ -94,5 +101,27 @@ void update_expression::append(update_expression other) {
seen_del |= other.seen_del;
}
void condition_expression::append(condition_expression&& a, char op) {
std::visit(overloaded_functor {
[&] (condition_list& x) {
// If 'a' has a single condition, we could insert that single
// condition (possibly negated if a._negated) instead of inserting 'a'
// itself. But since we don't evaluate these expressions many
// times, this optimization is not worth the extra code complexity.
if (!x.conditions.empty() && x.op != op) {
// Shouldn't happen unless we have a bug in the parser
throw std::logic_error("condition_expression::append called with mixed operators");
}
x.conditions.push_back(std::move(a));
x.op = op;
},
[&] (primitive_condition& x) {
// Shouldn't happen unless we have a bug in the parser
throw std::logic_error("condition_expression::append called on primitive_condition");
}
}, _expression);
}
} // namespace parsed
} // namespace alternator
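The std::visit calls in this file dispatch on the active alternative of a variant through a set of lambdas. A minimal sketch of that pattern follows. It assumes that `overloaded_functor` from utils/overloaded_functor.hh has the same shape as the local `overloaded` helper this diff removes (the header itself is not shown here); `overloaded_sketch`, `set_action`, `remove_action` and `describe()` are hypothetical names. It requires C++17.

```cpp
#include <string>
#include <variant>

// Same shape as the `overloaded` helper removed by this diff. That
// overloaded_functor is defined identically is an assumption.
template<class... Ts> struct overloaded_sketch : Ts... { using Ts::operator()...; };
template<class... Ts> overloaded_sketch(Ts...) -> overloaded_sketch<Ts...>;

// Hypothetical alternatives standing in for action::set, action::remove, ...
struct set_action { std::string target; };
struct remove_action { std::string target; };

using action = std::variant<set_action, remove_action>;

// std::visit picks the lambda whose parameter type matches the
// alternative currently held by the variant.
inline std::string describe(const action& a) {
    return std::visit(overloaded_sketch {
        [] (const set_action& s) { return "SET " + s.target; },
        [] (const remove_action& r) { return "REMOVE " + r.target; },
    }, a);
}
```

If an alternative has no matching lambda, the call fails to compile, which is one reason to prefer this over a chain of std::holds_alternative checks.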


@@ -145,6 +145,12 @@ REMOVE: R E M O V E;
ADD: A D D;
DELETE: D E L E T E;
AND: A N D;
OR: O R;
NOT: N O T;
BETWEEN: B E T W E E N;
IN: I N;
fragment ALPHA: 'A'..'Z' | 'a'..'z';
fragment DIGIT: '0'..'9';
fragment ALNUM: ALPHA | DIGIT | '_';
@@ -165,19 +171,19 @@ path returns [parsed::path p]:
| '[' INTEGER ']' { $p.add_index(std::stoi($INTEGER.text)); }
)*;
update_expression_set_value returns [parsed::value v]:
VALREF { $v.set_valref($VALREF.text); }
| path { $v.set_path($path.p); }
| NAME { $v.set_func_name($NAME.text); }
'(' x=update_expression_set_value { $v.add_func_parameter($x.v); }
(',' x=update_expression_set_value { $v.add_func_parameter($x.v); })*
value returns [parsed::value v]:
VALREF { $v.set_valref($VALREF.text); }
| path { $v.set_path($path.p); }
| NAME { $v.set_func_name($NAME.text); }
'(' x=value { $v.add_func_parameter($x.v); }
(',' x=value { $v.add_func_parameter($x.v); })*
')'
;
update_expression_set_rhs returns [parsed::set_rhs rhs]:
v=update_expression_set_value { $rhs.set_value(std::move($v.v)); }
( '+' v=update_expression_set_value { $rhs.set_plus(std::move($v.v)); }
| '-' v=update_expression_set_value { $rhs.set_minus(std::move($v.v)); }
v=value { $rhs.set_value(std::move($v.v)); }
( '+' v=value { $rhs.set_plus(std::move($v.v)); }
| '-' v=value { $rhs.set_minus(std::move($v.v)); }
)?
;
@@ -212,3 +218,48 @@ update_expression returns [parsed::update_expression e]:
projection_expression returns [std::vector<parsed::path> v]:
p=path { $v.push_back(std::move($p.p)); }
(',' p=path { $v.push_back(std::move($p.p)); } )* EOF;
primitive_condition returns [parsed::primitive_condition c]:
v=value { $c.add_value(std::move($v.v));
$c.set_operator(parsed::primitive_condition::type::VALUE); }
( ( '=' { $c.set_operator(parsed::primitive_condition::type::EQ); }
| '<' '>' { $c.set_operator(parsed::primitive_condition::type::NE); }
| '<' { $c.set_operator(parsed::primitive_condition::type::LT); }
| '<' '=' { $c.set_operator(parsed::primitive_condition::type::LE); }
| '>' { $c.set_operator(parsed::primitive_condition::type::GT); }
| '>' '=' { $c.set_operator(parsed::primitive_condition::type::GE); }
)
v=value { $c.add_value(std::move($v.v)); }
| BETWEEN { $c.set_operator(parsed::primitive_condition::type::BETWEEN); }
v=value { $c.add_value(std::move($v.v)); }
AND
v=value { $c.add_value(std::move($v.v)); }
| IN '(' { $c.set_operator(parsed::primitive_condition::type::IN); }
v=value { $c.add_value(std::move($v.v)); }
(',' v=value { $c.add_value(std::move($v.v)); })*
')'
)?
;
// The following rules for parsing boolean expressions are verbose and
// somewhat strange because of Antlr 3's limitations on recursive rules,
// common rule prefixes, and (lack of) support for operator precedence.
// These rules could have been written more clearly using a more powerful
// parser generator - such as Yacc.
boolean_expression returns [parsed::condition_expression e]:
b=boolean_expression_1 { $e.append(std::move($b.e), '|'); }
(OR b=boolean_expression_1 { $e.append(std::move($b.e), '|'); } )*
;
boolean_expression_1 returns [parsed::condition_expression e]:
b=boolean_expression_2 { $e.append(std::move($b.e), '&'); }
(AND b=boolean_expression_2 { $e.append(std::move($b.e), '&'); } )*
;
boolean_expression_2 returns [parsed::condition_expression e]:
p=primitive_condition { $e.set_primitive(std::move($p.c)); }
| NOT b=boolean_expression_2 { $e = std::move($b.e); $e.apply_not(); }
| '(' b=boolean_expression ')' { $e = std::move($b.e); }
;
condition_expression returns [parsed::condition_expression e]:
boolean_expression { e=std::move($boolean_expression.e); } EOF;
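The three boolean_expression rules above encode operator precedence by nesting: OR is handled at the outermost level, AND one level down, and NOT and parentheses at the innermost level, so NOT binds tighter than AND, which binds tighter than OR. A hand-written recursive-descent sketch of the same layering is shown below. It is illustrative only: `bool_parser` and `eval_bool()` are hypothetical names, the operands are the literals T and F rather than primitive conditions, and keywords are upper-case only, unlike the case-insensitive lexer rules.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

class bool_parser {
    std::vector<std::string> _tokens;
    std::size_t _pos = 0;

    bool at(const std::string& t) const {
        return _pos < _tokens.size() && _tokens[_pos] == t;
    }
    // boolean_expression: boolean_expression_1 (OR boolean_expression_1)*
    bool parse_or() {
        bool v = parse_and();
        while (at("OR")) {
            ++_pos;
            bool rhs = parse_and();   // always consume the operand
            v = v || rhs;
        }
        return v;
    }
    // boolean_expression_1: boolean_expression_2 (AND boolean_expression_2)*
    bool parse_and() {
        bool v = parse_unary();
        while (at("AND")) {
            ++_pos;
            bool rhs = parse_unary();
            v = v && rhs;
        }
        return v;
    }
    // boolean_expression_2: operand | NOT boolean_expression_2
    //                     | '(' boolean_expression ')'
    bool parse_unary() {
        if (at("NOT")) {
            ++_pos;
            return !parse_unary();
        }
        if (at("(")) {
            ++_pos;
            bool v = parse_or();
            if (!at(")")) {
                throw std::invalid_argument("expected )");
            }
            ++_pos;
            return v;
        }
        if (at("T")) { ++_pos; return true; }
        if (at("F")) { ++_pos; return false; }
        throw std::invalid_argument("expected operand");
    }
public:
    explicit bool_parser(const std::string& text) {
        // Split on whitespace, treating parentheses as separate tokens.
        std::string cur;
        auto flush = [&] {
            if (!cur.empty()) { _tokens.push_back(cur); cur.clear(); }
        };
        for (char c : text) {
            if (std::isspace(static_cast<unsigned char>(c))) {
                flush();
            } else if (c == '(' || c == ')') {
                flush();
                _tokens.push_back(std::string(1, c));
            } else {
                cur.push_back(c);
            }
        }
        flush();
    }
    bool evaluate() {
        _pos = 0;
        bool v = parse_or();
        if (_pos != _tokens.size()) {   // plays the role of EOF in the grammar
            throw std::invalid_argument("trailing tokens");
        }
        return v;
    }
};

inline bool eval_bool(const std::string& text) {
    return bool_parser(text).evaluate();
}
```

For example, `eval_bool("T OR F AND F")` groups as `T OR (F AND F)`, and `eval_bool("NOT T AND F")` groups as `(NOT T) AND F`.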


@@ -36,6 +36,6 @@ public:
parsed::update_expression parse_update_expression(std::string query);
std::vector<parsed::path> parse_projection_expression(std::string query);
parsed::condition_expression parse_condition_expression(std::string query);
} /* namespace alternator */


@@ -0,0 +1,78 @@
/*
* Copyright 2020 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU Affero General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <string>
#include <unordered_set>
#include "rjson.hh"
#include "schema_fwd.hh"
#include "expressions_types.hh"
namespace alternator {
// calculate_value() behaves slightly differently (in particular, it
// supports different functions) when used in different types of
// expressions, as enumerated in this enum:
enum class calculate_value_caller {
UpdateExpression, ConditionExpression, ConditionExpressionAlone
};
inline std::ostream& operator<<(std::ostream& out, calculate_value_caller caller) {
switch (caller) {
case calculate_value_caller::UpdateExpression:
out << "UpdateExpression";
break;
case calculate_value_caller::ConditionExpression:
out << "ConditionExpression";
break;
case calculate_value_caller::ConditionExpressionAlone:
out << "ConditionExpression";
break;
default:
out << "unknown type of expression";
break;
}
return out;
}
bool check_CONTAINS(const rjson::value* v1, const rjson::value& v2);
rjson::value calculate_value(const parsed::value& v,
calculate_value_caller caller,
const rjson::value* expression_attribute_values,
std::unordered_set<std::string>& used_attribute_names,
std::unordered_set<std::string>& used_attribute_values,
const rjson::value& update_info,
schema_ptr schema,
const std::unique_ptr<rjson::value>& previous_item);
bool verify_condition_expression(
const parsed::condition_expression& condition_expression,
std::unordered_set<std::string>& used_attribute_values,
std::unordered_set<std::string>& used_attribute_names,
const rjson::value& req,
schema_ptr schema,
const std::unique_ptr<rjson::value>& previous_item);
} /* namespace alternator */


@@ -88,6 +88,15 @@ struct value {
void add_func_parameter(value v) {
std::get<function_call>(_value)._parameters.emplace_back(std::move(v));
}
bool is_valref() const {
return std::holds_alternative<std::string>(_value);
}
bool is_path() const {
return std::holds_alternative<path>(_value);
}
bool is_func() const {
return std::holds_alternative<function_call>(_value);
}
};
// The right-hand-side of a SET in an update expression can be either a
@@ -162,5 +171,58 @@ public:
}
};
// A primitive_condition is a condition expression involving one condition,
// while the full condition_expression below adds boolean logic over these
// primitive conditions.
// The supported primitive conditions are:
// 1. Binary operators - v1 OP v2, where OP is =, <>, <, <=, >, or >= and
// v1 and v2 are values - from the item (an attribute path), the query
// (a ":val" reference), or a function of the above (only the size()
// function is supported).
// 2. Ternary operator - v1 BETWEEN v2 and v3 (means v1 >= v2 AND v1 <= v3).
// 3. N-ary operator - v1 IN ( v2, v3, ... )
// 4. A single function call (attribute_exists etc.). The parser actually
// accepts a more general "value" here but later stages reject a value
// which is not a function call (because DynamoDB does it too).
class primitive_condition {
public:
enum class type {
UNDEFINED, VALUE, EQ, NE, LT, LE, GT, GE, BETWEEN, IN
};
type _op = type::UNDEFINED;
std::vector<value> _values;
void set_operator(type op) {
_op = op;
}
void add_value(value&& v) {
_values.push_back(std::move(v));
}
bool empty() const {
return _op == type::UNDEFINED;
}
};
class condition_expression {
public:
bool _negated = false; // If true, the entire condition is negated
struct condition_list {
char op = '|'; // '&' or '|'
std::vector<condition_expression> conditions;
};
std::variant<primitive_condition, condition_list> _expression = condition_list();
void set_primitive(primitive_condition&& p) {
_expression = std::move(p);
}
void append(condition_expression&& c, char op);
void apply_not() {
_negated = !_negated;
}
bool empty() const {
return std::holds_alternative<condition_list>(_expression) &&
std::get<condition_list>(_expression).conditions.empty();
}
};
} // namespace parsed
} // namespace alternator


@@ -22,14 +22,108 @@
#include "rjson.hh"
#include "error.hh"
#include <seastar/core/print.hh>
#include <seastar/core/thread.hh>
namespace rjson {
static allocator the_allocator;
/*
* This wrapper class adds nested level checks to rapidjson's handlers.
* Each rapidjson handler implements functions for accepting JSON values,
* which includes strings, numbers, objects, arrays, etc.
* Parsing objects and arrays needs to be performed carefully with regard
* to stack overflow - each object/array layer adds another stack frame
* to parsing, printing and destroying the parent JSON document.
* To prevent stack overflow, a rapidjson handler can be wrapped with
* guarded_yieldable_json_handler, which accepts an additional max_nested_level parameter.
* If the input exceeds the max nested level, an rjson::error is thrown.
*/
template<typename Handler, bool EnableYield>
struct guarded_yieldable_json_handler : public Handler {
size_t _nested_level = 0;
size_t _max_nested_level;
public:
using handler_base = Handler;
explicit guarded_yieldable_json_handler(size_t max_nested_level) : _max_nested_level(max_nested_level) {}
guarded_yieldable_json_handler(string_buffer& buf, size_t max_nested_level)
: handler_base(buf), _max_nested_level(max_nested_level) {}
void Parse(const char* str, size_t length) {
rapidjson::MemoryStream ms(static_cast<const char*>(str), length * sizeof(typename encoding::Ch));
rapidjson::EncodedInputStream<encoding, rapidjson::MemoryStream> is(ms);
rapidjson::GenericReader<encoding, encoding, allocator> reader(&the_allocator);
reader.Parse(is, *this);
if (reader.HasParseError()) {
throw rjson::error(format("Parsing JSON failed: {}", rapidjson::GetParseError_En(reader.GetParseErrorCode())));
}
//NOTICE: The handler has parsed the string, but in case of rapidjson::GenericDocument
// the data now resides in an internal stack_ variable, which is private instead of
// protected... which means we cannot simply access its data. Fortunately, another
// function for populating documents from SAX events can be abused to extract the data
// from the stack via gadget-oriented programming - we use an empty event generator
// which does nothing, and use it to call Populate(), which assumes that the generator
// will fill the stack with something. It won't, but our stack is already filled with
// data we want to steal, so once Populate() ends, our document will be properly parsed.
// A proper solution could be programmed once rapidjson declares this stack_ variable
// as protected instead of private, so that this class can access it.
auto dummy_generator = [](handler_base&){return true;};
handler_base::Populate(dummy_generator);
}
bool StartObject() {
++_nested_level;
check_nested_level();
maybe_yield();
return handler_base::StartObject();
}
bool EndObject(rapidjson::SizeType elements_count = 0) {
--_nested_level;
return handler_base::EndObject(elements_count);
}
bool StartArray() {
++_nested_level;
check_nested_level();
maybe_yield();
return handler_base::StartArray();
}
bool EndArray(rapidjson::SizeType elements_count = 0) {
--_nested_level;
return handler_base::EndArray(elements_count);
}
bool Null() { maybe_yield(); return handler_base::Null(); }
bool Bool(bool b) { maybe_yield(); return handler_base::Bool(b); }
bool Int(int i) { maybe_yield(); return handler_base::Int(i); }
bool Uint(unsigned u) { maybe_yield(); return handler_base::Uint(u); }
bool Int64(int64_t i64) { maybe_yield(); return handler_base::Int64(i64); }
bool Uint64(uint64_t u64) { maybe_yield(); return handler_base::Uint64(u64); }
bool Double(double d) { maybe_yield(); return handler_base::Double(d); }
bool String(const value::Ch* str, size_t length, bool copy = false) { maybe_yield(); return handler_base::String(str, length, copy); }
bool Key(const value::Ch* str, size_t length, bool copy = false) { maybe_yield(); return handler_base::Key(str, length, copy); }
protected:
static void maybe_yield() {
if constexpr (EnableYield) {
thread::maybe_yield();
}
}
void check_nested_level() const {
if (RAPIDJSON_UNLIKELY(_nested_level > _max_nested_level)) {
throw rjson::error(format("Max nested level reached: {}", _max_nested_level));
}
}
};
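The nesting guard above can be illustrated without rapidjson. The sketch below is a simplified stand-in under stated assumptions: `counting_handler`, `depth_guarded` and `exceeds_depth()` are hypothetical names, only array events are modelled, std::runtime_error replaces rjson::error, and the yielding part of the wrapper is left out.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical event sink standing in for a rapidjson handler; it only
// counts the events it receives.
struct counting_handler {
    std::size_t events = 0;
    bool StartArray() { ++events; return true; }
    bool EndArray() { ++events; return true; }
};

// Wraps any handler exposing StartArray()/EndArray() and rejects input
// nested deeper than max_nested_level.
template <typename Handler>
struct depth_guarded : Handler {
    std::size_t nested_level = 0;
    std::size_t max_nested_level;

    explicit depth_guarded(std::size_t max) : max_nested_level(max) {}

    bool StartArray() {
        ++nested_level;
        if (nested_level > max_nested_level) {
            throw std::runtime_error("Max nested level reached: "
                    + std::to_string(max_nested_level));
        }
        return Handler::StartArray();
    }
    bool EndArray() {
        --nested_level;
        return Handler::EndArray();
    }
};

// Feeds one event per bracket and reports whether the guard tripped.
inline bool exceeds_depth(const std::string& brackets, std::size_t max) {
    depth_guarded<counting_handler> h(max);
    try {
        for (char c : brackets) {
            if (c == '[') {
                h.StartArray();
            } else if (c == ']') {
                h.EndArray();
            }
        }
    } catch (const std::runtime_error&) {
        return true;
    }
    return false;
}
```

Because the level is decremented on every closing event, only the depth of nesting counts against the limit, not the total number of arrays: `"[][][]"` passes with a limit of 1, while `"[[[]]]"` needs a limit of at least 3.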
std::string print(const rjson::value& value) {
string_buffer buffer;
writer writer(buffer);
guarded_yieldable_json_handler<writer, false> writer(buffer, 39);
value.Accept(writer);
return std::string(buffer.GetString());
}
@@ -38,13 +132,9 @@ rjson::value copy(const rjson::value& value) {
return rjson::value(value, the_allocator);
}
rjson::value parse(const std::string& str) {
return parse_raw(str.c_str(), str.size());
}
rjson::value parse_raw(const char* c_str, size_t size) {
rjson::document d;
d.Parse(c_str, size);
rjson::value parse(std::string_view str) {
guarded_yieldable_json_handler<document, false> d(39);
d.Parse(str.data(), str.size());
if (d.HasParseError()) {
throw rjson::error(format("Parsing JSON failed: {}", GetParseError_En(d.GetParseError())));
}
@@ -52,8 +142,22 @@ rjson::value parse_raw(const char* c_str, size_t size) {
return std::move(v);
}
rjson::value& get(rjson::value& value, rjson::string_ref_type name) {
auto member_it = value.FindMember(name);
rjson::value parse_yieldable(std::string_view str) {
guarded_yieldable_json_handler<document, true> d(39);
d.Parse(str.data(), str.size());
if (d.HasParseError()) {
throw rjson::error(format("Parsing JSON failed: {}", GetParseError_En(d.GetParseError())));
}
rjson::value& v = d;
return std::move(v);
}
rjson::value& get(rjson::value& value, std::string_view name) {
// Although FindMember() has a variant taking a StringRef, it ignores the
// given length (see https://github.com/Tencent/rapidjson/issues/1649).
// Luckily, the variant taking a GenericValue doesn't share this bug,
// and we can create a string GenericValue without copying the string.
auto member_it = value.FindMember(rjson::value(name.data(), name.size()));
if (member_it != value.MemberEnd())
return member_it->value;
else {
@@ -61,8 +165,8 @@ rjson::value& get(rjson::value& value, rjson::string_ref_type name) {
}
}
const rjson::value& get(const rjson::value& value, rjson::string_ref_type name) {
auto member_it = value.FindMember(name);
const rjson::value& get(const rjson::value& value, std::string_view name) {
auto member_it = value.FindMember(rjson::value(name.data(), name.size()));
if (member_it != value.MemberEnd())
return member_it->value;
else {
@@ -82,24 +186,48 @@ rjson::value from_string(const char* str, size_t size) {
return rjson::value(str, size, the_allocator);
}
const rjson::value* find(const rjson::value& value, string_ref_type name) {
auto member_it = value.FindMember(name);
rjson::value from_string(std::string_view view) {
return rjson::value(view.data(), view.size(), the_allocator);
}
const rjson::value* find(const rjson::value& value, std::string_view name) {
// Although FindMember() has a variant taking a StringRef, it ignores the
// given length (see https://github.com/Tencent/rapidjson/issues/1649).
// Luckily, the variant taking a GenericValue doesn't share this bug,
// and we can create a string GenericValue without copying the string.
auto member_it = value.FindMember(rjson::value(name.data(), name.size()));
return member_it != value.MemberEnd() ? &member_it->value : nullptr;
}
rjson::value* find(rjson::value& value, string_ref_type name) {
auto member_it = value.FindMember(name);
rjson::value* find(rjson::value& value, std::string_view name) {
auto member_it = value.FindMember(rjson::value(name.data(), name.size()));
return member_it != value.MemberEnd() ? &member_it->value : nullptr;
}
bool remove_member(rjson::value& value, std::string_view name) {
// Although RemoveMember() has a variant taking a StringRef, it ignores the
// given length (see https://github.com/Tencent/rapidjson/issues/1649).
// Luckily, the variant taking a GenericValue doesn't share this bug,
// and we can create a string GenericValue without copying the string.
return value.RemoveMember(rjson::value(name.data(), name.size()));
}
void set_with_string_name(rjson::value& base, const std::string& name, rjson::value&& member) {
base.AddMember(rjson::value(name.c_str(), name.size(), the_allocator), std::move(member), the_allocator);
}
void set_with_string_name(rjson::value& base, std::string_view name, rjson::value&& member) {
base.AddMember(rjson::value(name.data(), name.size(), the_allocator), std::move(member), the_allocator);
}
void set_with_string_name(rjson::value& base, const std::string& name, rjson::string_ref_type member) {
base.AddMember(rjson::value(name.c_str(), name.size(), the_allocator), rjson::value(member), the_allocator);
}
void set_with_string_name(rjson::value& base, std::string_view name, rjson::string_ref_type member) {
base.AddMember(rjson::value(name.data(), name.size(), the_allocator), rjson::value(member), the_allocator);
}
void set(rjson::value& base, rjson::string_ref_type name, rjson::value&& member) {
base.AddMember(name, std::move(member), the_allocator);
}
@@ -113,6 +241,58 @@ void push_back(rjson::value& base_array, rjson::value&& item) {
}
bool single_value_comp::operator()(const rjson::value& r1, const rjson::value& r2) const {
auto r1_type = r1.GetType();
auto r2_type = r2.GetType();
// null is the smallest type and is comparable with every other type; nothing is less than null
if (r1_type == rjson::type::kNullType || r2_type == rjson::type::kNullType) {
return r1_type < r2_type;
}
// only null, true, and false are comparable with each other, other types are not compatible
if (r1_type != r2_type) {
if (r1_type > rjson::type::kTrueType || r2_type > rjson::type::kTrueType) {
throw rjson::error(format("Types are not comparable: {} {}", r1, r2));
}
}
switch (r1_type) {
case rjson::type::kNullType:
// fall-through
case rjson::type::kFalseType:
// fall-through
case rjson::type::kTrueType:
return r1_type < r2_type;
case rjson::type::kObjectType:
throw rjson::error("Object type comparison is not supported");
case rjson::type::kArrayType:
throw rjson::error("Array type comparison is not supported");
case rjson::type::kStringType: {
const size_t r1_len = r1.GetStringLength();
const size_t r2_len = r2.GetStringLength();
size_t len = std::min(r1_len, r2_len);
int result = std::strncmp(r1.GetString(), r2.GetString(), len);
return result < 0 || (result == 0 && r1_len < r2_len);
}
case rjson::type::kNumberType: {
if (r1.IsInt() && r2.IsInt()) {
return r1.GetInt() < r2.GetInt();
} else if (r1.IsUint() && r2.IsUint()) {
return r1.GetUint() < r2.GetUint();
} else if (r1.IsInt64() && r2.IsInt64()) {
return r1.GetInt64() < r2.GetInt64();
} else if (r1.IsUint64() && r2.IsUint64()) {
return r1.GetUint64() < r2.GetUint64();
} else {
// it's safe to call GetDouble() on any number type
return r1.GetDouble() < r2.GetDouble();
}
}
default:
return false;
}
}
} // end namespace rjson
std::ostream& std::operator<<(std::ostream& os, const rjson::value& v) {
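Note on the string branch of single_value_comp above: it compares the common prefix first and lets the shorter string sort first on a tie. The sketch below reproduces only that branch on `std::string_view`; `string_less` is a hypothetical name used for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <string_view>

// Hypothetical stand-alone version of the string branch: compare at most
// min(len_a, len_b) characters, then break a tie by length.
inline bool string_less(std::string_view a, std::string_view b) {
    const std::size_t len = std::min(a.size(), b.size());
    const int result = std::strncmp(a.data(), b.data(), len);
    return result < 0 || (result == 0 && a.size() < b.size());
}
```

With this ordering "ab" sorts before "abc", and equal strings are not less than each other, which is what a strict weak ordering requires.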


@@ -104,38 +104,49 @@ inline rjson::value empty_string() {
// The representation is dense - without any redundant indentation.
std::string print(const rjson::value& value);
// Returns a string_view to the string held in a JSON value (which is
// assumed to hold a string, i.e., v.IsString() == true). This is a view
// to the existing data - no copying is done.
inline std::string_view to_string_view(const rjson::value& v) {
return std::string_view(v.GetString(), v.GetStringLength());
}
// Copies given JSON value - involves allocation
rjson::value copy(const rjson::value& value);
// Parses a JSON value from given string or raw character array.
// The string/char array liveness does not need to be persisted,
// as both parse() and parse_raw() will allocate member names and values.
// as parse() will allocate member names and values.
// Throws rjson::error if parsing failed.
rjson::value parse(const std::string& str);
rjson::value parse_raw(const char* c_str, size_t size);
rjson::value parse(std::string_view str);
// Needs to be run in thread context
rjson::value parse_yieldable(std::string_view str);
// Creates a JSON value (of JSON string type) out of internal string representations.
// The string value is copied, so str's liveness does not need to be persisted.
rjson::value from_string(const std::string& str);
rjson::value from_string(const sstring& str);
rjson::value from_string(const char* str, size_t size);
rjson::value from_string(std::string_view view);
// Returns a pointer to JSON member if it exists, nullptr otherwise
rjson::value* find(rjson::value& value, rjson::string_ref_type name);
const rjson::value* find(const rjson::value& value, rjson::string_ref_type name);
rjson::value* find(rjson::value& value, std::string_view name);
const rjson::value* find(const rjson::value& value, std::string_view name);
// Returns a reference to JSON member if it exists, throws otherwise
rjson::value& get(rjson::value& value, rjson::string_ref_type name);
const rjson::value& get(const rjson::value& value, rjson::string_ref_type name);
rjson::value& get(rjson::value& value, std::string_view name);
const rjson::value& get(const rjson::value& value, std::string_view name);
// Sets a member in given JSON object by moving the member - allocates the name.
// Throws if base is not a JSON object.
void set_with_string_name(rjson::value& base, const std::string& name, rjson::value&& member);
void set_with_string_name(rjson::value& base, std::string_view name, rjson::value&& member);
// Sets a string member in given JSON object by assigning its reference - allocates the name.
// NOTICE: member string liveness must be ensured to be at least as long as base's.
// Throws if base is not a JSON object.
void set_with_string_name(rjson::value& base, const std::string& name, rjson::string_ref_type member);
void set_with_string_name(rjson::value& base, std::string_view name, rjson::string_ref_type member);
// Sets a member in given JSON object by moving the member.
// NOTICE: name liveness must be ensured to be at least as long as base's.
@@ -152,6 +163,13 @@ void set(rjson::value& base, rjson::string_ref_type name, rjson::string_ref_type
// Throws if base_array is not a JSON array.
void push_back(rjson::value& base_array, rjson::value&& item);
// Remove a member from a JSON object. Throws if value isn't an object.
bool remove_member(rjson::value& value, std::string_view name);
struct single_value_comp {
bool operator()(const rjson::value& r1, const rjson::value& r2) const;
};
} // end namespace rjson
namespace std {
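Note on the move from string_ref_type to std::string_view in find() and get(): the comments in rjson.cc say the StringRef variant of FindMember() ignores the given length. The sketch below models why that matters when the name is a view into a larger buffer. The container and both lookup functions are hypothetical stand-ins, not RapidJSON APIs.

```cpp
#include <cstring>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical stand-in for a JSON object: a list of (name, value) members.
using members = std::vector<std::pair<std::string, int>>;

// Length-aware lookup: compares exactly name.size() characters.
inline const int* find_by_view(const members& obj, std::string_view name) {
    for (const auto& [key, value] : obj) {
        if (std::string_view(key) == name) {
            return &value;
        }
    }
    return nullptr;
}

// Length-ignoring lookup: treats the pointer as a NUL-terminated string.
// This models the behavior described in the comments; it is an assumption
// about the effect, not a copy of RapidJSON's code.
inline const int* find_by_cstr(const members& obj, const char* name) {
    for (const auto& [key, value] : obj) {
        if (std::strcmp(key.c_str(), name) == 0) {
            return &value;
        }
    }
    return nullptr;
}
```

If `name` is the first nine characters of the buffer "TableName.extra", the length-aware lookup finds the member "TableName", while the C-string lookup reads to the terminating NUL and misses it.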

124
alternator/rmw_operation.hh Normal file
View File

@@ -0,0 +1,124 @@
/*
* Copyright 2020 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU Affero General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <seastarx.hh>
#include <service/storage_proxy.hh>
#include <service/storage_proxy.hh>
#include "rjson.hh"
#include "executor.hh"
namespace alternator {
// An rmw_operation encapsulates the common logic of all the item update
// operations which may involve a read of the item before the write
// (so-called Read-Modify-Write operations). These operations include PutItem,
// UpdateItem and DeleteItem: All of these may be conditional operations (the
// "Expected" parameter) which require a read before the write, and UpdateItem
// may also have an update expression which refers to the item's old value.
//
// The code below supports running the read and the write together as one
// transaction using LWT (this is why rmw_operation is a subclass of
// cas_request, as required by storage_proxy::cas()), but also has optional
// modes not using LWT.
class rmw_operation : public service::cas_request, public enable_shared_from_this<rmw_operation> {
public:
// The following options choose which mechanism to use for isolating
// parallel write operations:
// * The FORBID_RMW option forbids RMW (read-modify-write) operations
// such as conditional updates. For the remaining write-only
// operations, ordinary quorum writes are isolated enough.
// * The LWT_ALWAYS option always uses LWT (lightweight transactions)
// for any write operation - whether or not it also has a read.
// * The LWT_RMW_ONLY option uses LWT only for RMW operations, and uses
// ordinary quorum writes for write-only operations.
// This option is not safe if the user may send both RMW and write-only
// operations on the same item.
// * The UNSAFE_RMW option does read-modify-write operations as separate
// read and write. It is unsafe - concurrent RMW operations are not
// isolated at all. This option will likely be removed in the future.
enum class write_isolation {
FORBID_RMW, LWT_ALWAYS, LWT_RMW_ONLY, UNSAFE_RMW
};
static constexpr auto WRITE_ISOLATION_TAG_KEY = "system:write_isolation";
static write_isolation get_write_isolation_for_schema(schema_ptr schema);
protected:
// The full request JSON
rjson::value _request;
// All RMW operations involve a single item with a specific partition
// and optional clustering key, in a single table, so the following
// information is common to all of them:
schema_ptr _schema;
partition_key _pk = partition_key::make_empty();
clustering_key _ck = clustering_key::make_empty();
write_isolation _write_isolation;
// All RMW operations can have a ReturnValues parameter from the following
// choices. But note that only UpdateItem actually supports all of them:
enum class returnvalues {
NONE, ALL_OLD, UPDATED_OLD, ALL_NEW, UPDATED_NEW
} _returnvalues;
static returnvalues parse_returnvalues(const rjson::value& request);
// When _returnvalues != NONE, apply() should store here, in JSON form,
// the values which are to be returned in the "Attributes" field.
// The default null JSON means do not return an Attributes field at all.
// This field is marked "mutable" so that the const apply() can modify
// it (see explanation below), but note that because apply() may be
// called more than once, if apply() will sometimes set this field it
// must set it (even if just to the default empty value) every time.
mutable rjson::value _return_attributes;
public:
// The constructor of a rmw_operation subclass should parse the request
// and try to discover as many input errors as it can before really
// attempting the read or write operations.
rmw_operation(service::storage_proxy& proxy, rjson::value&& request);
// rmw_operation subclasses (update_item_operation, put_item_operation
// and delete_item_operation) shall implement an apply() function which
// takes the previous value of the item (if it was read) and creates the
// write mutation. If the previous value of item does not pass the needed
// conditional expression, apply() should return an empty optional.
// apply() may throw if it encounters input errors not discovered during
// the constructor.
// apply() may be called more than once in case of contention, so it must
// not change the state saved in the object (issue #7218 was caused by
// violating this). We mark apply() "const" to let the compiler validate
// this for us. The output-only field _return_attributes is marked
// "mutable" above so that apply() can still write to it.
virtual std::optional<mutation> apply(std::unique_ptr<rjson::value> previous_item, api::timestamp_type ts) const = 0;
// Convert the above apply() into the signature needed by cas_request:
virtual std::optional<mutation> apply(query::result& qr, const query::partition_slice& slice, api::timestamp_type ts) override;
virtual ~rmw_operation() = default;
schema_ptr schema() const { return _schema; }
const rjson::value& request() const { return _request; }
rjson::value&& move_request() && { return std::move(_request); }
future<executor::request_return_type> execute(service::storage_proxy& proxy,
service::client_state& client_state,
tracing::trace_state_ptr trace_state,
service_permit permit,
bool needs_read_before_write,
stats& stats);
std::optional<shard_id> shard_for_execute(bool needs_read_before_write);
};
} // namespace alternator
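Note on the apply() contract described in rmw_operation.hh: apply() is const because it may be called more than once under contention, and the output-only field is mutable and must be reset on every call. A minimal sketch of that contract follows, using ints and strings instead of mutations and rjson values. The class `counter_increment` and its condition are hypothetical.

```cpp
#include <optional>
#include <string>

// Hypothetical, simplified stand-in for an RMW operation.
class counter_increment {
    int _delta;
    // Output-only field: mutable so that the const apply() may write it.
    mutable std::string _return_attributes;
public:
    explicit counter_increment(int delta) : _delta(delta) {}

    // Returns the new value, or an empty optional if the (made-up) condition
    // "previous value must be non-negative" fails. Request state (_delta) is
    // never modified, so calling apply() again gives the same answer.
    std::optional<int> apply(int previous) const {
        _return_attributes.clear();   // reset on every call, even on failure
        if (previous < 0) {
            return std::nullopt;
        }
        _return_attributes = "old=" + std::to_string(previous);
        return previous + _delta;
    }

    const std::string& return_attributes() const { return _return_attributes; }
};
```

Because apply() cannot touch `_delta`, a retry after contention sees the same request state as the first attempt, and a failed condition leaves no stale output from an earlier call.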


@@ -25,6 +25,7 @@
#include "error.hh"
#include "rapidjson/writer.h"
#include "concrete_types.hh"
#include "cql3/type_json.hh"
static logging::logger slogger("alternator-serialization");
@@ -77,7 +78,7 @@ struct from_json_visitor {
}
// default
void operator()(const abstract_type& t) const {
bo.write(t.from_json_object(Json::Value(rjson::print(v)), cql_serialization_format::internal()));
bo.write(from_json_object(t, Json::Value(rjson::print(v)), cql_serialization_format::internal()));
}
};
@@ -107,7 +108,7 @@ struct to_json_visitor {
void operator()(const reversed_type_impl& t) const { visit(*t.underlying_type(), to_json_visitor{deserialized, type_ident, bv}); };
void operator()(const decimal_type_impl& t) const {
auto s = decimal_type->to_json_string(bytes(bv));
auto s = to_json_string(*decimal_type, bytes(bv));
//FIXME(sarna): unnecessary copy
rjson::set_with_string_name(deserialized, type_ident, rjson::from_string(s));
}
@@ -135,7 +136,7 @@ rjson::value deserialize_item(bytes_view bv) {
if (atype == alternator_type::NOT_SUPPORTED_YET) {
slogger.trace("Non-optimal deserialization of alternator type {}", int8_t(atype));
return rjson::parse_raw(reinterpret_cast<const char *>(bv.data()), bv.size());
return rjson::parse(std::string_view(reinterpret_cast<const char *>(bv.data()), bv.size()));
}
type_representation type_representation = represent_type(atype);
visit(*type_representation.dtype, to_json_visitor{deserialized, type_representation.ident, bv});
@@ -159,27 +160,34 @@ std::string type_to_string(data_type type) {
bytes get_key_column_value(const rjson::value& item, const column_definition& column) {
std::string column_name = column.name_as_text();
std::string expected_type = type_to_string(column.type);
const rjson::value& key_typed_value = rjson::get(item, rjson::value::StringRefType(column_name.c_str()));
if (!key_typed_value.IsObject() || key_typed_value.MemberCount() != 1) {
throw api_error("ValidationException",
format("Missing or invalid value object for key column {}: {}", column_name, item));
const rjson::value* key_typed_value = rjson::find(item, column_name);
if (!key_typed_value) {
throw api_error("ValidationException", format("Key column {} not found", column_name));
}
return get_key_from_typed_value(key_typed_value, column, expected_type);
return get_key_from_typed_value(*key_typed_value, column);
}
bytes get_key_from_typed_value(const rjson::value& key_typed_value, const column_definition& column, const std::string& expected_type) {
// Parses the JSON encoding for a key value, which is a map with a single
// entry, whose key is the type (expected to match the key column's type)
// and the value is the encoded value.
bytes get_key_from_typed_value(const rjson::value& key_typed_value, const column_definition& column) {
if (!key_typed_value.IsObject() || key_typed_value.MemberCount() != 1 ||
!key_typed_value.MemberBegin()->value.IsString()) {
throw api_error("ValidationException",
format("Malformed value object for key column {}: {}",
column.name_as_text(), key_typed_value));
}
auto it = key_typed_value.MemberBegin();
if (it->name.GetString() != expected_type) {
if (it->name != type_to_string(column.type)) {
throw api_error("ValidationException",
format("Type mismatch: expected type {} for key column {}, got type {}",
expected_type, column.name_as_text(), it->name.GetString()));
type_to_string(column.type), column.name_as_text(), it->name.GetString()));
}
if (column.type == bytes_type) {
return base64_decode(it->value);
} else {
return column.type->from_string(it->value.GetString());
return column.type->from_string(rjson::to_string_view(it->value));
}
}
@@ -194,7 +202,7 @@ rjson::value json_key_column_value(bytes_view cell, const column_definition& col
// FIXME: use specialized Alternator number type, not the more
// general "decimal_type". A dedicated type can be more efficient
// in storage space and in parsing speed.
auto s = decimal_type->to_json_string(bytes(cell));
auto s = to_json_string(*decimal_type, bytes(cell));
return rjson::from_string(s);
} else {
// We shouldn't get here, we shouldn't see such key columns.
@@ -245,4 +253,16 @@ big_decimal unwrap_number(const rjson::value& v, std::string_view diagnostic) {
return big_decimal(it->value.GetString());
}
const std::pair<std::string, const rjson::value*> unwrap_set(const rjson::value& v) {
if (!v.IsObject() || v.MemberCount() != 1) {
return {"", nullptr};
}
auto it = v.MemberBegin();
const std::string it_key = it->name.GetString();
if (it_key != "SS" && it_key != "BS" && it_key != "NS") {
return {"", nullptr};
}
return std::make_pair(it_key, &(it->value));
}
}
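Note on unwrap_set() above: it accepts only an object with exactly one member whose name is "SS", "BS" or "NS", and otherwise returns {"", nullptr}. The sketch below mirrors that logic on a std::map so it can be compiled without rjson. `typed_value` and `unwrap_set_sketch` are hypothetical names.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a typed JSON value such as {"SS": ["a", "b"]}.
using typed_value = std::map<std::string, std::vector<std::string>>;

// Returns the set's type tag and a pointer to its elements,
// or {"", nullptr} if v does not encode a set.
inline std::pair<std::string, const std::vector<std::string>*>
unwrap_set_sketch(const typed_value& v) {
    if (v.size() != 1) {
        return {"", nullptr};
    }
    const auto& [key, value] = *v.begin();
    if (key != "SS" && key != "BS" && key != "NS") {
        return {"", nullptr};
    }
    return {key, &value};
}
```

The returned pointer refers to data owned by the argument, so the caller has to keep `v` alive while using it, as with the original function.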


@@ -24,7 +24,7 @@
#include <string>
#include <string_view>
#include "types.hh"
#include "schema.hh"
#include "schema_fwd.hh"
#include "keys.hh"
#include "rjson.hh"
#include "utils/big_decimal.hh"
@@ -54,7 +54,7 @@ rjson::value deserialize_item(bytes_view bv);
std::string type_to_string(data_type type);
bytes get_key_column_value(const rjson::value& item, const column_definition& column);
bytes get_key_from_typed_value(const rjson::value& key_typed_value, const column_definition& column, const std::string& expected_type);
bytes get_key_from_typed_value(const rjson::value& key_typed_value, const column_definition& column);
rjson::value json_key_column_value(bytes_view cell, const column_definition& column);
partition_key pk_from_json(const rjson::value& item, schema_ptr schema);
@@ -63,4 +63,10 @@ clustering_key ck_from_json(const rjson::value& item, schema_ptr schema);
// If v encodes a number (i.e., it is a {"N": [...]}), returns an object representing it. Otherwise,
// raises ValidationException with diagnostic.
big_decimal unwrap_number(const rjson::value& v, std::string_view diagnostic);
// Check if a given JSON object encodes a set (i.e., it is a {"SS": [...]}, {"NS": [...]} or {"BS": [...]})
// and return the set's type and a pointer to that set. If the object does not encode a set,
// the returned value is {"", nullptr}.
const std::pair<std::string, const rjson::value*> unwrap_set(const rjson::value& v);
}


@@ -29,6 +29,8 @@
#include "auth.hh"
#include <cctype>
#include "cql3/query_processor.hh"
#include "service/storage_service.hh"
#include "utils/overloaded_functor.hh"
static logging::logger slogger("alternator-server");
@@ -65,9 +67,9 @@ inline std::vector<std::string_view> split(std::string_view text, char separator
// Internal Server Error.
class api_handler : public handler_base {
public:
api_handler(const future_json_function& _handle) : _f_handle(
[_handle](std::unique_ptr<request> req, std::unique_ptr<reply> rep) {
return seastar::futurize_apply(_handle, std::move(req)).then_wrapped([rep = std::move(rep)](future<json::json_return_type> resf) mutable {
api_handler(const std::function<future<executor::request_return_type>(std::unique_ptr<request> req)>& _handle) : _f_handle(
[this, _handle](std::unique_ptr<request> req, std::unique_ptr<reply> rep) {
return seastar::futurize_apply(_handle, std::move(req)).then_wrapped([this, rep = std::move(rep)](future<executor::request_return_type> resf) mutable {
if (resf.failed()) {
// Exceptions of type api_error are wrapped as JSON and
// returned to the client as expected. Other types of
@@ -86,20 +88,24 @@ public:
format("Internal server error: {}", std::current_exception()),
reply::status_type::internal_server_error);
}
// FIXME: what is this version number?
rep->_content += "{\"__type\":\"com.amazonaws.dynamodb.v20120810#" + ret._type + "\"," +
"\"message\":\"" + ret._msg + "\"}";
rep->_status = ret._http_code;
slogger.trace("api_handler error case: {}", rep->_content);
generate_error_reply(*rep, ret);
return make_ready_future<std::unique_ptr<reply>>(std::move(rep));
}
slogger.trace("api_handler success case");
auto res = resf.get0();
if (res._body_writer) {
rep->write_body("json", std::move(res._body_writer));
} else {
rep->_content += res._res;
}
std::visit(overloaded_functor {
[&] (const json::json_return_type& json_return_value) {
slogger.trace("api_handler success case");
if (json_return_value._body_writer) {
rep->write_body("json", std::move(json_return_value._body_writer));
} else {
rep->_content += json_return_value._res;
}
},
[&] (const api_error& err) {
generate_error_reply(*rep, err);
}
}, res);
return make_ready_future<std::unique_ptr<reply>>(std::move(rep));
});
}), _type("json") { }
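Note on the std::visit dispatch in api_handler above: the handler result is a variant holding either a JSON result or an api_error, and one lambda per alternative fills the reply. A self-contained sketch of the pattern follows. All type names here are hypothetical, and the local `overloaded` helper merely stands in for utils/overloaded_functor.hh, whose contents are not shown in this diff.

```cpp
#include <string>
#include <variant>

// Local helper that merges several lambdas into one callable (C++17).
template <class... Ts> struct overloaded : Ts... { using Ts::operator()...; };
template <class... Ts> overloaded(Ts...) -> overloaded<Ts...>;

// Hypothetical stand-ins for the types used by the real handler.
struct json_result { std::string body; };
struct api_error_sketch { std::string type; std::string msg; int http_code; };
using request_return = std::variant<json_result, api_error_sketch>;
struct reply_sketch { std::string content; int status = 200; };

// Fill the reply from whichever alternative the variant holds.
inline void fill_reply(reply_sketch& rep, const request_return& res) {
    std::visit(overloaded{
        [&](const json_result& ok) {
            rep.content += ok.body;
        },
        [&](const api_error_sketch& err) {
            rep.content += "{\"__type\":\"com.amazonaws.dynamodb.v20120810#" + err.type + "\"," +
                    "\"message\":\"" + err.msg + "\"}";
            rep.status = err.http_code;
        }
    }, res);
}
```

Returning the error as a variant alternative lets an operation report a failure without throwing, while thrown exceptions can still be handled on the failed-future path.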
@@ -115,18 +121,66 @@ public:
}
protected:
void generate_error_reply(reply& rep, const api_error& err) {
rep._content += "{\"__type\":\"com.amazonaws.dynamodb.v20120810#" + err._type + "\"," +
"\"message\":\"" + err._msg + "\"}";
rep._status = err._http_code;
slogger.trace("api_handler error case: {}", rep._content);
}
future_handler_function _f_handle;
sstring _type;
};
class health_handler : public handler_base {
virtual future<std::unique_ptr<reply>> handle(const sstring& path, std::unique_ptr<request> req, std::unique_ptr<reply> rep) override {
class gated_handler : public handler_base {
seastar::gate& _gate;
public:
gated_handler(seastar::gate& gate) : _gate(gate) {}
virtual future<std::unique_ptr<reply>> do_handle(const sstring& path, std::unique_ptr<request> req, std::unique_ptr<reply> rep) = 0;
virtual future<std::unique_ptr<reply>> handle(const sstring& path, std::unique_ptr<request> req, std::unique_ptr<reply> rep) final override {
return with_gate(_gate, [this, &path, req = std::move(req), rep = std::move(rep)] () mutable {
return do_handle(path, std::move(req), std::move(rep));
});
}
};
class health_handler : public gated_handler {
public:
health_handler(seastar::gate& pending_requests) : gated_handler(pending_requests) {}
protected:
virtual future<std::unique_ptr<reply>> do_handle(const sstring& path, std::unique_ptr<request> req, std::unique_ptr<reply> rep) override {
rep->set_status(reply::status_type::ok);
rep->write_body("txt", format("healthy: {}", req->get_header("Host")));
return make_ready_future<std::unique_ptr<reply>>(std::move(rep));
}
};
class local_nodelist_handler : public gated_handler {
public:
local_nodelist_handler(seastar::gate& pending_requests) : gated_handler(pending_requests) {}
protected:
virtual future<std::unique_ptr<reply>> do_handle(const sstring& path, std::unique_ptr<request> req, std::unique_ptr<reply> rep) override {
rjson::value results = rjson::empty_array();
// It's very easy to get a list of all live nodes on the cluster,
// using gms::get_local_gossiper().get_live_members(). But getting
// just the list of live nodes in this DC needs more elaborate code:
sstring local_dc = locator::i_endpoint_snitch::get_local_snitch_ptr()->get_datacenter(
utils::fb_utilities::get_broadcast_address());
std::unordered_set<gms::inet_address> local_dc_nodes =
service::get_local_storage_service().get_token_metadata().
get_topology().get_datacenter_endpoints().at(local_dc);
for (auto& ip : local_dc_nodes) {
if (gms::get_local_gossiper().is_alive(ip)) {
rjson::push_back(results, rjson::from_string(ip.to_sstring()));
}
}
rep->set_status(reply::status_type::ok);
rep->set_content_type("json");
rep->_content = rjson::print(results);
return make_ready_future<std::unique_ptr<reply>>(std::move(rep));
}
};
future<> server::verify_signature(const request& req) {
if (!_enforce_authorization) {
slogger.debug("Skipping authorization");
@@ -137,7 +191,7 @@ future<> server::verify_signature(const request& req) {
throw api_error("InvalidSignatureException", "Host header is mandatory for signature verification");
}
auto authorization_it = req._headers.find("Authorization");
if (host_it == req._headers.end()) {
if (authorization_it == req._headers.end()) {
throw api_error("InvalidSignatureException", "Authorization header is mandatory for signature verification");
}
std::string host = host_it->second;
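Note on the header check fixed in this hunk: the old code tested host_it a second time instead of authorization_it, so a missing Authorization header went undetected. One way to make that kind of slip harder, sketched below under the assumption that headers are a plain string map, is to put lookup and check into a single helper. `require_header` is a hypothetical function, not part of the Alternator server.

```cpp
#include <map>
#include <stdexcept>
#include <string>

using headers = std::map<std::string, std::string>;

// Hypothetical helper: looks up a mandatory header and throws if it is
// missing, so the iterator that is checked is always the one just obtained.
inline const std::string& require_header(const headers& h, const std::string& name) {
    auto it = h.find(name);
    if (it == h.end()) {
        throw std::invalid_argument(name + " header is mandatory for signature verification");
    }
    return it->second;
}
```

A caller would fetch both values with `require_header(h, "Host")` and `require_header(h, "Authorization")`; the real code throws api_error rather than std::invalid_argument.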
@@ -214,7 +268,8 @@ future<> server::verify_signature(const request& req) {
});
}
future<json::json_return_type> server::handle_api_request(std::unique_ptr<request>&& req) {
future<executor::request_return_type> server::handle_api_request(std::unique_ptr<request>&& req) {
_executor._stats.total_operations++;
sstring target = req->get_header(TARGET);
std::vector<std::string_view> split_target = split(target, '.');
//NOTICE(sarna): Target consists of Dynamo API version followed by a dot '.' and operation type (e.g. CreateTable)
@@ -223,17 +278,32 @@ future<json::json_return_type> server::handle_api_request(std::unique_ptr<reques
return verify_signature(*req).then([this, op, req = std::move(req)] () mutable {
auto callback_it = _callbacks.find(op);
if (callback_it == _callbacks.end()) {
_executor.local()._stats.unsupported_operations++;
_executor._stats.unsupported_operations++;
throw api_error("UnknownOperationException",
format("Unsupported operation {}", op));
}
//FIXME: Client state can provide more context, e.g. client's endpoint address
// We use unique_ptr because client_state cannot be moved or copied
return do_with(std::make_unique<executor::client_state>(executor::client_state::internal_tag()), [this, callback_it = std::move(callback_it), op = std::move(op), req = std::move(req)] (std::unique_ptr<executor::client_state>& client_state) mutable {
client_state->set_raw_keyspace(executor::KEYSPACE_NAME);
executor::maybe_trace_query(*client_state, op, req->content);
tracing::trace(client_state->get_trace_state(), op);
return callback_it->second(_executor.local(), *client_state, std::move(req));
return with_gate(_pending_requests, [this, callback_it = std::move(callback_it), op = std::move(op), req = std::move(req)] () mutable {
//FIXME: Client state can provide more context, e.g. client's endpoint address
// We use unique_ptr because client_state cannot be moved or copied
return do_with(std::make_unique<executor::client_state>(executor::client_state::internal_tag()),
[this, callback_it = std::move(callback_it), op = std::move(op), req = std::move(req)] (std::unique_ptr<executor::client_state>& client_state) mutable {
tracing::trace_state_ptr trace_state = executor::maybe_trace_query(*client_state, op, req->content);
tracing::trace(trace_state, op);
// JSON parsing can allocate up to roughly 2x the size of the raw document, + a couple of bytes for maintenance.
// FIXME: by this time, the whole HTTP request was already read, so some memory is already occupied.
// Once HTTP allows working on streams, we should grab the permit *before* reading the HTTP payload.
size_t mem_estimate = req->content.size() * 3 + 8000;
auto units_fut = get_units(*_memory_limiter, mem_estimate);
if (_memory_limiter->waiters()) {
++_executor._stats.requests_blocked_memory;
}
return units_fut.then([this, callback_it = std::move(callback_it), &client_state, trace_state, req = std::move(req)] (semaphore_units<> units) mutable {
return _json_parser.parse(req->content).then([this, callback_it = std::move(callback_it), &client_state, trace_state,
units = std::move(units), req = std::move(req)] (rjson::value json_request) mutable {
return callback_it->second(_executor, *client_state, trace_state, make_service_permit(std::move(units)), std::move(json_request), std::move(req)).finally([trace_state] {});
});
});
});
});
});
}
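Note on the memory estimate in handle_api_request above: the code reserves `content.size() * 3 + 8000` units before parsing. One plausible reading, which the diff does not spell out, is 1x for the raw body already in memory plus the "roughly 2x" that parsing may allocate, plus a fixed overhead. The sketch below shows the arithmetic with a non-blocking budget. `memory_budget` is a hypothetical class that refuses a request instead of waiting, unlike the semaphore used in the real code.

```cpp
#include <cstddef>

// Same formula as in the diff: three times the body size plus a fixed overhead.
inline constexpr std::size_t estimate_request_memory(std::size_t content_size) {
    return content_size * 3 + 8000;
}

// Hypothetical, non-blocking budget used only to illustrate the accounting.
class memory_budget {
    std::size_t _available;
public:
    explicit memory_budget(std::size_t total) : _available(total) {}

    // Takes n units if they are available; returns false otherwise.
    bool try_take(std::size_t n) {
        if (n > _available) {
            return false;
        }
        _available -= n;
        return true;
    }

    // Returns n units to the budget once the request is done.
    void give_back(std::size_t n) { _available += n; }

    std::size_t available() const { return _available; }
};
```

For a 1000-byte request the estimate is 11000 units, so a budget of 20000 units admits one such request at a time until the first one gives its units back.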
@@ -243,35 +313,88 @@ void server::set_routes(routes& r) {
return handle_api_request(std::move(req));
});
r.add(operation_type::POST, url("/"), req_handler);
r.add(operation_type::GET, url("/"), new health_handler);
r.put(operation_type::POST, "/", req_handler);
r.put(operation_type::GET, "/", new health_handler(_pending_requests));
// The "/localnodes" request is a new Alternator feature, not supported by
// DynamoDB and not required for DynamoDB compatibility. It allows a
// client to enquire - using a trivial HTTP request without requiring
// authentication - the list of all live nodes in the same data center of
// the Alternator cluster. The client can use this list to balance its
// request load to all the nodes in the same geographical region.
// Note that this API exposes - openly without authentication - the
// information on the cluster's members inside one data center. We do not
// consider this to be a security risk, because an attacker can already
// scan an entire subnet for nodes responding to the health request,
// or even just scan for open ports.
r.put(operation_type::GET, "/localnodes", new local_nodelist_handler(_pending_requests));
}
//FIXME: A way to immediately invalidate the cache should be considered,
// e.g. when the system table which stores the keys is changed.
// For now, this propagation may take up to 1 minute.
server::server(seastar::sharded<executor>& e)
: _executor(e), _key_cache(1024, 1min, slogger), _enforce_authorization(false)
server::server(executor& exec)
: _http_server("http-alternator")
, _https_server("https-alternator")
, _executor(exec)
, _key_cache(1024, 1min, slogger)
, _enforce_authorization(false)
, _enabled_servers{}
, _pending_requests{}
, _callbacks{
{"CreateTable", [] (executor& e, executor::client_state& client_state, std::unique_ptr<request> req) {
return e.maybe_create_keyspace().then([&e, &client_state, req = std::move(req)] { return e.create_table(client_state, req->content); }); }
},
{"DescribeTable", [] (executor& e, executor::client_state& client_state, std::unique_ptr<request> req) { return e.describe_table(client_state, req->content); }},
{"DeleteTable", [] (executor& e, executor::client_state& client_state, std::unique_ptr<request> req) { return e.delete_table(client_state, req->content); }},
{"PutItem", [] (executor& e, executor::client_state& client_state, std::unique_ptr<request> req) { return e.put_item(client_state, req->content); }},
{"UpdateItem", [] (executor& e, executor::client_state& client_state, std::unique_ptr<request> req) { return e.update_item(client_state, req->content); }},
{"GetItem", [] (executor& e, executor::client_state& client_state, std::unique_ptr<request> req) { return e.get_item(client_state, req->content); }},
{"DeleteItem", [] (executor& e, executor::client_state& client_state, std::unique_ptr<request> req) { return e.delete_item(client_state, req->content); }},
{"ListTables", [] (executor& e, executor::client_state& client_state, std::unique_ptr<request> req) { return e.list_tables(client_state, req->content); }},
{"Scan", [] (executor& e, executor::client_state& client_state, std::unique_ptr<request> req) { return e.scan(client_state, req->content); }},
{"DescribeEndpoints", [] (executor& e, executor::client_state& client_state, std::unique_ptr<request> req) { return e.describe_endpoints(client_state, req->content, req->get_header("Host")); }},
{"BatchWriteItem", [] (executor& e, executor::client_state& client_state, std::unique_ptr<request> req) { return e.batch_write_item(client_state, req->content); }},
{"BatchGetItem", [] (executor& e, executor::client_state& client_state, std::unique_ptr<request> req) { return e.batch_get_item(client_state, req->content); }},
{"Query", [] (executor& e, executor::client_state& client_state, std::unique_ptr<request> req) { return e.query(client_state, req->content); }},
{"CreateTable", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.create_table(client_state, std::move(trace_state), std::move(permit), std::move(json_request));
}},
{"DescribeTable", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.describe_table(client_state, std::move(trace_state), std::move(permit), std::move(json_request));
}},
{"DeleteTable", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.delete_table(client_state, std::move(trace_state), std::move(permit), std::move(json_request));
}},
{"PutItem", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.put_item(client_state, std::move(trace_state), std::move(permit), std::move(json_request));
}},
{"UpdateItem", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.update_item(client_state, std::move(trace_state), std::move(permit), std::move(json_request));
}},
{"GetItem", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.get_item(client_state, std::move(trace_state), std::move(permit), std::move(json_request));
}},
{"DeleteItem", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.delete_item(client_state, std::move(trace_state), std::move(permit), std::move(json_request));
}},
{"ListTables", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.list_tables(client_state, std::move(permit), std::move(json_request));
}},
{"Scan", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.scan(client_state, std::move(trace_state), std::move(permit), std::move(json_request));
}},
{"DescribeEndpoints", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.describe_endpoints(client_state, std::move(permit), std::move(json_request), req->get_header("Host"));
}},
{"BatchWriteItem", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.batch_write_item(client_state, std::move(trace_state), std::move(permit), std::move(json_request));
}},
{"BatchGetItem", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.batch_get_item(client_state, std::move(trace_state), std::move(permit), std::move(json_request));
}},
{"Query", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.query(client_state, std::move(trace_state), std::move(permit), std::move(json_request));
}},
{"TagResource", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.tag_resource(client_state, std::move(permit), std::move(json_request));
}},
{"UntagResource", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.untag_resource(client_state, std::move(permit), std::move(json_request));
}},
{"ListTagsOfResource", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.list_tags_of_resource(client_state, std::move(permit), std::move(json_request));
}},
} {
}
future<> server::init(net::inet_address addr, std::optional<uint16_t> port, std::optional<uint16_t> https_port, std::optional<tls::credentials_builder> creds, bool enforce_authorization) {
future<> server::init(net::inet_address addr, std::optional<uint16_t> port, std::optional<uint16_t> https_port, std::optional<tls::credentials_builder> creds,
bool enforce_authorization, semaphore* memory_limiter) {
_memory_limiter = memory_limiter;
_enforce_authorization = enforce_authorization;
if (!port && !https_port) {
return make_exception_future<>(std::runtime_error("Either regular port or TLS port"
@@ -279,33 +402,82 @@ future<> server::init(net::inet_address addr, std::optional<uint16_t> port, std:
}
return seastar::async([this, addr, port, https_port, creds] {
try {
_executor.invoke_on_all([] (executor& e) {
return e.start();
}).get();
_executor.start().get();
if (port) {
_control.start().get();
_control.set_routes(std::bind(&server::set_routes, this, std::placeholders::_1)).get();
_control.listen(socket_address{addr, *port}).get();
set_routes(_http_server._routes);
_http_server.set_content_length_limit(server::content_length_limit);
_http_server.listen(socket_address{addr, *port}).get();
_enabled_servers.push_back(std::ref(_http_server));
slogger.info("Alternator HTTP server listening on {} port {}", addr, *port);
}
if (https_port) {
_https_control.start().get();
_https_control.set_routes(std::bind(&server::set_routes, this, std::placeholders::_1)).get();
_https_control.server().invoke_on_all([creds] (http_server& serv) {
return serv.set_tls_credentials(creds->build_server_credentials());
}).get();
_https_control.listen(socket_address{addr, *https_port}).get();
set_routes(_https_server._routes);
_https_server.set_content_length_limit(server::content_length_limit);
_https_server.set_tls_credentials(creds->build_server_credentials());
_https_server.listen(socket_address{addr, *https_port}).get();
_enabled_servers.push_back(std::ref(_https_server));
slogger.info("Alternator HTTPS server listening on {} port {}", addr, *https_port);
}
} catch (...) {
slogger.warn("Failed to set up Alternator HTTP server on {} port {}, TLS port {}: {}",
slogger.error("Failed to set up Alternator HTTP server on {} port {}, TLS port {}: {}",
addr, port ? std::to_string(*port) : "OFF", https_port ? std::to_string(*https_port) : "OFF", std::current_exception());
throw;
std::throw_with_nested(std::runtime_error(
format("Failed to set up Alternator HTTP server on {} port {}, TLS port {}",
addr, port ? std::to_string(*port) : "OFF", https_port ? std::to_string(*https_port) : "OFF")));
}
});
}
future<> server::stop() {
return parallel_for_each(_enabled_servers, [] (http_server& server) {
return server.stop();
}).then([this] {
return _pending_requests.close();
}).then([this] {
return _json_parser.stop();
});
}
server::json_parser::json_parser() : _run_parse_json_thread(async([this] {
while (true) {
_document_waiting.wait().get();
if (_as.abort_requested()) {
return;
}
try {
_parsed_document = rjson::parse_yieldable(_raw_document);
_current_exception = nullptr;
} catch (...) {
_current_exception = std::current_exception();
}
_document_parsed.signal();
}
})) {
}
future<rjson::value> server::json_parser::parse(std::string_view content) {
if (content.size() < yieldable_parsing_threshold) {
return make_ready_future<rjson::value>(rjson::parse(content));
}
return with_semaphore(_parsing_sem, 1, [this, content] {
_raw_document = content;
_document_waiting.signal();
return _document_parsed.wait().then([this] {
if (_current_exception) {
return make_exception_future<rjson::value>(_current_exception);
}
return make_ready_future<rjson::value>(std::move(_parsed_document));
});
});
}
future<> server::json_parser::stop() {
_as.request_abort();
_document_waiting.signal();
_document_parsed.broken();
return std::move(_run_parse_json_thread);
}
}
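
The new `catch` block in `server::init` above uses `std::throw_with_nested`, so the original failure travels inside the outer "Failed to set up Alternator HTTP server" error. As a minimal sketch, assuming only the standard library (the helper `describe_nested` and the stand-in `failing_setup` are hypothetical names and are not part of this diff), a caller could flatten such a chain like this:

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical helper (not in the diff): flattens a chain of nested
// exceptions into a single "outer: inner: ..." message.
std::string describe_nested(const std::exception& e) {
    std::string msg = e.what();
    try {
        // Rethrows the wrapped exception if e carries one; otherwise does nothing.
        std::rethrow_if_nested(e);
    } catch (const std::exception& inner) {
        msg += ": " + describe_nested(inner);
    } catch (...) {
        msg += ": <non-standard exception>";
    }
    return msg;
}

// Hypothetical stand-in for a failing listen() call, wrapped the same way
// the catch block in server::init wraps its failure.
void failing_setup() {
    try {
        throw std::runtime_error("bind: address already in use");
    } catch (...) {
        std::throw_with_nested(std::runtime_error("Failed to set up HTTP server"));
    }
}
```

Calling `failing_setup()` and passing the caught exception to `describe_nested` yields the outer message followed by the inner one, so the root cause is not lost when the outer error is logged.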


@@ -27,27 +27,56 @@
#include <seastar/net/tls.hh>
#include <optional>
#include <alternator/auth.hh>
#include <utils/small_vector.hh>
#include <seastar/core/units.hh>
namespace alternator {
class server {
using alternator_callback = std::function<future<json::json_return_type>(executor&, executor::client_state&, std::unique_ptr<request>)>;
static constexpr size_t content_length_limit = 16*MB;
using alternator_callback = std::function<future<executor::request_return_type>(executor&, executor::client_state&,
tracing::trace_state_ptr, service_permit, rjson::value, std::unique_ptr<request>)>;
using alternator_callbacks_map = std::unordered_map<std::string_view, alternator_callback>;
seastar::httpd::http_server_control _control;
seastar::httpd::http_server_control _https_control;
seastar::sharded<executor>& _executor;
http_server _http_server;
http_server _https_server;
executor& _executor;
key_cache _key_cache;
bool _enforce_authorization;
utils::small_vector<std::reference_wrapper<seastar::httpd::http_server>, 2> _enabled_servers;
gate _pending_requests;
alternator_callbacks_map _callbacks;
public:
server(seastar::sharded<executor>& executor);
seastar::future<> init(net::inet_address addr, std::optional<uint16_t> port, std::optional<uint16_t> https_port, std::optional<tls::credentials_builder> creds, bool enforce_authorization);
semaphore* _memory_limiter;
class json_parser {
static constexpr size_t yieldable_parsing_threshold = 16*KB;
std::string_view _raw_document;
rjson::value _parsed_document;
std::exception_ptr _current_exception;
semaphore _parsing_sem{1};
condition_variable _document_waiting;
condition_variable _document_parsed;
abort_source _as;
future<> _run_parse_json_thread;
public:
json_parser();
future<rjson::value> parse(std::string_view content);
future<> stop();
};
json_parser _json_parser;
public:
server(executor& executor);
future<> init(net::inet_address addr, std::optional<uint16_t> port, std::optional<uint16_t> https_port, std::optional<tls::credentials_builder> creds,
bool enforce_authorization, semaphore* memory_limiter);
future<> stop();
private:
void set_routes(seastar::httpd::routes& r);
future<> verify_signature(const seastar::httpd::request& r);
future<json::json_return_type> handle_api_request(std::unique_ptr<request>&& req);
future<executor::request_return_type> handle_api_request(std::unique_ptr<request>&& req);
};
}
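
The `alternator_callbacks_map` alias above maps an operation name to a callable. The following is a simplified sketch of that dispatch-table pattern using only the standard library; the reduced callback signature, `make_callbacks`, and `dispatch` are assumptions made for illustration and do not reproduce the real `executor` interface:

```cpp
#include <functional>
#include <string>
#include <string_view>
#include <unordered_map>

// Hypothetical simplified callback: takes the request body, returns a reply.
using callback = std::function<std::string(std::string_view)>;
using callbacks_map = std::unordered_map<std::string_view, callback>;

// Keys are string literals with static storage, so the string_view keys never dangle.
callbacks_map make_callbacks() {
    return {
        {"GetItem", [](std::string_view body) { return "get:" + std::string(body); }},
        {"PutItem", [](std::string_view body) { return "put:" + std::string(body); }},
    };
}

// Looks up the operation by name and invokes it, or reports an unknown operation.
std::string dispatch(const callbacks_map& callbacks, std::string_view op, std::string_view body) {
    auto it = callbacks.find(op);
    if (it == callbacks.end()) {
        return "UnknownOperation";
    }
    return it->second(body);
}
```

Keying the map by `std::string_view` avoids allocating a `std::string` per lookup, at the cost of requiring that every key outlives the map.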


@@ -85,6 +85,12 @@ stats::stats() : api_operations{} {
seastar::metrics::description("number of total operations via Alternator API")),
seastar::metrics::make_total_operations("reads_before_write", reads_before_write,
seastar::metrics::description("number of performed read-before-write operations")),
seastar::metrics::make_total_operations("write_using_lwt", write_using_lwt,
seastar::metrics::description("number of writes that used LWT")),
seastar::metrics::make_total_operations("shard_bounce_for_lwt", shard_bounce_for_lwt,
seastar::metrics::description("number of writes that had to be bounced from this shard because of LWT requirements")),
seastar::metrics::make_total_operations("requests_blocked_memory", requests_blocked_memory,
seastar::metrics::description("Counts a number of requests blocked due to memory pressure.")),
seastar::metrics::make_total_operations("filtered_rows_read_total", cql_stats.filtered_rows_read_total,
seastar::metrics::description("number of rows read during filtering operations")),
seastar::metrics::make_total_operations("filtered_rows_matched_total", cql_stats.filtered_rows_matched_total,


@@ -84,6 +84,9 @@ public:
uint64_t total_operations = 0;
uint64_t unsupported_operations = 0;
uint64_t reads_before_write = 0;
uint64_t write_using_lwt = 0;
uint64_t shard_bounce_for_lwt = 0;
uint64_t requests_blocked_memory = 0;
// CQL-derived stats
cql3::cql_stats cql_stats;
private:


@@ -0,0 +1,53 @@
/*
* Copyright 2019 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU Affero General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include "serializer.hh"
#include "schema.hh"
#include "db/extensions.hh"
namespace alternator {
class tags_extension : public schema_extension {
public:
static constexpr auto NAME = "scylla_tags";
tags_extension() = default;
explicit tags_extension(const std::map<sstring, sstring>& tags) : _tags(std::move(tags)) {}
explicit tags_extension(bytes b) : _tags(tags_extension::deserialize(b)) {}
explicit tags_extension(const sstring& s) {
throw std::logic_error("Cannot create tags from string");
}
bytes serialize() const override {
return ser::serialize_to_buffer<bytes>(_tags);
}
static std::map<sstring, sstring> deserialize(bytes_view buffer) {
return ser::deserialize_from_buffer(buffer, boost::type<std::map<sstring, sstring>>());
}
const std::map<sstring, sstring>& tags() const {
return _tags;
}
private:
std::map<sstring, sstring> _tags;
};
}
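
One detail of the `tags_extension` constructor above that takes `const std::map<sstring, sstring>&`: applying `std::move` to a const reference does not move anything, because the resulting `const T&&` binds to the copy constructor. The sketch below illustrates this with a hypothetical copy-counting type (`probe`, `holder_const_ref`, and `holder_by_value` are illustrative names, not Scylla code) and contrasts it with the pass-by-value idiom:

```cpp
#include <utility>

// Hypothetical probe type that counts how often it is copied or moved.
struct probe {
    static inline int copies = 0;
    static inline int moves = 0;
    probe() = default;
    probe(const probe&) { ++copies; }
    probe(probe&&) noexcept { ++moves; }
};

struct holder_const_ref {
    // std::move on a const reference yields const probe&&, which selects the copy constructor.
    explicit holder_const_ref(const probe& p) : _p(std::move(p)) {}
    probe _p;
};

struct holder_by_value {
    // Taking the argument by value lets callers pass temporaries that are then moved in.
    explicit holder_by_value(probe p) : _p(std::move(p)) {}
    probe _p;
};
```

The behavior of the const-reference version is still correct, since it simply copies; the by-value variant only matters when callers hand over a temporary or an explicitly moved map.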


@@ -13,7 +13,7 @@
{
"method":"GET",
"summary":"get row cache save period in seconds",
"type":"int",
"type": "long",
"nickname":"get_row_cache_save_period_in_seconds",
"produces":[
"application/json"
@@ -35,7 +35,7 @@
"description":"row cache save period in seconds",
"required":true,
"allowMultiple":false,
"type":"int",
"type": "long",
"paramType":"query"
}
]
@@ -48,7 +48,7 @@
{
"method":"GET",
"summary":"get key cache save period in seconds",
"type":"int",
"type": "long",
"nickname":"get_key_cache_save_period_in_seconds",
"produces":[
"application/json"
@@ -70,7 +70,7 @@
"description":"key cache save period in seconds",
"required":true,
"allowMultiple":false,
"type":"int",
"type": "long",
"paramType":"query"
}
]
@@ -83,7 +83,7 @@
{
"method":"GET",
"summary":"get counter cache save period in seconds",
"type":"int",
"type": "long",
"nickname":"get_counter_cache_save_period_in_seconds",
"produces":[
"application/json"
@@ -105,7 +105,7 @@
"description":"counter cache save period in seconds",
"required":true,
"allowMultiple":false,
"type":"int",
"type": "long",
"paramType":"query"
}
]
@@ -118,7 +118,7 @@
{
"method":"GET",
"summary":"get row cache keys to save",
"type":"int",
"type": "long",
"nickname":"get_row_cache_keys_to_save",
"produces":[
"application/json"
@@ -140,7 +140,7 @@
"description":"row cache keys to save",
"required":true,
"allowMultiple":false,
"type":"int",
"type": "long",
"paramType":"query"
}
]
@@ -153,7 +153,7 @@
{
"method":"GET",
"summary":"get key cache keys to save",
"type":"int",
"type": "long",
"nickname":"get_key_cache_keys_to_save",
"produces":[
"application/json"
@@ -175,7 +175,7 @@
"description":"key cache keys to save",
"required":true,
"allowMultiple":false,
"type":"int",
"type": "long",
"paramType":"query"
}
]
@@ -188,7 +188,7 @@
{
"method":"GET",
"summary":"get counter cache keys to save",
"type":"int",
"type": "long",
"nickname":"get_counter_cache_keys_to_save",
"produces":[
"application/json"
@@ -210,7 +210,7 @@
"description":"counter cache keys to save",
"required":true,
"allowMultiple":false,
"type":"int",
"type": "long",
"paramType":"query"
}
]
@@ -448,7 +448,7 @@
{
"method": "GET",
"summary": "Get key entries",
"type": "int",
"type": "long",
"nickname": "get_key_entries",
"produces": [
"application/json"
@@ -568,7 +568,7 @@
{
"method": "GET",
"summary": "Get row entries",
"type": "int",
"type": "long",
"nickname": "get_row_entries",
"produces": [
"application/json"
@@ -688,7 +688,7 @@
{
"method": "GET",
"summary": "Get counter entries",
"type": "int",
"type": "long",
"nickname": "get_counter_entries",
"produces": [
"application/json"


@@ -70,7 +70,7 @@
{
"method":"POST",
"summary":"Force a major compaction of this column family",
"type":"string",
"type":"void",
"nickname":"force_major_compaction",
"produces":[
"application/json"
@@ -121,7 +121,7 @@
"description":"The minimum number of sstables in queue before compaction kicks off",
"required":true,
"allowMultiple":false,
"type":"int",
"type": "long",
"paramType":"query"
}
]
@@ -172,7 +172,7 @@
"description":"The maximum number of sstables in queue before compaction kicks off",
"required":true,
"allowMultiple":false,
"type":"int",
"type": "long",
"paramType":"query"
}
]
@@ -223,7 +223,7 @@
"description":"The maximum number of sstables in queue before compaction kicks off",
"required":true,
"allowMultiple":false,
"type":"int",
"type": "long",
"paramType":"query"
},
{
@@ -231,7 +231,7 @@
"description":"The minimum number of sstables in queue before compaction kicks off",
"required":true,
"allowMultiple":false,
"type":"int",
"type": "long",
"paramType":"query"
}
]
@@ -544,7 +544,7 @@
"summary":"sstable count for each level. empty unless leveled compaction is used",
"type":"array",
"items":{
"type":"int"
"type": "long"
},
"nickname":"get_sstable_count_per_level",
"produces":[
@@ -636,7 +636,7 @@
"description":"Duration (in milliseconds) of monitoring operation",
"required":true,
"allowMultiple":false,
"type":"int",
"type": "long",
"paramType":"query"
},
{
@@ -644,7 +644,7 @@
"description":"number of the top partitions to list",
"required":false,
"allowMultiple":false,
"type":"int",
"type": "long",
"paramType":"query"
},
{
@@ -652,7 +652,7 @@
"description":"capacity of stream summary: determines amount of resources used in query processing",
"required":false,
"allowMultiple":false,
"type":"int",
"type": "long",
"paramType":"query"
}
]
@@ -921,7 +921,7 @@
{
"method":"GET",
"summary":"Get memtable switch count",
"type":"int",
"type": "long",
"nickname":"get_memtable_switch_count",
"produces":[
"application/json"
@@ -945,7 +945,7 @@
{
"method":"GET",
"summary":"Get all memtable switch count",
"type":"int",
"type": "long",
"nickname":"get_all_memtable_switch_count",
"produces":[
"application/json"
@@ -1082,7 +1082,7 @@
{
"method":"GET",
"summary":"Get read latency",
"type":"int",
"type": "long",
"nickname":"get_read_latency",
"produces":[
"application/json"
@@ -1235,7 +1235,7 @@
{
"method":"GET",
"summary":"Get all read latency",
"type":"int",
"type": "long",
"nickname":"get_all_read_latency",
"produces":[
"application/json"
@@ -1251,7 +1251,7 @@
{
"method":"GET",
"summary":"Get range latency",
"type":"int",
"type": "long",
"nickname":"get_range_latency",
"produces":[
"application/json"
@@ -1275,7 +1275,7 @@
{
"method":"GET",
"summary":"Get all range latency",
"type":"int",
"type": "long",
"nickname":"get_all_range_latency",
"produces":[
"application/json"
@@ -1291,7 +1291,7 @@
{
"method":"GET",
"summary":"Get write latency",
"type":"int",
"type": "long",
"nickname":"get_write_latency",
"produces":[
"application/json"
@@ -1444,7 +1444,7 @@
{
"method":"GET",
"summary":"Get all write latency",
"type":"int",
"type": "long",
"nickname":"get_all_write_latency",
"produces":[
"application/json"
@@ -1460,7 +1460,7 @@
{
"method":"GET",
"summary":"Get pending flushes",
"type":"int",
"type": "long",
"nickname":"get_pending_flushes",
"produces":[
"application/json"
@@ -1484,7 +1484,7 @@
{
"method":"GET",
"summary":"Get all pending flushes",
"type":"int",
"type": "long",
"nickname":"get_all_pending_flushes",
"produces":[
"application/json"
@@ -1500,7 +1500,7 @@
{
"method":"GET",
"summary":"Get pending compactions",
"type":"int",
"type": "long",
"nickname":"get_pending_compactions",
"produces":[
"application/json"
@@ -1524,7 +1524,7 @@
{
"method":"GET",
"summary":"Get all pending compactions",
"type":"int",
"type": "long",
"nickname":"get_all_pending_compactions",
"produces":[
"application/json"
@@ -1540,7 +1540,7 @@
{
"method":"GET",
"summary":"Get live ss table count",
"type":"int",
"type": "long",
"nickname":"get_live_ss_table_count",
"produces":[
"application/json"
@@ -1564,7 +1564,7 @@
{
"method":"GET",
"summary":"Get all live ss table count",
"type":"int",
"type": "long",
"nickname":"get_all_live_ss_table_count",
"produces":[
"application/json"
@@ -1580,7 +1580,7 @@
{
"method":"GET",
"summary":"Get live disk space used",
"type":"int",
"type": "long",
"nickname":"get_live_disk_space_used",
"produces":[
"application/json"
@@ -1604,7 +1604,7 @@
{
"method":"GET",
"summary":"Get all live disk space used",
"type":"int",
"type": "long",
"nickname":"get_all_live_disk_space_used",
"produces":[
"application/json"
@@ -1620,7 +1620,7 @@
{
"method":"GET",
"summary":"Get total disk space used",
"type":"int",
"type": "long",
"nickname":"get_total_disk_space_used",
"produces":[
"application/json"
@@ -1644,7 +1644,7 @@
{
"method":"GET",
"summary":"Get all total disk space used",
"type":"int",
"type": "long",
"nickname":"get_all_total_disk_space_used",
"produces":[
"application/json"
@@ -2100,7 +2100,7 @@
{
"method":"GET",
"summary":"Get speculative retries",
"type":"int",
"type": "long",
"nickname":"get_speculative_retries",
"produces":[
"application/json"
@@ -2124,7 +2124,7 @@
{
"method":"GET",
"summary":"Get all speculative retries",
"type":"int",
"type": "long",
"nickname":"get_all_speculative_retries",
"produces":[
"application/json"
@@ -2204,7 +2204,7 @@
{
"method":"GET",
"summary":"Get row cache hit out of range",
"type":"int",
"type": "long",
"nickname":"get_row_cache_hit_out_of_range",
"produces":[
"application/json"
@@ -2228,7 +2228,7 @@
{
"method":"GET",
"summary":"Get all row cache hit out of range",
"type":"int",
"type": "long",
"nickname":"get_all_row_cache_hit_out_of_range",
"produces":[
"application/json"
@@ -2244,7 +2244,7 @@
{
"method":"GET",
"summary":"Get row cache hit",
"type":"int",
"type": "long",
"nickname":"get_row_cache_hit",
"produces":[
"application/json"
@@ -2268,7 +2268,7 @@
{
"method":"GET",
"summary":"Get all row cache hit",
"type":"int",
"type": "long",
"nickname":"get_all_row_cache_hit",
"produces":[
"application/json"
@@ -2284,7 +2284,7 @@
{
"method":"GET",
"summary":"Get row cache miss",
"type":"int",
"type": "long",
"nickname":"get_row_cache_miss",
"produces":[
"application/json"
@@ -2308,7 +2308,7 @@
{
"method":"GET",
"summary":"Get all row cache miss",
"type":"int",
"type": "long",
"nickname":"get_all_row_cache_miss",
"produces":[
"application/json"
@@ -2324,7 +2324,7 @@
{
"method":"GET",
"summary":"Get cas prepare",
"type":"int",
"type": "long",
"nickname":"get_cas_prepare",
"produces":[
"application/json"
@@ -2348,7 +2348,7 @@
{
"method":"GET",
"summary":"Get cas propose",
"type":"int",
"type": "long",
"nickname":"get_cas_propose",
"produces":[
"application/json"
@@ -2372,7 +2372,7 @@
{
"method":"GET",
"summary":"Get cas commit",
"type":"int",
"type": "long",
"nickname":"get_cas_commit",
"produces":[
"application/json"


@@ -118,7 +118,7 @@
{
"method": "GET",
"summary": "Get pending tasks",
"type": "int",
"type": "long",
"nickname": "get_pending_tasks",
"produces": [
"application/json"
@@ -181,7 +181,7 @@
{
"method": "GET",
"summary": "Get bytes compacted",
"type": "int",
"type": "long",
"nickname": "get_bytes_compacted",
"produces": [
"application/json"
@@ -197,7 +197,7 @@
"description":"A row merged information",
"properties":{
"key":{
"type":"int",
"type": "long",
"description":"The number of sstable"
},
"value":{


@@ -0,0 +1,90 @@
{
"apiVersion":"0.0.1",
"swaggerVersion":"1.2",
"basePath":"{{Protocol}}://{{Host}}",
"resourcePath":"/error_injection",
"produces":[
"application/json"
],
"apis":[
{
"path":"/v2/error_injection/injection/{injection}",
"operations":[
{
"method":"POST",
"summary":"Activate an injection that triggers an error in code",
"type":"void",
"nickname":"enable_injection",
"produces":[
"application/json"
],
"parameters":[
{
"name":"injection",
"description":"injection name, should correspond to an injection added in code",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
},
{
"name":"one_shot",
"description":"boolean flag indicating whether the injection should be enabled to trigger only once",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
}
]
},
{
"method":"DELETE",
"summary":"Deactivate an injection previously activated by the API",
"type":"void",
"nickname":"disable_injection",
"produces":[
"application/json"
],
"parameters":[
{
"name":"injection",
"description":"injection name",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
}
]
}
]
},
{
"path":"/v2/error_injection/injection",
"operations":[
{
"method":"GET",
"summary":"List all enabled injections on all shards, i.e. injections that will trigger an error in the code",
"type":"array",
"items":{
"type":"string"
},
"nickname":"get_enabled_injections_on_all",
"produces":[
"application/json"
],
"parameters":[]
},
{
"method":"DELETE",
"summary":"Deactivate all injections previously activated on all shards by the API",
"type":"void",
"nickname":"disable_on_all",
"produces":[
"application/json"
],
"parameters":[]
}
]
}
]
}


@@ -110,7 +110,7 @@
{
"method":"GET",
"summary":"Get count down endpoint",
"type":"int",
"type": "long",
"nickname":"get_down_endpoint_count",
"produces":[
"application/json"
@@ -126,7 +126,7 @@
{
"method":"GET",
"summary":"Get count up endpoint",
"type":"int",
"type": "long",
"nickname":"get_up_endpoint_count",
"produces":[
"application/json"
@@ -180,11 +180,11 @@
"description": "The endpoint address"
},
"generation": {
"type": "int",
"type": "long",
"description": "The heart beat generation"
},
"version": {
"type": "int",
"type": "long",
"description": "The heart beat version"
},
"update_time": {
@@ -209,7 +209,7 @@
"description": "Holds a version value for an application state",
"properties": {
"application_state": {
"type": "int",
"type": "long",
"description": "The application state enum index"
},
"value": {
@@ -217,7 +217,7 @@
"description": "The version value"
},
"version": {
"type": "int",
"type": "long",
"description": "The application state version"
}
}


@@ -75,7 +75,7 @@
{
"method":"GET",
"summary":"Returns files which are pending for archival attempt. Does NOT include failed archive attempts",
"type":"int",
"type": "long",
"nickname":"get_current_generation_number",
"produces":[
"application/json"
@@ -99,7 +99,7 @@
{
"method":"GET",
"summary":"Get heart beat version for a node",
"type":"int",
"type": "long",
"nickname":"get_current_heart_beat_version",
"produces":[
"application/json"


@@ -99,7 +99,7 @@
{
"method": "GET",
"summary": "Get create hint count",
"type": "int",
"type": "long",
"nickname": "get_create_hint_count",
"produces": [
"application/json"
@@ -123,7 +123,7 @@
{
"method": "GET",
"summary": "Get not stored hints count",
"type": "int",
"type": "long",
"nickname": "get_not_stored_hints_count",
"produces": [
"application/json"


@@ -191,7 +191,7 @@
{
"method":"GET",
"summary":"Get the version number",
"type":"int",
"type": "long",
"nickname":"get_version",
"produces":[
"application/json"


@@ -105,7 +105,7 @@
{
"method":"GET",
"summary":"Get the max hint window",
"type":"int",
"type": "long",
"nickname":"get_max_hint_window",
"produces":[
"application/json"
@@ -128,7 +128,7 @@
"description":"max hint window in ms",
"required":true,
"allowMultiple":false,
"type":"int",
"type": "long",
"paramType":"query"
}
]
@@ -141,7 +141,7 @@
{
"method":"GET",
"summary":"Get max hints in progress",
"type":"int",
"type": "long",
"nickname":"get_max_hints_in_progress",
"produces":[
"application/json"
@@ -164,7 +164,7 @@
"description":"max hints in progress",
"required":true,
"allowMultiple":false,
"type":"int",
"type": "long",
"paramType":"query"
}
]
@@ -177,7 +177,7 @@
{
"method":"GET",
"summary":"get hints in progress",
"type":"int",
"type": "long",
"nickname":"get_hints_in_progress",
"produces":[
"application/json"
@@ -602,7 +602,7 @@
{
"method": "GET",
"summary": "Get cas write metrics",
"type": "int",
"type": "long",
"nickname": "get_cas_write_metrics_unfinished_commit",
"produces": [
"application/json"
@@ -632,7 +632,7 @@
{
"method": "GET",
"summary": "Get cas write metrics",
"type": "int",
"type": "long",
"nickname": "get_cas_write_metrics_condition_not_met",
"produces": [
"application/json"
@@ -641,13 +641,28 @@
}
]
},
{
"path": "/storage_proxy/metrics/cas_write/failed_read_round_optimization",
"operations": [
{
"method": "GET",
"summary": "Get cas write metrics",
"type": "long",
"nickname": "get_cas_write_metrics_failed_read_round_optimization",
"produces": [
"application/json"
],
"parameters": []
}
]
},
{
"path": "/storage_proxy/metrics/cas_read/unfinished_commit",
"operations": [
{
"method": "GET",
"summary": "Get cas read metrics",
"type": "int",
"type": "long",
"nickname": "get_cas_read_metrics_unfinished_commit",
"produces": [
"application/json"
@@ -677,7 +692,7 @@
{
"method": "GET",
"summary": "Get read metrics",
"type": "int",
"type": "long",
"nickname": "get_read_metrics_timeouts",
"produces": [
"application/json"
@@ -692,7 +707,7 @@
{
"method": "GET",
"summary": "Get read metrics",
"type": "int",
"type": "long",
"nickname": "get_read_metrics_unavailables",
"produces": [
"application/json"
@@ -827,7 +842,7 @@
{
"method": "GET",
"summary": "Get range metrics",
"type": "int",
"type": "long",
"nickname": "get_range_metrics_timeouts",
"produces": [
"application/json"
@@ -842,7 +857,7 @@
{
"method": "GET",
"summary": "Get range metrics",
"type": "int",
"type": "long",
"nickname": "get_range_metrics_unavailables",
"produces": [
"application/json"
@@ -887,7 +902,7 @@
{
"method": "GET",
"summary": "Get write metrics",
"type": "int",
"type": "long",
"nickname": "get_write_metrics_timeouts",
"produces": [
"application/json"
@@ -902,7 +917,7 @@
{
"method": "GET",
"summary": "Get write metrics",
"type": "int",
"type": "long",
"nickname": "get_write_metrics_unavailables",
"produces": [
"application/json"
@@ -1008,7 +1023,7 @@
{
"method":"GET",
"summary":"Get read latency",
"type":"int",
"type": "long",
"nickname":"get_read_latency",
"produces":[
"application/json"
@@ -1040,7 +1055,7 @@
{
"method":"GET",
"summary":"Get write latency",
"type":"int",
"type": "long",
"nickname":"get_write_latency",
"produces":[
"application/json"
@@ -1072,7 +1087,7 @@
{
"method":"GET",
"summary":"Get range latency",
"type":"int",
"type": "long",
"nickname":"get_range_latency",
"produces":[
"application/json"


@@ -458,7 +458,7 @@
{
"method":"GET",
"summary":"Return the generation value for this node.",
"type":"int",
"type": "long",
"nickname":"get_current_generation_number",
"produces":[
"application/json"
@@ -582,7 +582,15 @@
},
{
"name":"kn",
"description":"Comma seperated keyspaces name to snapshot",
"description":"Comma-separated keyspace names whose snapshots will be deleted",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"cf",
"description":"an optional table name whose snapshot will be deleted",
"required":false,
"allowMultiple":false,
"type":"string",
@@ -646,7 +654,7 @@
{
"method":"POST",
"summary":"Trigger a cleanup of keys on a single keyspace",
"type":"int",
"type": "long",
"nickname":"force_keyspace_cleanup",
"produces":[
"application/json"
@@ -678,7 +686,7 @@
{
"method":"GET",
"summary":"Scrub (deserialize + reserialize at the latest version, skipping bad rows if any) the given keyspace. If columnFamilies array is empty, all CFs are scrubbed. Scrubbed CFs will be snapshotted first, if disableSnapshot is false",
"type":"int",
"type": "long",
"nickname":"scrub",
"produces":[
"application/json"
@@ -726,7 +734,7 @@
{
"method":"GET",
"summary":"Rewrite all sstables to the latest version. Unlike scrub, it doesn't skip bad rows and does not snapshot sstables first.",
"type":"int",
"type": "long",
"nickname":"upgrade_sstables",
"produces":[
"application/json"
@@ -800,7 +808,7 @@
"summary":"Return an array with the ids of the currently active repairs",
"type":"array",
"items":{
"type":"int"
"type": "long"
},
"nickname":"get_active_repair_async",
"produces":[
@@ -816,7 +824,7 @@
{
"method":"POST",
"summary":"Invoke repair asynchronously. You can track repair progress by using the get supplying id",
"type":"int",
"type": "long",
"nickname":"repair_async",
"produces":[
"application/json"
@@ -947,7 +955,7 @@
"description":"The repair ID to check for status",
"required":true,
"allowMultiple":false,
"type":"int",
"type": "long",
"paramType":"query"
}
]
@@ -1277,18 +1285,18 @@
},
{
"name":"dynamic_update_interval",
"description":"integer, in ms (default 100)",
"description":"interval in ms (default 100)",
"required":false,
"allowMultiple":false,
"type":"integer",
"type":"long",
"paramType":"query"
},
{
"name":"dynamic_reset_interval",
"description":"integer, in ms (default 600,000)",
"description":"interval in ms (default 600,000)",
"required":false,
"allowMultiple":false,
"type":"integer",
"type":"long",
"paramType":"query"
},
{
@@ -1493,7 +1501,7 @@
"description":"Stream throughput",
"required":true,
"allowMultiple":false,
"type":"int",
"type": "long",
"paramType":"query"
}
]
@@ -1501,7 +1509,7 @@
{
"method":"GET",
"summary":"Get stream throughput mb per sec",
"type":"int",
"type": "long",
"nickname":"get_stream_throughput_mb_per_sec",
"produces":[
"application/json"
@@ -1517,7 +1525,7 @@
{
"method":"GET",
"summary":"get compaction throughput mb per sec",
"type":"int",
"type": "long",
"nickname":"get_compaction_throughput_mb_per_sec",
"produces":[
"application/json"
@@ -1539,7 +1547,7 @@
"description":"compaction throughput",
"required":true,
"allowMultiple":false,
"type":"int",
"type": "long",
"paramType":"query"
}
]
@@ -1943,7 +1951,7 @@
{
"method":"GET",
"summary":"Returns the threshold for warning of queries with many tombstones",
"type":"int",
"type": "long",
"nickname":"get_tombstone_warn_threshold",
"produces":[
"application/json"
@@ -1965,7 +1973,7 @@
"description":"tombstone debug threshold",
"required":true,
"allowMultiple":false,
"type":"int",
"type": "long",
"paramType":"query"
}
]
@@ -1978,7 +1986,7 @@
{
"method":"GET",
"summary":"",
"type":"int",
"type": "long",
"nickname":"get_tombstone_failure_threshold",
"produces":[
"application/json"
@@ -2000,7 +2008,7 @@
"description":"tombstone debug threshold",
"required":true,
"allowMultiple":false,
"type":"int",
"type": "long",
"paramType":"query"
}
]
@@ -2013,7 +2021,7 @@
{
"method":"GET",
"summary":"Returns the threshold for rejecting queries due to a large batch size",
"type":"int",
"type": "long",
"nickname":"get_batch_size_failure_threshold",
"produces":[
"application/json"
@@ -2035,7 +2043,7 @@
"description":"batch size debug threshold",
"required":true,
"allowMultiple":false,
"type":"int",
"type": "long",
"paramType":"query"
}
]
@@ -2059,7 +2067,7 @@
"description":"throttle in kb",
"required":true,
"allowMultiple":false,
"type":"int",
"type": "long",
"paramType":"query"
}
]
@@ -2072,7 +2080,7 @@
{
"method":"GET",
"summary":"Get load",
"type":"int",
"type": "long",
"nickname":"get_metrics_load",
"produces":[
"application/json"
@@ -2088,7 +2096,7 @@
{
"method":"GET",
"summary":"Get exceptions",
"type":"int",
"type": "long",
"nickname":"get_exceptions",
"produces":[
"application/json"
@@ -2104,7 +2112,7 @@
{
"method":"GET",
"summary":"Get total hints in progress",
"type":"int",
"type": "long",
"nickname":"get_total_hints_in_progress",
"produces":[
"application/json"
@@ -2120,7 +2128,7 @@
{
"method":"GET",
"summary":"Get total hints",
"type":"int",
"type": "long",
"nickname":"get_total_hints",
"produces":[
"application/json"


@@ -32,7 +32,7 @@
{
"method":"GET",
"summary":"Get number of active outbound streams",
"type":"int",
"type": "long",
"nickname":"get_all_active_streams_outbound",
"produces":[
"application/json"
@@ -48,7 +48,7 @@
{
"method":"GET",
"summary":"Get total incoming bytes",
"type":"int",
"type": "long",
"nickname":"get_total_incoming_bytes",
"produces":[
"application/json"
@@ -72,7 +72,7 @@
{
"method":"GET",
"summary":"Get all total incoming bytes",
"type":"int",
"type": "long",
"nickname":"get_all_total_incoming_bytes",
"produces":[
"application/json"
@@ -88,7 +88,7 @@
{
"method":"GET",
"summary":"Get total outgoing bytes",
"type":"int",
"type": "long",
"nickname":"get_total_outgoing_bytes",
"produces":[
"application/json"
@@ -112,7 +112,7 @@
{
"method":"GET",
"summary":"Get all total outgoing bytes",
"type":"int",
"type": "long",
"nickname":"get_all_total_outgoing_bytes",
"produces":[
"application/json"
@@ -154,7 +154,7 @@
"description":"The peer"
},
"session_index":{
"type":"int",
"type": "long",
"description":"The session index"
},
"connecting":{
@@ -211,7 +211,7 @@
"description":"The ID"
},
"files":{
"type":"int",
"type": "long",
"description":"Number of files to transfer. Can be 0 if nothing to transfer for some streaming request."
},
"total_size":{
@@ -242,7 +242,7 @@
"description":"The peer address"
},
"session_index":{
"type":"int",
"type": "long",
"description":"The session index"
},
"file_name":{


@@ -52,6 +52,21 @@
}
]
},
{
"path":"/system/uptime_ms",
"operations":[
{
"method":"GET",
"summary":"Get system uptime, in milliseconds",
"type":"long",
"nickname":"get_system_uptime",
"produces":[
"application/json"
],
"parameters":[]
}
]
},
{
"path":"/system/logger/{name}",
"operations":[


@@ -36,6 +36,7 @@
#include "endpoint_snitch.hh"
#include "compaction_manager.hh"
#include "hinted_handoff.hh"
#include "error_injection.hh"
#include <seastar/http/exception.hh>
#include "stream_manager.hh"
#include "system.hh"
@@ -68,13 +69,19 @@ future<> set_server_init(http_context& ctx) {
rb->set_api_doc(r);
rb02->set_api_doc(r);
rb02->register_api_file(r, "swagger20_header");
set_config(rb02, ctx, r);
rb->register_function(r, "system",
"The system related API");
set_system(ctx, r);
});
}
future<> set_server_config(http_context& ctx) {
auto rb02 = std::make_shared < api_registry_builder20 > (ctx.api_doc, "/v2");
return ctx.http_server.set_routes([&ctx, rb02](routes& r) {
set_config(rb02, ctx, r);
});
}
static future<> register_api(http_context& ctx, const sstring& api_name,
const sstring api_desc,
std::function<void(http_context& ctx, routes& r)> f) {
@@ -90,6 +97,10 @@ future<> set_server_storage_service(http_context& ctx) {
return register_api(ctx, "storage_service", "The storage service API", set_storage_service);
}
future<> set_server_snapshot(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { set_snapshot(ctx, r); });
}
future<> set_server_snitch(http_context& ctx) {
return register_api(ctx, "endpoint_snitch_info", "The endpoint snitch info API", set_endpoint_snitch);
}
@@ -153,6 +164,9 @@ future<> set_server_done(http_context& ctx) {
rb->register_function(r, "collectd",
"The collectd API");
set_collectd(ctx, r);
rb->register_function(r, "error_injection",
"The error injection API");
set_error_injection(ctx, r);
});
}


@@ -23,6 +23,9 @@
#include "service/storage_proxy.hh"
#include <seastar/http/httpd.hh>
namespace service { class load_meter; }
namespace locator { class token_metadata; }
namespace api {
struct http_context {
@@ -31,15 +34,21 @@ struct http_context {
httpd::http_server_control http_server;
distributed<database>& db;
distributed<service::storage_proxy>& sp;
service::load_meter& lmeter;
sharded<locator::token_metadata>& token_metadata;
http_context(distributed<database>& _db,
distributed<service::storage_proxy>& _sp)
: db(_db), sp(_sp) {
distributed<service::storage_proxy>& _sp,
service::load_meter& _lm, sharded<locator::token_metadata>& _tm)
: db(_db), sp(_sp), lmeter(_lm), token_metadata(_tm) {
}
};
future<> set_server_init(http_context& ctx);
future<> set_server_config(http_context& ctx);
future<> set_server_snitch(http_context& ctx);
future<> set_server_storage_service(http_context& ctx);
future<> set_server_snapshot(http_context& ctx);
future<> set_server_gossip(http_context& ctx);
future<> set_server_load_sstable(http_context& ctx);
future<> set_server_messaging_service(http_context& ctx);
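The `http_context` change above passes `load_meter` and `token_metadata` in through the constructor as references, and handlers later read them through `ctx` instead of a global accessor. A minimal sketch of that pattern follows. The types `token_metadata_stub` and `context_stub` and the function `count_tokens` are stand-ins invented for illustration; they are not Scylla classes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for locator::token_metadata; purely illustrative.
struct token_metadata_stub {
    std::vector<long> tokens;
    const std::vector<long>& sorted_tokens() const { return tokens; }
};

// Stand-in for api::http_context: it holds a reference and does not own
// the object, so the referenced object must outlive the context.
struct context_stub {
    token_metadata_stub& token_metadata;
    explicit context_stub(token_metadata_stub& tm) : token_metadata(tm) {}
};

// A handler that depends only on the context it is given,
// not on a process-wide singleton.
std::size_t count_tokens(const context_stub& ctx) {
    return ctx.token_metadata.sorted_tokens().size();
}
```

Because the context stores a reference, later changes to the underlying object are visible to every handler that received the same context, which also makes it straightforward to substitute a test double.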


@@ -64,7 +64,7 @@ static const char* str_to_regex(const sstring& v) {
void set_collectd(http_context& ctx, routes& r) {
cd::get_collectd.set(r, [&ctx](std::unique_ptr<request> req) {
auto id = make_shared<scollectd::type_instance_id>(req->param["pluginid"],
auto id = ::make_shared<scollectd::type_instance_id>(req->param["pluginid"],
req->get_query_param("instance"), req->get_query_param("type"),
req->get_query_param("type_instance"));


@@ -994,5 +994,15 @@ void set_column_family(http_context& ctx, routes& r) {
});
});
cf::force_major_compaction.set(r, [&ctx](std::unique_ptr<request> req) {
if (req->get_query_param("split_output") != "") {
fail(unimplemented::cause::API);
}
return foreach_column_family(ctx, req->param["name"], [](column_family &cf) {
return cf.compact_all_sstables();
}).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
}
}

api/error_injection.cc Normal file

@@ -0,0 +1,66 @@
/*
* Copyright (C) 2020 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "api/api-doc/error_injection.json.hh"
#include "api/api.hh"
#include <seastar/http/exception.hh>
#include "log.hh"
#include "utils/error_injection.hh"
#include "seastar/core/future-util.hh"
namespace api {
namespace hf = httpd::error_injection_json;
void set_error_injection(http_context& ctx, routes& r) {
hf::enable_injection.set(r, [](std::unique_ptr<request> req) {
sstring injection = req->param["injection"];
bool one_shot = req->get_query_param("one_shot") == "True";
auto& errinj = utils::get_local_injector();
errinj.enable_on_all(injection, one_shot);
return make_ready_future<json::json_return_type>(json::json_void());
});
hf::get_enabled_injections_on_all.set(r, [](std::unique_ptr<request> req) {
auto& errinj = utils::get_local_injector();
auto ret = errinj.enabled_injections_on_all();
return make_ready_future<json::json_return_type>(ret);
});
hf::disable_injection.set(r, [](std::unique_ptr<request> req) {
sstring injection = req->param["injection"];
auto& errinj = utils::get_local_injector();
errinj.disable_on_all(injection);
return make_ready_future<json::json_return_type>(json::json_void());
});
hf::disable_on_all.set(r, [](std::unique_ptr<request> req) {
auto& errinj = utils::get_local_injector();
errinj.disable_on_all();
return make_ready_future<json::json_return_type>(json::json_void());
});
}
} // namespace api
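The `enable_injection` handler above treats `one_shot` as set only when the query parameter equals the exact string `"True"`. The sketch below isolates that comparison in a hypothetical helper, `parse_one_shot`, which is not part of the diff, to show that other spellings fall through to `false`.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not part of the diff: mirrors the comparison
// `req->get_query_param("one_shot") == "True"` used by enable_injection.
bool parse_one_shot(const std::string& query_value) {
    // Exact, case-sensitive match. Any other value, such as "true",
    // "1", or an empty string, yields false.
    return query_value == "True";
}
```

A client that sends `one_shot=true` in lowercase would therefore get a persistent injection rather than a one-shot one, assuming the handler behaves as this sketch does.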

api/error_injection.hh Normal file

@@ -0,0 +1,30 @@
/*
* Copyright (C) 2019 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include "api.hh"
namespace api {
void set_error_injection(http_context& ctx, routes& r);
}


@@ -27,6 +27,7 @@
#include "db/config.hh"
#include "utils/histogram.hh"
#include "database.hh"
#include "seastar/core/scheduling_specific.hh"
namespace api {
@@ -34,12 +35,70 @@ namespace sp = httpd::storage_proxy_json;
using proxy = service::storage_proxy;
using namespace json;
static future<utils::rate_moving_average> sum_timed_rate(distributed<proxy>& d, utils::timed_rate_moving_average proxy::stats::*f) {
return d.map_reduce0([f](const proxy& p) {return (p.get_stats().*f).rate();}, utils::rate_moving_average(),
std::plus<utils::rate_moving_average>());
/**
* This function implements a two-dimensional map reduce where
* the first level is a distributed storage_proxy class and the
* second level is the stats per scheduling group class.
* @param d - a reference to the storage_proxy distributed class.
* @param mapper - the internal mapper that is used to map the internal
* stat class into a value of type `V`.
* @param reducer - the reducer that is used in both outer and inner
* aggregations.
* @param initial_value - the initial value to use for both aggregations
* @return A future that resolves to the result of the aggregation.
*/
template<typename V, typename Reducer, typename InnerMapper>
future<V> two_dimensional_map_reduce(distributed<service::storage_proxy>& d,
InnerMapper mapper, Reducer reducer, V initial_value) {
return d.map_reduce0( [mapper, reducer, initial_value] (const service::storage_proxy& sp) {
return map_reduce_scheduling_group_specific<service::storage_proxy_stats::stats>(
mapper, reducer, initial_value, sp.get_stats_key());
}, initial_value, reducer);
}
static future<json::json_return_type> sum_timed_rate_as_obj(distributed<proxy>& d, utils::timed_rate_moving_average proxy::stats::*f) {
/**
* This function implements a two-dimensional map reduce where
* the first level is a distributed storage_proxy class and the
* second level is the stats per scheduling group class.
* @param d - a reference to the storage_proxy distributed class.
* @param f - a field pointer which is the implicit internal mapper.
* @param reducer - the reducer that is used in both outer and inner
* aggregations.
* @param initial_value - the initial value to use for both aggregations
* @return A future that resolves to the result of the aggregation.
*/
template<typename V, typename Reducer, typename F>
future<V> two_dimensional_map_reduce(distributed<service::storage_proxy>& d,
V F::*f, Reducer reducer, V initial_value) {
return two_dimensional_map_reduce(d, [f] (F& stats) {
return stats.*f;
}, reducer, initial_value);
}
/**
* A partial specialization of sum_stats for the storage proxy
* case, where the get stats function doesn't return a
* stats object with fields but a per-scheduling-group
* stats object. The name was also changed, since partial
* specialization of functions is not supported in C++.
*
*/
template<typename V, typename F>
future<json::json_return_type> sum_stats_storage_proxy(distributed<proxy>& d, V F::*f) {
return two_dimensional_map_reduce(d, [f] (F& stats) { return stats.*f; }, std::plus<V>(), V(0)).then([] (V val) {
return make_ready_future<json::json_return_type>(val);
});
}
static future<utils::rate_moving_average> sum_timed_rate(distributed<proxy>& d, utils::timed_rate_moving_average service::storage_proxy_stats::stats::*f) {
return two_dimensional_map_reduce(d, [f] (service::storage_proxy_stats::stats& stats) {
return (stats.*f).rate();
}, std::plus<utils::rate_moving_average>(), utils::rate_moving_average());
}
static future<json::json_return_type> sum_timed_rate_as_obj(distributed<proxy>& d, utils::timed_rate_moving_average service::storage_proxy_stats::stats::*f) {
return sum_timed_rate(d, f).then([](const utils::rate_moving_average& val) {
httpd::utils_json::rate_moving_average m;
m = val;
@@ -51,29 +110,72 @@ httpd::utils_json::rate_moving_average_and_histogram get_empty_moving_average()
return timer_to_json(utils::rate_moving_average_and_histogram());
}
static future<json::json_return_type> sum_timed_rate_as_long(distributed<proxy>& d, utils::timed_rate_moving_average proxy::stats::*f) {
static future<json::json_return_type> sum_timed_rate_as_long(distributed<proxy>& d, utils::timed_rate_moving_average service::storage_proxy_stats::stats::*f) {
return sum_timed_rate(d, f).then([](const utils::rate_moving_average& val) {
return make_ready_future<json::json_return_type>(val.count);
});
}
static future<json::json_return_type> sum_estimated_histogram(http_context& ctx, utils::estimated_histogram proxy::stats::*f) {
return ctx.sp.map_reduce0([f](const proxy& p) {return p.get_stats().*f;}, utils::estimated_histogram(),
utils::estimated_histogram_merge).then([](const utils::estimated_histogram& val) {
static future<json::json_return_type> sum_estimated_histogram(http_context& ctx, utils::estimated_histogram service::storage_proxy_stats::stats::*f) {
return two_dimensional_map_reduce(ctx.sp, f, utils::estimated_histogram_merge,
utils::estimated_histogram()).then([](const utils::estimated_histogram& val) {
utils_json::estimated_histogram res;
res = val;
return make_ready_future<json::json_return_type>(res);
});
}
static future<json::json_return_type> total_latency(http_context& ctx, utils::timed_rate_moving_average_and_histogram proxy::stats::*f) {
return ctx.sp.map_reduce0([f](const proxy& p) {return (p.get_stats().*f).hist.mean * (p.get_stats().*f).hist.count;}, 0.0,
std::plus<double>()).then([](double val) {
static future<json::json_return_type> total_latency(http_context& ctx, utils::timed_rate_moving_average_and_histogram service::storage_proxy_stats::stats::*f) {
return two_dimensional_map_reduce(ctx.sp, [f] (service::storage_proxy_stats::stats& stats) {
return (stats.*f).hist.mean * (stats.*f).hist.count;
}, std::plus<double>(), 0.0).then([](double val) {
int64_t res = val;
return make_ready_future<json::json_return_type>(res);
});
}
/**
* A partial specialization of sum_histogram_stats
* for the storage proxy case, where the get stats
* function doesn't return a stats object with
* fields but a per-scheduling-group stats object.
* The name was also changed, since partial
* specialization of functions is not supported in C++.
*/
template<typename F>
future<json::json_return_type>
sum_histogram_stats_storage_proxy(distributed<proxy>& d,
utils::timed_rate_moving_average_and_histogram F::*f) {
return two_dimensional_map_reduce(d, [f] (service::storage_proxy_stats::stats& stats) {
return (stats.*f).hist;
}, std::plus<utils::ihistogram>(), utils::ihistogram()).
then([](const utils::ihistogram& val) {
return make_ready_future<json::json_return_type>(to_json(val));
});
}
/**
* A partial specialization of sum_timer_stats for the
* storage proxy case, where the get stats function
* doesn't return a stats object with fields but a
* per-scheduling-group stats object. The name
* was also changed, since partial specialization of
* functions is not supported in C++.
*/
template<typename F>
future<json::json_return_type>
sum_timer_stats_storage_proxy(distributed<proxy>& d,
utils::timed_rate_moving_average_and_histogram F::*f) {
return two_dimensional_map_reduce(d, [f] (service::storage_proxy_stats::stats& stats) {
return (stats.*f).rate();
}, std::plus<utils::rate_moving_average_and_histogram>(),
utils::rate_moving_average_and_histogram()).then([](const utils::rate_moving_average_and_histogram& val) {
return make_ready_future<json::json_return_type>(timer_to_json(val));
});
}
void set_storage_proxy(http_context& ctx, routes& r) {
sp::get_total_hints.set(r, [](std::unique_ptr<request> req) {
//TBD
@@ -223,15 +325,15 @@ void set_storage_proxy(http_context& ctx, routes& r) {
});
sp::get_read_repair_attempted.set(r, [&ctx](std::unique_ptr<request> req) {
return sum_stats(ctx.sp, &proxy::stats::read_repair_attempts);
return sum_stats_storage_proxy(ctx.sp, &service::storage_proxy_stats::stats::read_repair_attempts);
});
sp::get_read_repair_repaired_blocking.set(r, [&ctx](std::unique_ptr<request> req) {
return sum_stats(ctx.sp, &proxy::stats::read_repair_repaired_blocking);
return sum_stats_storage_proxy(ctx.sp, &service::storage_proxy_stats::stats::read_repair_repaired_blocking);
});
sp::get_read_repair_repaired_background.set(r, [&ctx](std::unique_ptr<request> req) {
return sum_stats(ctx.sp, &proxy::stats::read_repair_repaired_background);
return sum_stats_storage_proxy(ctx.sp, &service::storage_proxy_stats::stats::read_repair_repaired_background);
});
sp::get_schema_versions.set(r, [](std::unique_ptr<request> req) {
@@ -275,6 +377,10 @@ void set_storage_proxy(http_context& ctx, routes& r) {
return sum_stats(ctx.sp, &proxy::stats::cas_write_condition_not_met);
});
sp::get_cas_write_metrics_failed_read_round_optimization.set(r, [&ctx](std::unique_ptr<request> req) {
return sum_stats(ctx.sp, &proxy::stats::cas_failed_read_round_optimization);
});
sp::get_cas_read_metrics_unfinished_commit.set(r, [&ctx](std::unique_ptr<request> req) {
return sum_stats(ctx.sp, &proxy::stats::cas_read_unfinished_commit);
});
@@ -284,71 +390,71 @@ void set_storage_proxy(http_context& ctx, routes& r) {
});
sp::get_read_metrics_timeouts.set(r, [&ctx](std::unique_ptr<request> req) {
return sum_timed_rate_as_long(ctx.sp, &proxy::stats::read_timeouts);
return sum_timed_rate_as_long(ctx.sp, &service::storage_proxy_stats::stats::read_timeouts);
});
sp::get_read_metrics_unavailables.set(r, [&ctx](std::unique_ptr<request> req) {
return sum_timed_rate_as_long(ctx.sp, &proxy::stats::read_unavailables);
return sum_timed_rate_as_long(ctx.sp, &service::storage_proxy_stats::stats::read_unavailables);
});
sp::get_range_metrics_timeouts.set(r, [&ctx](std::unique_ptr<request> req) {
return sum_timed_rate_as_long(ctx.sp, &proxy::stats::range_slice_timeouts);
return sum_timed_rate_as_long(ctx.sp, &service::storage_proxy_stats::stats::range_slice_timeouts);
});
sp::get_range_metrics_unavailables.set(r, [&ctx](std::unique_ptr<request> req) {
return sum_timed_rate_as_long(ctx.sp, &proxy::stats::range_slice_unavailables);
return sum_timed_rate_as_long(ctx.sp, &service::storage_proxy_stats::stats::range_slice_unavailables);
});
sp::get_write_metrics_timeouts.set(r, [&ctx](std::unique_ptr<request> req) {
return sum_timed_rate_as_long(ctx.sp, &proxy::stats::write_timeouts);
return sum_timed_rate_as_long(ctx.sp, &service::storage_proxy_stats::stats::write_timeouts);
});
sp::get_write_metrics_unavailables.set(r, [&ctx](std::unique_ptr<request> req) {
return sum_timed_rate_as_long(ctx.sp, &proxy::stats::write_unavailables);
return sum_timed_rate_as_long(ctx.sp, &service::storage_proxy_stats::stats::write_unavailables);
});
sp::get_read_metrics_timeouts_rates.set(r, [&ctx](std::unique_ptr<request> req) {
return sum_timed_rate_as_obj(ctx.sp, &proxy::stats::read_timeouts);
return sum_timed_rate_as_obj(ctx.sp, &service::storage_proxy_stats::stats::read_timeouts);
});
sp::get_read_metrics_unavailables_rates.set(r, [&ctx](std::unique_ptr<request> req) {
return sum_timed_rate_as_obj(ctx.sp, &proxy::stats::read_unavailables);
return sum_timed_rate_as_obj(ctx.sp, &service::storage_proxy_stats::stats::read_unavailables);
});
sp::get_range_metrics_timeouts_rates.set(r, [&ctx](std::unique_ptr<request> req) {
return sum_timed_rate_as_obj(ctx.sp, &proxy::stats::range_slice_timeouts);
return sum_timed_rate_as_obj(ctx.sp, &service::storage_proxy_stats::stats::range_slice_timeouts);
});
sp::get_range_metrics_unavailables_rates.set(r, [&ctx](std::unique_ptr<request> req) {
return sum_timed_rate_as_obj(ctx.sp, &proxy::stats::range_slice_unavailables);
return sum_timed_rate_as_obj(ctx.sp, &service::storage_proxy_stats::stats::range_slice_unavailables);
});
sp::get_write_metrics_timeouts_rates.set(r, [&ctx](std::unique_ptr<request> req) {
return sum_timed_rate_as_obj(ctx.sp, &proxy::stats::write_timeouts);
return sum_timed_rate_as_obj(ctx.sp, &service::storage_proxy_stats::stats::write_timeouts);
});
sp::get_write_metrics_unavailables_rates.set(r, [&ctx](std::unique_ptr<request> req) {
return sum_timed_rate_as_obj(ctx.sp, &proxy::stats::write_unavailables);
return sum_timed_rate_as_obj(ctx.sp, &service::storage_proxy_stats::stats::write_unavailables);
});
sp::get_range_metrics_latency_histogram_depricated.set(r, [&ctx](std::unique_ptr<request> req) {
return sum_histogram_stats(ctx.sp, &proxy::stats::range);
return sum_histogram_stats_storage_proxy(ctx.sp, &service::storage_proxy_stats::stats::range);
});
sp::get_write_metrics_latency_histogram_depricated.set(r, [&ctx](std::unique_ptr<request> req) {
return sum_histogram_stats(ctx.sp, &proxy::stats::write);
return sum_histogram_stats_storage_proxy(ctx.sp, &service::storage_proxy_stats::stats::write);
});
sp::get_read_metrics_latency_histogram_depricated.set(r, [&ctx](std::unique_ptr<request> req) {
return sum_histogram_stats(ctx.sp, &proxy::stats::read);
return sum_histogram_stats_storage_proxy(ctx.sp, &service::storage_proxy_stats::stats::read);
});
sp::get_range_metrics_latency_histogram.set(r, [&ctx](std::unique_ptr<request> req) {
return sum_timer_stats(ctx.sp, &proxy::stats::range);
return sum_timer_stats_storage_proxy(ctx.sp, &service::storage_proxy_stats::stats::range);
});
sp::get_write_metrics_latency_histogram.set(r, [&ctx](std::unique_ptr<request> req) {
return sum_timer_stats(ctx.sp, &proxy::stats::write);
return sum_timer_stats_storage_proxy(ctx.sp, &service::storage_proxy_stats::stats::write);
});
sp::get_cas_write_metrics_latency_histogram.set(r, [&ctx](std::unique_ptr<request> req) {
return sum_timer_stats(ctx.sp, &proxy::stats::cas_write);
@@ -367,30 +473,30 @@ void set_storage_proxy(http_context& ctx, routes& r) {
});
sp::get_read_metrics_latency_histogram.set(r, [&ctx](std::unique_ptr<request> req) {
return sum_timer_stats(ctx.sp, &proxy::stats::read);
return sum_timer_stats_storage_proxy(ctx.sp, &service::storage_proxy_stats::stats::read);
});
sp::get_read_estimated_histogram.set(r, [&ctx](std::unique_ptr<request> req) {
return sum_estimated_histogram(ctx, &proxy::stats::estimated_read);
return sum_estimated_histogram(ctx, &service::storage_proxy_stats::stats::estimated_read);
});
sp::get_read_latency.set(r, [&ctx](std::unique_ptr<request> req) {
return total_latency(ctx, &proxy::stats::read);
return total_latency(ctx, &service::storage_proxy_stats::stats::read);
});
sp::get_write_estimated_histogram.set(r, [&ctx](std::unique_ptr<request> req) {
return sum_estimated_histogram(ctx, &proxy::stats::estimated_write);
return sum_estimated_histogram(ctx, &service::storage_proxy_stats::stats::estimated_write);
});
sp::get_write_latency.set(r, [&ctx](std::unique_ptr<request> req) {
return total_latency(ctx, &proxy::stats::write);
return total_latency(ctx, &service::storage_proxy_stats::stats::write);
});
sp::get_range_estimated_histogram.set(r, [&ctx](std::unique_ptr<request> req) {
return sum_timer_stats(ctx.sp, &proxy::stats::range);
return sum_timer_stats_storage_proxy(ctx.sp, &service::storage_proxy_stats::stats::range);
});
sp::get_range_latency.set(r, [&ctx](std::unique_ptr<request> req) {
return total_latency(ctx, &proxy::stats::range);
return total_latency(ctx, &service::storage_proxy_stats::stats::range);
});
}
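The `two_dimensional_map_reduce` helpers in this file fold per-scheduling-group stats inside each shard, then fold the per-shard results, using one reducer and one initial value for both levels. The sketch below is a synchronous, Seastar-free analogue of that idea, not the code from the diff. The names `stats_stub`, `shard_stub`, and `two_level_map_reduce` are assumptions made for illustration.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Stand-in for service::storage_proxy_stats::stats; illustrative only.
struct stats_stub {
    long read_repair_attempts;
};

// Each inner vector plays the role of one shard's per-scheduling-group stats.
using shard_stub = std::vector<stats_stub>;

// Synchronous analogue of the two-level fold: the same reducer and initial
// value are used for the inner (per-group) and outer (per-shard) passes.
template <typename V, typename Reducer, typename InnerMapper>
V two_level_map_reduce(const std::vector<shard_stub>& shards,
                       InnerMapper mapper, Reducer reducer, V initial_value) {
    V total = initial_value;
    for (const shard_stub& shard : shards) {
        V per_shard = initial_value;
        for (const stats_stub& s : shard) {
            per_shard = reducer(per_shard, mapper(s));
        }
        total = reducer(total, per_shard);
    }
    return total;
}

// Field-pointer overload, mirroring the `V F::*f` variant: the member
// pointer acts as the mapper.
template <typename V, typename Reducer>
V two_level_map_reduce(const std::vector<shard_stub>& shards,
                       V stats_stub::*field, Reducer reducer, V initial_value) {
    return two_level_map_reduce(
        shards, [field](const stats_stub& s) { return s.*field; },
        reducer, initial_value);
}
```

In this sketch the initial value is applied once per shard and once at the top level, so it should be an identity for the reducer (for example `0` with `std::plus`); otherwise it would be counted more than once.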


@@ -27,6 +27,7 @@
#include <boost/range/adaptor/map.hpp>
#include <boost/range/adaptor/filtered.hpp>
#include "service/storage_service.hh"
#include "service/load_meter.hh"
#include "db/commitlog/commitlog.hh"
#include "gms/gossiper.hh"
#include "db/system_keyspace.hh"
@@ -41,8 +42,6 @@
#include "database.hh"
#include "db/extensions.hh"
sstables::sstable::version_types get_highest_supported_format();
namespace api {
namespace ss = httpd::storage_service_json;
@@ -55,57 +54,53 @@ static sstring validate_keyspace(http_context& ctx, const parameters& param) {
throw bad_param_exception("Keyspace " + param["keyspace"] + " Does not exist");
}
static std::vector<ss::token_range> describe_ring(const sstring& keyspace) {
std::vector<ss::token_range> res;
for (auto d : service::get_local_storage_service().describe_ring(keyspace)) {
ss::token_range r;
r.start_token = d._start_token;
r.end_token = d._end_token;
r.endpoints = d._endpoints;
r.rpc_endpoints = d._rpc_endpoints;
for (auto det : d._endpoint_details) {
ss::endpoint_detail ed;
ed.host = det._host;
ed.datacenter = det._datacenter;
if (det._rack != "") {
ed.rack = det._rack;
}
r.endpoint_details.push(ed);
static ss::token_range token_range_endpoints_to_json(const dht::token_range_endpoints& d) {
ss::token_range r;
r.start_token = d._start_token;
r.end_token = d._end_token;
r.endpoints = d._endpoints;
r.rpc_endpoints = d._rpc_endpoints;
for (auto det : d._endpoint_details) {
ss::endpoint_detail ed;
ed.host = det._host;
ed.datacenter = det._datacenter;
if (det._rack != "") {
ed.rack = det._rack;
}
res.push_back(r);
r.endpoint_details.push(ed);
}
return res;
return r;
}
using ks_cf_func = std::function<future<json::json_return_type>(http_context&, std::unique_ptr<request>, sstring, std::vector<sstring>)>;
static auto wrap_ks_cf(http_context &ctx, ks_cf_func f) {
return [&ctx, f = std::move(f)](std::unique_ptr<request> req) {
auto keyspace = validate_keyspace(ctx, req->param);
auto column_families = split_cf(req->get_query_param("cf"));
if (column_families.empty()) {
column_families = map_keys(ctx.db.local().find_keyspace(keyspace).metadata().get()->cf_meta_data());
}
return f(ctx, std::move(req), std::move(keyspace), std::move(column_families));
};
}
void set_storage_service(http_context& ctx, routes& r) {
using ks_cf_func = std::function<future<json::json_return_type>(std::unique_ptr<request>, sstring, std::vector<sstring>)>;
auto wrap_ks_cf = [&ctx](ks_cf_func f) {
return [&ctx, f = std::move(f)](std::unique_ptr<request> req) {
auto keyspace = validate_keyspace(ctx, req->param);
auto column_families = split_cf(req->get_query_param("cf"));
if (column_families.empty()) {
column_families = map_keys(ctx.db.local().find_keyspace(keyspace).metadata().get()->cf_meta_data());
}
return f(std::move(req), std::move(keyspace), std::move(column_families));
};
};
ss::local_hostid.set(r, [](std::unique_ptr<request> req) {
return db::system_keyspace::get_local_host_id().then([](const utils::UUID& id) {
return make_ready_future<json::json_return_type>(id.to_sstring());
});
});
ss::get_tokens.set(r, [] (std::unique_ptr<request> req) {
return make_ready_future<json::json_return_type>(stream_range_as_array(service::get_local_storage_service().get_token_metadata().sorted_tokens(), [](const dht::token& i) {
ss::get_tokens.set(r, [&ctx] (std::unique_ptr<request> req) {
return make_ready_future<json::json_return_type>(stream_range_as_array(ctx.token_metadata.local().sorted_tokens(), [](const dht::token& i) {
return boost::lexical_cast<std::string>(i);
}));
});
ss::get_node_tokens.set(r, [] (std::unique_ptr<request> req) {
ss::get_node_tokens.set(r, [&ctx] (std::unique_ptr<request> req) {
gms::inet_address addr(req->param["endpoint"]);
return make_ready_future<json::json_return_type>(stream_range_as_array(service::get_local_storage_service().get_token_metadata().get_tokens(addr), [](const dht::token& i) {
return make_ready_future<json::json_return_type>(stream_range_as_array(ctx.token_metadata.local().get_tokens(addr), [](const dht::token& i) {
return boost::lexical_cast<std::string>(i);
}));
});
@@ -123,8 +118,8 @@ void set_storage_service(http_context& ctx, routes& r) {
}));
});
ss::get_leaving_nodes.set(r, [](const_req req) {
return container_to_vec(service::get_local_storage_service().get_token_metadata().get_leaving_endpoints());
ss::get_leaving_nodes.set(r, [&ctx](const_req req) {
return container_to_vec(ctx.token_metadata.local().get_leaving_endpoints());
});
ss::get_moving_nodes.set(r, [](const_req req) {
@@ -132,8 +127,8 @@ void set_storage_service(http_context& ctx, routes& r) {
return container_to_vec(addr);
});
ss::get_joining_nodes.set(r, [](const_req req) {
auto points = service::get_local_storage_service().get_token_metadata().get_bootstrap_tokens();
ss::get_joining_nodes.set(r, [&ctx](const_req req) {
auto points = ctx.token_metadata.local().get_bootstrap_tokens();
std::unordered_set<sstring> addr;
for (auto i: points) {
addr.insert(boost::lexical_cast<std::string>(i.second));
@@ -176,27 +171,26 @@ void set_storage_service(http_context& ctx, routes& r) {
return make_ready_future<json::json_return_type>(res);
});
ss::describe_any_ring.set(r, [&ctx](const_req req) {
return describe_ring("");
ss::describe_any_ring.set(r, [&ctx](std::unique_ptr<request> req) {
return make_ready_future<json::json_return_type>(stream_range_as_array(service::get_local_storage_service().describe_ring(""), token_range_endpoints_to_json));
});
ss::describe_ring.set(r, [&ctx](const_req req) {
auto keyspace = validate_keyspace(ctx, req.param);
return describe_ring(keyspace);
ss::describe_ring.set(r, [&ctx](std::unique_ptr<request> req) {
auto keyspace = validate_keyspace(ctx, req->param);
return make_ready_future<json::json_return_type>(stream_range_as_array(service::get_local_storage_service().describe_ring(keyspace), token_range_endpoints_to_json));
});
ss::get_host_id_map.set(r, [](const_req req) {
ss::get_host_id_map.set(r, [&ctx](const_req req) {
std::vector<ss::mapper> res;
return map_to_key_value(service::get_local_storage_service().
get_token_metadata().get_endpoint_to_host_id_map_for_reading(), res);
return map_to_key_value(ctx.token_metadata.local().get_endpoint_to_host_id_map_for_reading(), res);
});
ss::get_load.set(r, [&ctx](std::unique_ptr<request> req) {
return get_cf_stats(ctx, &column_family_stats::live_disk_space_used);
});
ss::get_load_map.set(r, [] (std::unique_ptr<request> req) {
return service::get_local_storage_service().get_load_map().then([] (auto&& load_map) {
ss::get_load_map.set(r, [&ctx] (std::unique_ptr<request> req) {
return ctx.lmeter.get_load_map().then([] (auto&& load_map) {
std::vector<ss::map_string_double> res;
for (auto i : load_map) {
ss::map_string_double val;
@@ -221,67 +215,6 @@ void set_storage_service(http_context& ctx, routes& r) {
req.get_query_param("key")));
});
ss::get_snapshot_details.set(r, [](std::unique_ptr<request> req) {
return service::get_local_storage_service().get_snapshot_details().then([] (auto result) {
std::vector<ss::snapshots> res;
for (auto& map: result) {
ss::snapshots all_snapshots;
all_snapshots.key = map.first;
std::vector<ss::snapshot> snapshot;
for (auto& cf: map.second) {
ss::snapshot s;
s.ks = cf.ks;
s.cf = cf.cf;
s.live = cf.live;
s.total = cf.total;
snapshot.push_back(std::move(s));
}
all_snapshots.value = std::move(snapshot);
res.push_back(std::move(all_snapshots));
}
return make_ready_future<json::json_return_type>(std::move(res));
});
});
ss::take_snapshot.set(r, [](std::unique_ptr<request> req) {
auto tag = req->get_query_param("tag");
auto column_family = req->get_query_param("cf");
std::vector<sstring> keynames = split(req->get_query_param("kn"), ",");
auto resp = make_ready_future<>();
if (column_family.empty()) {
resp = service::get_local_storage_service().take_snapshot(tag, keynames);
} else {
if (keynames.empty()) {
throw httpd::bad_param_exception("The keyspace of column families must be specified");
}
if (keynames.size() > 1) {
throw httpd::bad_param_exception("Only one keyspace allowed when specifying a column family");
}
resp = service::get_local_storage_service().take_column_family_snapshot(keynames[0], column_family, tag);
}
return resp.then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
ss::del_snapshot.set(r, [](std::unique_ptr<request> req) {
auto tag = req->get_query_param("tag");
std::vector<sstring> keynames = split(req->get_query_param("kn"), ",");
return service::get_local_storage_service().clear_snapshot(tag, keynames).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
ss::true_snapshots_size.set(r, [](std::unique_ptr<request> req) {
return service::get_local_storage_service().true_snapshots_size().then([] (int64_t size) {
return make_ready_future<json::json_return_type>(size);
});
});
ss::force_keyspace_compaction.set(r, [&ctx](std::unique_ptr<request> req) {
auto keyspace = validate_keyspace(ctx, req->param);
auto column_families = split_cf(req->get_query_param("cf"));
@@ -319,8 +252,8 @@ void set_storage_service(http_context& ctx, routes& r) {
for (auto cf : column_families) {
column_families_vec.push_back(&db.find_column_family(keyspace, cf));
}
return parallel_for_each(column_families_vec, [&cm] (column_family* cf) {
return cm.perform_cleanup(cf);
return parallel_for_each(column_families_vec, [&cm, &db] (column_family* cf) {
return cm.perform_cleanup(db, cf);
});
}).then([]{
return make_ready_future<json::json_return_type>(0);
@@ -328,32 +261,7 @@ void set_storage_service(http_context& ctx, routes& r) {
});
});
ss::scrub.set(r, wrap_ks_cf([&ctx](std::unique_ptr<request> req, sstring keyspace, std::vector<sstring> column_families) {
// TODO: respect this
auto skip_corrupted = req->get_query_param("skip_corrupted");
auto f = make_ready_future<>();
if (!req_param<bool>(*req, "disable_snapshot", false)) {
auto tag = format("pre-scrub-{:d}", db_clock::now().time_since_epoch().count());
f = parallel_for_each(column_families, [keyspace, tag](sstring cf) {
return service::get_local_storage_service().take_column_family_snapshot(keyspace, cf, tag);
});
}
return f.then([&ctx, keyspace, column_families] {
return ctx.db.invoke_on_all([=] (database& db) {
return do_for_each(column_families, [=, &db](sstring cfname) {
auto& cm = db.get_compaction_manager();
auto& cf = db.find_column_family(keyspace, cfname);
return cm.perform_sstable_scrub(&cf);
});
});
}).then([]{
return make_ready_future<json::json_return_type>(0);
});
}));
ss::upgrade_sstables.set(r, wrap_ks_cf([&ctx](std::unique_ptr<request> req, sstring keyspace, std::vector<sstring> column_families) {
ss::upgrade_sstables.set(r, wrap_ks_cf(ctx, [] (http_context& ctx, std::unique_ptr<request> req, sstring keyspace, std::vector<sstring> column_families) {
bool exclude_current_version = req_param<bool>(*req, "exclude_current_version", false);
return ctx.db.invoke_on_all([=] (database& db) {
@@ -608,9 +516,7 @@ void set_storage_service(http_context& ctx, routes& r) {
});
ss::join_ring.set(r, [](std::unique_ptr<request> req) {
return service::get_local_storage_service().join_ring().then([] {
return make_ready_future<json::json_return_type>(json_void());
});
return make_ready_future<json::json_return_type>(json_void());
});
ss::is_joined.set(r, [] (std::unique_ptr<request> req) {
@@ -1041,4 +947,107 @@ void set_storage_service(http_context& ctx, routes& r) {
}
void set_snapshot(http_context& ctx, routes& r) {
ss::get_snapshot_details.set(r, [](std::unique_ptr<request> req) {
std::function<future<>(output_stream<char>&&)> f = [](output_stream<char>&& s) {
return do_with(output_stream<char>(std::move(s)), true, [] (output_stream<char>& s, bool& first){
return s.write("[").then([&s, &first] {
return service::get_local_storage_service().get_snapshot_details().then([&s, &first] (std::unordered_map<sstring, std::vector<service::storage_service::snapshot_details>>&& result) {
return do_with(std::move(result), [&s, &first](const std::unordered_map<sstring, std::vector<service::storage_service::snapshot_details>>& result) {
return do_for_each(result, [&s, &result,&first](std::tuple<sstring, std::vector<service::storage_service::snapshot_details>>&& map){
return do_with(ss::snapshots(), [&s, &first, &result, &map](ss::snapshots& all_snapshots) {
all_snapshots.key = std::get<0>(map);
future<> f = first ? make_ready_future<>() : s.write(", ");
first = false;
std::vector<ss::snapshot> snapshot;
for (auto& cf: std::get<1>(map)) {
ss::snapshot snp;
snp.ks = cf.ks;
snp.cf = cf.cf;
snp.live = cf.live;
snp.total = cf.total;
snapshot.push_back(std::move(snp));
}
all_snapshots.value = std::move(snapshot);
return f.then([&s, &all_snapshots] {
return all_snapshots.write(s);
});
});
});
});
}).then([&s] {
return s.write("]").then([&s] {
return s.close();
});
});
});
});
};
return make_ready_future<json::json_return_type>(std::move(f));
});
ss::take_snapshot.set(r, [](std::unique_ptr<request> req) {
auto tag = req->get_query_param("tag");
auto column_family = req->get_query_param("cf");
std::vector<sstring> keynames = split(req->get_query_param("kn"), ",");
auto resp = make_ready_future<>();
if (column_family.empty()) {
resp = service::get_local_storage_service().take_snapshot(tag, keynames);
} else {
if (keynames.empty()) {
throw httpd::bad_param_exception("The keyspace of column families must be specified");
}
if (keynames.size() > 1) {
throw httpd::bad_param_exception("Only one keyspace allowed when specifying a column family");
}
resp = service::get_local_storage_service().take_column_family_snapshot(keynames[0], column_family, tag);
}
return resp.then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
ss::del_snapshot.set(r, [](std::unique_ptr<request> req) {
auto tag = req->get_query_param("tag");
auto column_family = req->get_query_param("cf");
std::vector<sstring> keynames = split(req->get_query_param("kn"), ",");
return service::get_local_storage_service().clear_snapshot(tag, keynames, column_family).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
ss::true_snapshots_size.set(r, [](std::unique_ptr<request> req) {
return service::get_local_storage_service().true_snapshots_size().then([] (int64_t size) {
return make_ready_future<json::json_return_type>(size);
});
});
ss::scrub.set(r, wrap_ks_cf(ctx, [] (http_context& ctx, std::unique_ptr<request> req, sstring keyspace, std::vector<sstring> column_families) {
const auto skip_corrupted = req_param<bool>(*req, "skip_corrupted", false);
auto f = make_ready_future<>();
if (!req_param<bool>(*req, "disable_snapshot", false)) {
auto tag = format("pre-scrub-{:d}", db_clock::now().time_since_epoch().count());
f = parallel_for_each(column_families, [keyspace, tag](sstring cf) {
return service::get_local_storage_service().take_column_family_snapshot(keyspace, cf, tag);
});
}
return f.then([&ctx, keyspace, column_families, skip_corrupted] {
return ctx.db.invoke_on_all([=] (database& db) {
return do_for_each(column_families, [=, &db](sstring cfname) {
auto& cm = db.get_compaction_manager();
auto& cf = db.find_column_family(keyspace, cfname);
return cm.perform_sstable_scrub(&cf, skip_corrupted);
});
});
}).then([]{
return make_ready_future<json::json_return_type>(0);
});
}));
}
}

@@ -26,5 +26,6 @@
namespace api {
void set_storage_service(http_context& ctx, routes& r);
void set_snapshot(http_context& ctx, routes& r);
}

@@ -30,6 +30,10 @@ namespace api {
namespace hs = httpd::system_json;
void set_system(http_context& ctx, routes& r) {
hs::get_system_uptime.set(r, [](const_req req) {
return std::chrono::duration_cast<std::chrono::milliseconds>(engine().uptime()).count();
});
hs::get_all_logger_names.set(r, [](const_req req) {
return logging::logger_registry().get_all_logger_names();
});

@@ -21,6 +21,7 @@
#include "atomic_cell.hh"
#include "atomic_cell_or_collection.hh"
#include "counters.hh"
#include "types.hh"
/// LSA migrator for cells with irrelevant type
@@ -214,6 +215,61 @@ size_t atomic_cell_or_collection::external_memory_usage(const abstract_type& t)
+ imr_object_type::size_overhead + external_value_size;
}
std::ostream&
operator<<(std::ostream& os, const atomic_cell_view& acv) {
if (acv.is_live()) {
return fmt_print(os, "atomic_cell{{{},ts={:d},expiry={:d},ttl={:d}}}",
acv.is_counter_update()
? "counter_update_value=" + to_sstring(acv.counter_update_value())
: to_hex(acv.value().linearize()),
acv.timestamp(),
acv.is_live_and_has_ttl() ? acv.expiry().time_since_epoch().count() : -1,
acv.is_live_and_has_ttl() ? acv.ttl().count() : 0);
} else {
return fmt_print(os, "atomic_cell{{DEAD,ts={:d},deletion_time={:d}}}",
acv.timestamp(), acv.deletion_time().time_since_epoch().count());
}
}
std::ostream&
operator<<(std::ostream& os, const atomic_cell& ac) {
return os << atomic_cell_view(ac);
}
std::ostream&
operator<<(std::ostream& os, const atomic_cell_view::printer& acvp) {
auto& type = acvp._type;
auto& acv = acvp._cell;
if (acv.is_live()) {
std::ostringstream cell_value_string_builder;
if (type.is_counter()) {
if (acv.is_counter_update()) {
cell_value_string_builder << "counter_update_value=" << acv.counter_update_value();
} else {
cell_value_string_builder << "shards: ";
counter_cell_view::with_linearized(acv, [&cell_value_string_builder] (counter_cell_view& ccv) {
cell_value_string_builder << ::join(", ", ccv.shards());
});
}
} else {
cell_value_string_builder << type.to_string(acv.value().linearize());
}
return fmt_print(os, "atomic_cell{{{},ts={:d},expiry={:d},ttl={:d}}}",
cell_value_string_builder.str(),
acv.timestamp(),
acv.is_live_and_has_ttl() ? acv.expiry().time_since_epoch().count() : -1,
acv.is_live_and_has_ttl() ? acv.ttl().count() : 0);
} else {
return fmt_print(os, "atomic_cell{{DEAD,ts={:d},deletion_time={:d}}}",
acv.timestamp(), acv.deletion_time().time_since_epoch().count());
}
}
std::ostream&
operator<<(std::ostream& os, const atomic_cell::printer& acp) {
return operator<<(os, static_cast<const atomic_cell_view::printer&>(acp));
}
std::ostream& operator<<(std::ostream& os, const atomic_cell_or_collection::printer& p) {
if (!p._cell._data.get()) {
return os << "{ null atomic_cell_or_collection }";
@@ -223,9 +279,9 @@ std::ostream& operator<<(std::ostream& os, const atomic_cell_or_collection::prin
if (dc::structure::get_member<dc::tags::flags>(p._cell._data.get()).get<dc::tags::collection>()) {
os << "collection ";
auto cmv = p._cell.as_collection_mutation();
os << to_hex(cmv.data.linearize());
os << collection_mutation_view::printer(*p._cdef.type, cmv);
} else {
os << p._cell.as_atomic_cell(p._cdef);
os << atomic_cell_view::printer(*p._cdef.type, p._cell.as_atomic_cell(p._cdef));
}
return os << " }";
}

@@ -153,6 +153,14 @@ public:
}
friend std::ostream& operator<<(std::ostream& os, const atomic_cell_view& acv);
class printer {
const abstract_type& _type;
const atomic_cell_view& _cell;
public:
printer(const abstract_type& type, const atomic_cell_view& cell) : _type(type), _cell(cell) {}
friend std::ostream& operator<<(std::ostream& os, const printer& acvp);
};
};
class atomic_cell_mutable_view final : public basic_atomic_cell_view<mutable_view::yes> {
@@ -219,6 +227,12 @@ public:
static atomic_cell make_live_uninitialized(const abstract_type& type, api::timestamp_type timestamp, size_t size);
friend class atomic_cell_or_collection;
friend std::ostream& operator<<(std::ostream& os, const atomic_cell& ac);
class printer : atomic_cell_view::printer {
public:
printer(const abstract_type& type, const atomic_cell_view& cell) : atomic_cell_view::printer(type, cell) {}
friend std::ostream& operator<<(std::ostream& os, const printer& acvp);
};
};
class column_definition;

@@ -52,7 +52,7 @@ public:
return make_ready_future<>();
}
virtual const sstring& qualified_java_name() const override {
virtual std::string_view qualified_java_name() const override {
return allow_all_authenticator_name();
}

@@ -49,7 +49,7 @@ public:
return make_ready_future<>();
}
virtual const sstring& qualified_java_name() const override {
virtual std::string_view qualified_java_name() const override {
return allow_all_authorizer_name();
}

@@ -96,7 +96,7 @@ public:
///
/// A fully-qualified (class with package) Java-like name for this implementation.
///
virtual const sstring& qualified_java_name() const = 0;
virtual std::string_view qualified_java_name() const = 0;
virtual bool require_authentication() const = 0;

@@ -100,7 +100,7 @@ public:
///
/// A fully-qualified (class with package) Java-like name for this implementation.
///
virtual const sstring& qualified_java_name() const = 0;
virtual std::string_view qualified_java_name() const = 0;
///
/// Query for the permissions granted directly to a role for a particular \ref resource (and not any of its

@@ -59,7 +59,7 @@ future<> do_after_system_ready(seastar::abort_source& as, seastar::noncopyable_f
}).discard_result();
}
future<> create_metadata_table_if_missing(
static future<> create_metadata_table_if_missing_impl(
std::string_view table_name,
cql3::query_processor& qp,
std::string_view cql,
@@ -85,7 +85,14 @@ future<> create_metadata_table_if_missing(
return ignore_existing([&mm, table = std::move(table)] () {
return mm.announce_new_column_family(table, false);
});
}
future<> create_metadata_table_if_missing(
std::string_view table_name,
cql3::query_processor& qp,
std::string_view cql,
::service::migration_manager& mm) noexcept {
return futurize_apply(create_metadata_table_if_missing_impl, table_name, qp, cql, mm);
}
future<> wait_for_schema_agreement(::service::migration_manager& mm, const database& db, seastar::abort_source& as) {

@@ -79,7 +79,7 @@ future<> create_metadata_table_if_missing(
std::string_view table_name,
cql3::query_processor&,
std::string_view cql,
::service::migration_manager&);
::service::migration_manager&) noexcept;
future<> wait_for_schema_agreement(::service::migration_manager&, const database&, seastar::abort_source&);

@@ -101,7 +101,7 @@ bool default_authorizer::legacy_metadata_exists() const {
future<bool> default_authorizer::any_granted() const {
static const sstring query = format("SELECT * FROM {}.{} LIMIT 1", meta::AUTH_KS, PERMISSIONS_CF);
return _qp.process(
return _qp.execute_internal(
query,
db::consistency_level::LOCAL_ONE,
infinite_timeout_config,
@@ -115,7 +115,7 @@ future<> default_authorizer::migrate_legacy_metadata() const {
alogger.info("Starting migration of legacy permissions metadata.");
static const sstring query = format("SELECT * FROM {}.{}", meta::AUTH_KS, legacy_table_name);
return _qp.process(
return _qp.execute_internal(
query,
db::consistency_level::LOCAL_ONE,
infinite_timeout_config).then([this](::shared_ptr<cql3::untyped_result_set> results) {
@@ -195,7 +195,7 @@ default_authorizer::authorize(const role_or_anonymous& maybe_role, const resourc
ROLE_NAME,
RESOURCE_NAME);
return _qp.process(
return _qp.execute_internal(
query,
db::consistency_level::LOCAL_ONE,
infinite_timeout_config,
@@ -224,7 +224,7 @@ default_authorizer::modify(
ROLE_NAME,
RESOURCE_NAME),
[this, &role_name, set, &resource](const auto& query) {
return _qp.process(
return _qp.execute_internal(
query,
db::consistency_level::ONE,
internal_distributed_timeout_config(),
@@ -249,7 +249,7 @@ future<std::vector<permission_details>> default_authorizer::list_all() const {
meta::AUTH_KS,
PERMISSIONS_CF);
return _qp.process(
return _qp.execute_internal(
query,
db::consistency_level::ONE,
internal_distributed_timeout_config(),
@@ -276,7 +276,7 @@ future<> default_authorizer::revoke_all(std::string_view role_name) const {
PERMISSIONS_CF,
ROLE_NAME);
return _qp.process(
return _qp.execute_internal(
query,
db::consistency_level::ONE,
internal_distributed_timeout_config(),
@@ -296,7 +296,7 @@ future<> default_authorizer::revoke_all(const resource& resource) const {
PERMISSIONS_CF,
RESOURCE_NAME);
return _qp.process(
return _qp.execute_internal(
query,
db::consistency_level::LOCAL_ONE,
infinite_timeout_config,
@@ -313,7 +313,7 @@ future<> default_authorizer::revoke_all(const resource& resource) const {
ROLE_NAME,
RESOURCE_NAME);
return _qp.process(
return _qp.execute_internal(
query,
db::consistency_level::LOCAL_ONE,
infinite_timeout_config,

@@ -71,7 +71,7 @@ public:
virtual future<> stop() override;
virtual const sstring& qualified_java_name() const override {
virtual std::string_view qualified_java_name() const override {
return default_authorizer_name();
}

View File

@@ -96,10 +96,13 @@ static bool has_salted_hash(const cql3::untyped_result_set_row& row) {
return !row.get_or<sstring>(SALTED_HASH, "").empty();
}
static const sstring update_row_query = format("UPDATE {} SET {} = ? WHERE {} = ?",
meta::roles_table::qualified_name(),
SALTED_HASH,
meta::roles_table::role_col_name);
static const sstring& update_row_query() {
static const sstring update_row_query = format("UPDATE {} SET {} = ? WHERE {} = ?",
meta::roles_table::qualified_name(),
SALTED_HASH,
meta::roles_table::role_col_name);
return update_row_query;
}
static const sstring legacy_table_name{"credentials"};
@@ -111,7 +114,7 @@ future<> password_authenticator::migrate_legacy_metadata() const {
plogger.info("Starting migration of legacy authentication metadata.");
static const sstring query = format("SELECT * FROM {}.{}", meta::AUTH_KS, legacy_table_name);
return _qp.process(
return _qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_timeout_config()).then([this](::shared_ptr<cql3::untyped_result_set> results) {
@@ -119,8 +122,8 @@ future<> password_authenticator::migrate_legacy_metadata() const {
auto username = row.get_as<sstring>("username");
auto salted_hash = row.get_as<sstring>(SALTED_HASH);
return _qp.process(
update_row_query,
return _qp.execute_internal(
update_row_query(),
consistency_for_user(username),
internal_distributed_timeout_config(),
{std::move(salted_hash), username}).discard_result();
@@ -136,8 +139,8 @@ future<> password_authenticator::migrate_legacy_metadata() const {
future<> password_authenticator::create_default_if_missing() const {
return default_role_row_satisfies(_qp, &has_salted_hash).then([this](bool exists) {
if (!exists) {
return _qp.process(
update_row_query,
return _qp.execute_internal(
update_row_query(),
db::consistency_level::QUORUM,
internal_distributed_timeout_config(),
{passwords::hash(DEFAULT_USER_PASSWORD, rng_for_salt), DEFAULT_USER_NAME}).then([](auto&&) {
@@ -194,7 +197,7 @@ db::consistency_level password_authenticator::consistency_for_user(std::string_v
return db::consistency_level::LOCAL_ONE;
}
const sstring& password_authenticator::qualified_java_name() const {
std::string_view password_authenticator::qualified_java_name() const {
return password_authenticator_name();
}
@@ -233,7 +236,7 @@ future<authenticated_user> password_authenticator::authenticate(
meta::roles_table::qualified_name(),
meta::roles_table::role_col_name);
return _qp.process(
return _qp.execute_internal(
query,
consistency_for_user(username),
internal_distributed_timeout_config(),
@@ -267,8 +270,8 @@ future<> password_authenticator::create(std::string_view role_name, const authen
return make_ready_future<>();
}
return _qp.process(
update_row_query,
return _qp.execute_internal(
update_row_query(),
consistency_for_user(role_name),
internal_distributed_timeout_config(),
{passwords::hash(*options.password, rng_for_salt), sstring(role_name)}).discard_result();
@@ -284,7 +287,7 @@ future<> password_authenticator::alter(std::string_view role_name, const authent
SALTED_HASH,
meta::roles_table::role_col_name);
return _qp.process(
return _qp.execute_internal(
query,
consistency_for_user(role_name),
internal_distributed_timeout_config(),
@@ -297,7 +300,7 @@ future<> password_authenticator::drop(std::string_view name) const {
meta::roles_table::qualified_name(),
meta::roles_table::role_col_name);
return _qp.process(
return _qp.execute_internal(
query, consistency_for_user(name),
internal_distributed_timeout_config(),
{sstring(name)}).discard_result();

@@ -71,7 +71,7 @@ public:
virtual future<> stop() override;
virtual const sstring& qualified_java_name() const override;
virtual std::string_view qualified_java_name() const override;
virtual bool require_authentication() const override;

@@ -33,6 +33,7 @@
#include "auth/resource.hh"
#include "seastarx.hh"
#include "exceptions/exceptions.hh"
namespace auth {
@@ -52,9 +53,9 @@ struct role_config_update final {
///
/// A logical argument error for a role-management operation.
///
class roles_argument_exception : public std::invalid_argument {
class roles_argument_exception : public exceptions::invalid_request_exception {
public:
using std::invalid_argument::invalid_argument;
using exceptions::invalid_request_exception::invalid_request_exception;
};
class role_already_exists : public roles_argument_exception {

@@ -68,14 +68,14 @@ future<bool> default_role_row_satisfies(
meta::roles_table::role_col_name);
return do_with(std::move(p), [&qp](const auto& p) {
return qp.process(
return qp.execute_internal(
query,
db::consistency_level::ONE,
infinite_timeout_config,
{meta::DEFAULT_SUPERUSER_NAME},
true).then([&qp, &p](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {
return qp.process(
return qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_timeout_config(),
@@ -100,7 +100,7 @@ future<bool> any_nondefault_role_row_satisfies(
static const sstring query = format("SELECT * FROM {}", meta::roles_table::qualified_name());
return do_with(std::move(p), [&qp](const auto& p) {
return qp.process(
return qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_timeout_config()).then([&p](::shared_ptr<cql3::untyped_result_set> results) {

@@ -39,7 +39,7 @@
#include "db/consistency_level_type.hh"
#include "exceptions/exceptions.hh"
#include "log.hh"
#include "service/migration_listener.hh"
#include "service/migration_manager.hh"
#include "utils/class_registrator.hh"
#include "database.hh"
@@ -114,14 +114,14 @@ static future<> validate_role_exists(const service& ser, std::string_view role_n
service::service(
permissions_cache_config c,
cql3::query_processor& qp,
::service::migration_manager& mm,
::service::migration_notifier& mn,
std::unique_ptr<authorizer> z,
std::unique_ptr<authenticator> a,
std::unique_ptr<role_manager> r)
: _permissions_cache_config(std::move(c))
, _permissions_cache(nullptr)
, _qp(qp)
, _migration_manager(mm)
, _mnotifier(mn)
, _authorizer(std::move(z))
, _authenticator(std::move(a))
, _role_manager(std::move(r))
@@ -141,18 +141,19 @@ service::service(
service::service(
permissions_cache_config c,
cql3::query_processor& qp,
::service::migration_notifier& mn,
::service::migration_manager& mm,
const service_config& sc)
: service(
std::move(c),
qp,
mm,
mn,
create_object<authorizer>(sc.authorizer_java_name, qp, mm),
create_object<authenticator>(sc.authenticator_java_name, qp, mm),
create_object<role_manager>(sc.role_manager_java_name, qp, mm)) {
}
future<> service::create_keyspace_if_missing() const {
future<> service::create_keyspace_if_missing(::service::migration_manager& mm) const {
auto& db = _qp.db();
if (!db.has_keyspace(meta::AUTH_KS)) {
@@ -166,15 +167,15 @@ future<> service::create_keyspace_if_missing() const {
// We use min_timestamp so that default keyspace metadata will lose to any manual adjustments.
// See issue #2129.
return _migration_manager.announce_new_keyspace(ksm, api::min_timestamp, false);
return mm.announce_new_keyspace(ksm, api::min_timestamp, false);
}
return make_ready_future<>();
}
future<> service::start() {
return once_among_shards([this] {
return create_keyspace_if_missing();
future<> service::start(::service::migration_manager& mm) {
return once_among_shards([this, &mm] {
return create_keyspace_if_missing(mm);
}).then([this] {
return _role_manager->start().then([this] {
return when_all_succeed(_authorizer->start(), _authenticator->start());
@@ -183,7 +184,7 @@ future<> service::start() {
_permissions_cache = std::make_unique<permissions_cache>(_permissions_cache_config, *this, log);
}).then([this] {
return once_among_shards([this] {
_migration_manager.register_listener(_migration_listener.get());
_mnotifier.register_listener(_migration_listener.get());
return make_ready_future<>();
});
});
@@ -192,9 +193,12 @@ future<> service::start() {
future<> service::stop() {
// Only one of the shards has the listener registered, but let's try to
// unregister on each one just to make sure.
_migration_manager.unregister_listener(_migration_listener.get());
return _permissions_cache->stop().then([this] {
return _mnotifier.unregister_listener(_migration_listener.get()).then([this] {
if (_permissions_cache) {
return _permissions_cache->stop();
}
return make_ready_future<>();
}).then([this] {
return when_all_succeed(_role_manager->stop(), _authorizer->stop(), _authenticator->stop());
});
}
@@ -216,7 +220,7 @@ future<bool> service::has_existing_legacy_users() const {
// This logic is borrowed directly from Apache Cassandra. By first checking for the presence of the default user, we
// can potentially avoid doing a range query with a high consistency level.
return _qp.process(
return _qp.execute_internal(
default_user_query,
db::consistency_level::ONE,
infinite_timeout_config,
@@ -226,7 +230,7 @@ future<bool> service::has_existing_legacy_users() const {
return make_ready_future<bool>(true);
}
return _qp.process(
return _qp.execute_internal(
default_user_query,
db::consistency_level::QUORUM,
infinite_timeout_config,
@@ -236,7 +240,7 @@ future<bool> service::has_existing_legacy_users() const {
return make_ready_future<bool>(true);
}
return _qp.process(
return _qp.execute_internal(
all_users_query,
db::consistency_level::QUORUM,
infinite_timeout_config).then([](auto results) {

@@ -28,6 +28,7 @@
#include <seastar/core/future.hh>
#include <seastar/core/sstring.hh>
#include <seastar/util/bool_class.hh>
#include <seastar/core/sharded.hh>
#include "auth/authenticator.hh"
#include "auth/authorizer.hh"
@@ -42,6 +43,7 @@ class query_processor;
namespace service {
class migration_manager;
class migration_notifier;
class migration_listener;
}
@@ -76,13 +78,15 @@ public:
///
/// All state associated with access-control is stored externally to any particular instance of this class.
///
class service final {
/// peering_sharded_service inheritance is needed to be able to access shard local authentication service
/// given an object from another shard. Used for bouncing lwt requests to correct shard.
class service final : public seastar::peering_sharded_service<service> {
permissions_cache_config _permissions_cache_config;
std::unique_ptr<permissions_cache> _permissions_cache;
cql3::query_processor& _qp;
::service::migration_manager& _migration_manager;
::service::migration_notifier& _mnotifier;
std::unique_ptr<authorizer> _authorizer;
@@ -97,7 +101,7 @@ public:
service(
permissions_cache_config,
cql3::query_processor&,
::service::migration_manager&,
::service::migration_notifier&,
std::unique_ptr<authorizer>,
std::unique_ptr<authenticator>,
std::unique_ptr<role_manager>);
@@ -110,10 +114,11 @@ public:
service(
permissions_cache_config,
cql3::query_processor&,
::service::migration_notifier&,
::service::migration_manager&,
const service_config&);
future<> start();
future<> start(::service::migration_manager&);
future<> stop();
@@ -159,7 +164,7 @@ public:
private:
future<bool> has_existing_legacy_users() const;
future<> create_keyspace_if_missing() const;
future<> create_keyspace_if_missing(::service::migration_manager& mm) const;
};
future<bool> has_superuser(const service&, const authenticated_user&);


@@ -35,6 +35,7 @@
#include "auth/common.hh"
#include "auth/roles-metadata.hh"
#include "cql3/query_processor.hh"
#include "cql3/untyped_result_set.hh"
#include "db/consistency_level_type.hh"
#include "exceptions/exceptions.hh"
#include "log.hh"
@@ -86,7 +87,7 @@ static future<std::optional<record>> find_record(cql3::query_processor& qp, std:
meta::roles_table::qualified_name(),
meta::roles_table::role_col_name);
return qp.process(
return qp.execute_internal(
query,
consistency_for_role(role_name),
internal_distributed_timeout_config(),
@@ -170,7 +171,7 @@ future<> standard_role_manager::create_default_role_if_missing() const {
meta::roles_table::qualified_name(),
meta::roles_table::role_col_name);
return _qp.process(
return _qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_timeout_config(),
@@ -197,7 +198,7 @@ future<> standard_role_manager::migrate_legacy_metadata() const {
log.info("Starting migration of legacy user metadata.");
static const sstring query = format("SELECT * FROM {}.{}", meta::AUTH_KS, legacy_table_name);
return _qp.process(
return _qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_timeout_config()).then([this](::shared_ptr<cql3::untyped_result_set> results) {
@@ -258,7 +259,7 @@ future<> standard_role_manager::create_or_replace(std::string_view role_name, co
meta::roles_table::qualified_name(),
meta::roles_table::role_col_name);
return _qp.process(
return _qp.execute_internal(
query,
consistency_for_role(role_name),
internal_distributed_timeout_config(),
@@ -298,7 +299,7 @@ standard_role_manager::alter(std::string_view role_name, const role_config_updat
return make_ready_future<>();
}
return _qp.process(
return _qp.execute_internal(
format("UPDATE {} SET {} WHERE {} = ?",
meta::roles_table::qualified_name(),
build_column_assignments(u),
@@ -320,7 +321,7 @@ future<> standard_role_manager::drop(std::string_view role_name) const {
static const sstring query = format("SELECT member FROM {} WHERE role = ?",
meta::role_members_table::qualified_name());
return _qp.process(
return _qp.execute_internal(
query,
consistency_for_role(role_name),
internal_distributed_timeout_config(),
@@ -359,7 +360,7 @@ future<> standard_role_manager::drop(std::string_view role_name) const {
meta::roles_table::qualified_name(),
meta::roles_table::role_col_name);
return _qp.process(
return _qp.execute_internal(
query,
consistency_for_role(role_name),
internal_distributed_timeout_config(),
@@ -386,7 +387,7 @@ standard_role_manager::modify_membership(
(ch == membership_change::add ? '+' : '-'),
meta::roles_table::role_col_name);
return _qp.process(
return _qp.execute_internal(
query,
consistency_for_role(grantee_name),
internal_distributed_timeout_config(),
@@ -396,7 +397,7 @@ standard_role_manager::modify_membership(
const auto modify_role_members = [this, role_name, grantee_name, ch] {
switch (ch) {
case membership_change::add:
return _qp.process(
return _qp.execute_internal(
format("INSERT INTO {} (role, member) VALUES (?, ?)",
meta::role_members_table::qualified_name()),
consistency_for_role(role_name),
@@ -404,7 +405,7 @@ standard_role_manager::modify_membership(
{sstring(role_name), sstring(grantee_name)}).discard_result();
case membership_change::remove:
return _qp.process(
return _qp.execute_internal(
format("DELETE FROM {} WHERE role = ? AND member = ?",
meta::role_members_table::qualified_name()),
consistency_for_role(role_name),
@@ -508,7 +509,7 @@ future<role_set> standard_role_manager::query_all() const {
// To avoid many copies of a view.
static const auto role_col_name_string = sstring(meta::roles_table::role_col_name);
return _qp.process(
return _qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_timeout_config()).then([](::shared_ptr<cql3::untyped_result_set> results) {


@@ -82,7 +82,7 @@ public:
return _authenticator->stop();
}
virtual const sstring& qualified_java_name() const override {
virtual std::string_view qualified_java_name() const override {
return transitional_authenticator_name();
}
@@ -201,7 +201,7 @@ public:
return _authorizer->stop();
}
virtual const sstring& qualified_java_name() const override {
virtual std::string_view qualified_java_name() const override {
return transitional_authorizer_name();
}


@@ -23,7 +23,11 @@
#include <seastar/core/scheduling.hh>
#include <seastar/core/timer.hh>
#include <seastar/core/gate.hh>
#include <seastar/core/file.hh>
#include <chrono>
#include <cmath>
#include "seastarx.hh"
// Simple proportional controller to adjust shares for processes for which a backlog can be clearly
// defined.

build_id.cc (new file, 72 lines)

@@ -0,0 +1,72 @@
/*
* Copyright (C) 2019 ScyllaDB
*/
#include "build_id.hh"
#include <fmt/printf.h>
#include <link.h>
#include <seastar/core/align.hh>
#include <sstream>
#include <cassert>
using namespace seastar;
static const Elf64_Nhdr* get_nt_build_id(dl_phdr_info* info) {
auto base = info->dlpi_addr;
const auto* h = info->dlpi_phdr;
auto num_headers = info->dlpi_phnum;
for (int i = 0; i != num_headers; ++i, ++h) {
if (h->p_type != PT_NOTE) {
continue;
}
auto* p = reinterpret_cast<const char*>(base) + h->p_vaddr;
auto* e = p + h->p_memsz;
while (p != e) {
const auto* n = reinterpret_cast<const Elf64_Nhdr*>(p);
if (n->n_type == NT_GNU_BUILD_ID) {
return n;
}
p += sizeof(Elf64_Nhdr);
p += n->n_namesz;
p = align_up(p, 4);
p += n->n_descsz;
p = align_up(p, 4);
}
}
assert(0 && "no NT_GNU_BUILD_ID note");
}
static int callback(dl_phdr_info* info, size_t size, void* data) {
std::string& ret = *(std::string*)data;
std::ostringstream os;
// The first DSO is always the main program, which has an empty name.
assert(strlen(info->dlpi_name) == 0);
auto* n = get_nt_build_id(info);
auto* p = reinterpret_cast<const char*>(n);
p += sizeof(Elf64_Nhdr);
p += n->n_namesz;
p = align_up(p, 4);
const char* desc = p;
for (unsigned i = 0; i < n->n_descsz; ++i) {
fmt::fprintf(os, "%02x", (unsigned char)*(desc + i));
}
ret = os.str();
return 1;
}
std::string get_build_id() {
std::string ret;
int r = dl_iterate_phdr(callback, &ret);
assert(r == 1);
return ret;
}
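
The note walk in `get_nt_build_id` can be tried without a live ELF image. The following is a minimal sketch, not the code from this commit: `note_header`, `align4`, `make_note` and `find_build_id_hex` are hypothetical names, the header struct is declared locally instead of using `Elf64_Nhdr`, and the loop bounds-checks each note instead of asserting, which is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <string>
#include <vector>

// Local mirror of the three 32-bit fields of Elf64_Nhdr (hypothetical name),
// so the sketch does not depend on <link.h>.
struct note_header {
    uint32_t namesz;
    uint32_t descsz;
    uint32_t type;
};
static_assert(sizeof(note_header) == 12, "note header is three 32-bit words");

constexpr uint32_t nt_gnu_build_id = 3; // numeric value of NT_GNU_BUILD_ID

// Name and descriptor are each padded to a 4-byte boundary.
constexpr std::size_t align4(std::size_t n) { return (n + 3) & ~std::size_t(3); }

// Walks a buffer laid out like a PT_NOTE segment and returns the build ID
// descriptor as lowercase hex, or nullopt if no such note is present.
std::optional<std::string> find_build_id_hex(const std::vector<unsigned char>& notes) {
    std::size_t off = 0;
    while (off + sizeof(note_header) <= notes.size()) {
        note_header h;
        std::memcpy(&h, notes.data() + off, sizeof(h));
        std::size_t desc_off = off + sizeof(h) + align4(h.namesz);
        if (desc_off > notes.size() || h.descsz > notes.size() - desc_off) {
            return std::nullopt; // truncated note
        }
        if (h.type == nt_gnu_build_id) {
            static const char digits[] = "0123456789abcdef";
            std::string out;
            for (std::size_t i = 0; i < h.descsz; ++i) {
                unsigned char b = notes[desc_off + i];
                out += digits[b >> 4];
                out += digits[b & 0xf];
            }
            return out;
        }
        off = desc_off + align4(h.descsz);
    }
    return std::nullopt;
}

// Test helper: serializes one note (header, NUL-terminated name, descriptor).
std::vector<unsigned char> make_note(uint32_t type, const std::string& name,
                                     const std::vector<unsigned char>& desc) {
    note_header h{static_cast<uint32_t>(name.size() + 1),
                  static_cast<uint32_t>(desc.size()), type};
    std::vector<unsigned char> out(sizeof(h));
    std::memcpy(out.data(), &h, sizeof(h));
    out.insert(out.end(), name.begin(), name.end());
    out.push_back(0);
    out.resize(align4(out.size()));
    out.insert(out.end(), desc.begin(), desc.end());
    out.resize(align4(out.size()));
    assert(out.size() % 4 == 0);
    return out;
}
```

Feeding it an unrelated note followed by a build-ID note built with `make_note` exercises the skip logic: the first note is stepped over by its padded name and descriptor sizes, and the second one is returned as hex.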

build_id.hh (new file, 9 lines)

@@ -0,0 +1,9 @@
/*
* Copyright (C) 2019 ScyllaDB
*/
#pragma once
#include <string>
std::string get_build_id();


@@ -64,7 +64,7 @@ bytes from_hex(sstring_view s) {
sstring to_hex(bytes_view b) {
static char digits[] = "0123456789abcdef";
sstring out(sstring::initialized_later(), b.size() * 2);
sstring out = uninitialized_string(b.size() * 2);
unsigned end = b.size();
for (unsigned i = 0; i != end; ++i) {
uint8_t x = b[i];


@@ -38,6 +38,7 @@ class bytes_ostream {
public:
using size_type = bytes::size_type;
using value_type = bytes::value_type;
using fragment_type = bytes_view;
static constexpr size_type max_chunk_size() { return 128 * 1024; }
private:
static_assert(sizeof(value_type) == 1, "value_type is assumed to be one byte long");
@@ -93,6 +94,29 @@ public:
return _current != other._current;
}
};
using const_iterator = fragment_iterator;
class output_iterator {
public:
using iterator_category = std::output_iterator_tag;
using difference_type = std::ptrdiff_t;
using value_type = bytes_ostream::value_type;
using pointer = bytes_ostream::value_type*;
using reference = bytes_ostream::value_type&;
friend class bytes_ostream;
private:
bytes_ostream* _ostream = nullptr;
private:
explicit output_iterator(bytes_ostream& os) : _ostream(&os) { }
public:
reference operator*() const { return *_ostream->write_place_holder(1); }
output_iterator& operator++() { return *this; }
output_iterator operator++(int) { return *this; }
};
private:
inline size_type current_space_left() const {
if (!_current) {
@@ -289,6 +313,11 @@ public:
return _size;
}
// For the FragmentRange concept
size_type size_bytes() const {
return _size;
}
bool empty() const {
return _size == 0;
}
@@ -326,6 +355,8 @@ public:
fragment_iterator begin() const { return { _begin.get() }; }
fragment_iterator end() const { return { nullptr }; }
output_iterator write_begin() { return output_iterator(*this); }
boost::iterator_range<fragment_iterator> fragments() const {
return { begin(), end() };
}
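
The `output_iterator` added above appends one byte per dereference and treats increment as a no-op. A minimal sketch of that pattern, assuming a plain `std::vector<char>` in place of the chunked storage; `byte_sink` and `reserve_one` are hypothetical names, not part of `bytes_ostream`:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical stand-in for bytes_ostream, backed by a single vector.
class byte_sink {
    std::vector<char> _bytes;
public:
    // Appends one zeroed byte and returns its address (plays the role of
    // write_place_holder(1) in the diff above).
    char* reserve_one() {
        _bytes.push_back(0);
        return &_bytes.back();
    }

    class output_iterator {
        byte_sink* _sink;
    public:
        using iterator_category = std::output_iterator_tag;
        using difference_type = std::ptrdiff_t;
        using value_type = char;
        using pointer = char*;
        using reference = char&;

        explicit output_iterator(byte_sink& s) : _sink(&s) {}
        // Every dereference grows the sink by one byte, so callers must
        // dereference exactly once per value written.
        reference operator*() const { return *_sink->reserve_one(); }
        // Increment carries no state; the position lives in the sink itself.
        output_iterator& operator++() { return *this; }
        output_iterator operator++(int) { return *this; }
    };

    output_iterator write_begin() { return output_iterator(*this); }
    std::string str() const { return std::string(_bytes.begin(), _bytes.end()); }
};
```

With this shape, `std::copy(in.begin(), in.end(), sink.write_begin())` appends the input bytes in order, because the algorithm assigns through `*it` once per element. The returned reference is only valid until the next append, which is fine for single-pass output use.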


@@ -35,6 +35,7 @@
#include "idl/uuid.dist.impl.hh"
#include "idl/keys.dist.impl.hh"
#include "idl/mutation.dist.impl.hh"
#include <iostream>
canonical_mutation::canonical_mutation(bytes data)
: _data(std::move(data))
@@ -89,3 +90,81 @@ mutation canonical_mutation::to_mutation(schema_ptr s) const {
}
return m;
}
static sstring bytes_to_text(bytes_view bv) {
sstring ret = uninitialized_string(bv.size());
std::copy_n(reinterpret_cast<const char*>(bv.data()), bv.size(), ret.data());
return ret;
}
std::ostream& operator<<(std::ostream& os, const canonical_mutation& cm) {
auto in = ser::as_input_stream(cm._data);
auto mv = ser::deserialize(in, boost::type<ser::canonical_mutation_view>());
column_mapping mapping = mv.mapping();
auto partition_view = mutation_partition_view::from_view(mv.partition());
fmt::print(os, "{{canonical_mutation: ");
fmt::print(os, "table_id {} schema_version {} ", mv.table_id(), mv.schema_version());
fmt::print(os, "partition_key {} ", mv.key());
class printing_visitor : public mutation_partition_view_virtual_visitor {
std::ostream& _os;
const column_mapping& _cm;
bool _first = true;
bool _in_row = false;
private:
void print_separator() {
if (!_first) {
fmt::print(_os, ", ");
}
_first = false;
}
public:
printing_visitor(std::ostream& os, const column_mapping& cm) : _os(os), _cm(cm) {}
virtual void accept_partition_tombstone(tombstone t) override {
print_separator();
fmt::print(_os, "partition_tombstone {}", t);
}
virtual void accept_static_cell(column_id id, atomic_cell ac) override {
print_separator();
auto&& entry = _cm.static_column_at(id);
fmt::print(_os, "static column {} {}", bytes_to_text(entry.name()), atomic_cell::printer(*entry.type(), ac));
}
virtual void accept_static_cell(column_id id, collection_mutation_view cmv) override {
print_separator();
auto&& entry = _cm.static_column_at(id);
fmt::print(_os, "static column {} {}", bytes_to_text(entry.name()), collection_mutation_view::printer(*entry.type(), cmv));
}
virtual void accept_row_tombstone(range_tombstone rt) override {
print_separator();
fmt::print(_os, "row tombstone {}", rt);
}
virtual void accept_row(position_in_partition_view pipv, row_tombstone rt, row_marker rm, is_dummy, is_continuous) override {
if (_in_row) {
fmt::print(_os, "}}, ");
}
fmt::print(_os, "{{row {} tombstone {} marker {}", pipv, rt, rm);
_in_row = true;
_first = false;
}
virtual void accept_row_cell(column_id id, atomic_cell ac) override {
print_separator();
auto&& entry = _cm.regular_column_at(id);
fmt::print(_os, "column {} {}", bytes_to_text(entry.name()), atomic_cell::printer(*entry.type(), ac));
}
virtual void accept_row_cell(column_id id, collection_mutation_view cmv) override {
print_separator();
auto&& entry = _cm.regular_column_at(id);
fmt::print(_os, "column {} {}", bytes_to_text(entry.name()), collection_mutation_view::printer(*entry.type(), cmv));
}
void finalize() {
if (_in_row) {
fmt::print(_os, "}}");
}
}
};
printing_visitor pv(os, mapping);
partition_view.accept(mapping, pv);
pv.finalize();
fmt::print(os, "}}");
return os;
}
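
The `_first` / `_in_row` bookkeeping in `printing_visitor` can be sketched in isolation. The class below is a hypothetical stand-in (`item_printer`, `item`, `begin_row` are illustrative names), using plain strings instead of mutation fragments. Unlike the visitor above, this sketch also routes row starts through the separator helper; that is a choice made here, not something taken from the diff.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Emits comma-separated items and wraps each row in braces, closing an
// open row when the next one starts or when finalize() is called.
class item_printer {
    std::ostream& _os;
    bool _first = true;   // nothing printed yet, so no separator is needed
    bool _in_row = false; // a "{row ..." group is open and needs a closing brace

    void print_separator() {
        if (!_first) {
            _os << ", ";
        }
        _first = false;
    }
public:
    explicit item_printer(std::ostream& os) : _os(os) {}

    // Prints a stand-alone item or a cell inside the current row.
    void item(const std::string& text) {
        print_separator();
        _os << text;
    }

    // Closes the previous row, if any, then opens a new one.
    void begin_row(const std::string& key) {
        if (_in_row) {
            _os << "}";
        }
        print_separator();
        _os << "{row " << key;
        _in_row = true;
    }

    // Must be called once at the end to close the last open row.
    void finalize() {
        if (_in_row) {
            _os << "}";
            _in_row = false;
        }
    }
};
```

Because rows are only closed lazily, forgetting `finalize()` leaves the last brace unbalanced, which is why the operator above calls `pv.finalize()` before printing its own closing brace.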


@@ -22,10 +22,11 @@
#pragma once
#include "bytes.hh"
#include "schema.hh"
#include "schema_fwd.hh"
#include "database_fwd.hh"
#include "mutation_partition_visitor.hh"
#include "mutation_partition_serializer.hh"
#include <iosfwd>
// Immutable mutation form which can be read using any schema version of the same table.
// Safe to access from other shards via const&.
@@ -52,4 +53,5 @@ public:
const bytes& representation() const { return _data; }
friend std::ostream& operator<<(std::ostream& os, const canonical_mutation& cm);
};


@@ -22,6 +22,9 @@
#pragma once
#include <vector>
#include <sys/types.h>
// Single-pass range over cartesian product of vectors.
// Note:


@@ -1,604 +0,0 @@
/*
* Copyright (C) 2019 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include <utility>
#include <algorithm>
#include <seastar/util/defer.hh>
#include <seastar/core/thread.hh>
#include "cdc/cdc.hh"
#include "bytes.hh"
#include "database.hh"
#include "db/config.hh"
#include "dht/murmur3_partitioner.hh"
#include "partition_slice_builder.hh"
#include "schema.hh"
#include "schema_builder.hh"
#include "service/migration_manager.hh"
#include "service/storage_service.hh"
#include "types/tuple.hh"
#include "cql3/statements/select_statement.hh"
#include "cql3/multi_column_relation.hh"
#include "cql3/tuples.hh"
#include "log.hh"
using locator::snitch_ptr;
using locator::token_metadata;
using locator::topology;
using seastar::sstring;
using service::migration_manager;
using service::storage_proxy;
namespace std {
template<> struct hash<std::pair<net::inet_address, unsigned int>> {
std::size_t operator()(const std::pair<net::inet_address, unsigned int> &p) const {
return std::hash<net::inet_address>{}(p.first) ^ std::hash<int>{}(p.second);
}
};
}
using namespace std::chrono_literals;
static logging::logger cdc_log("cdc");
namespace cdc {
using operation_native_type = std::underlying_type_t<operation>;
using column_op_native_type = std::underlying_type_t<column_op>;
sstring log_name(const sstring& table_name) {
static constexpr auto cdc_log_suffix = "_scylla_cdc_log";
return table_name + cdc_log_suffix;
}
sstring desc_name(const sstring& table_name) {
static constexpr auto cdc_desc_suffix = "_scylla_cdc_desc";
return table_name + cdc_desc_suffix;
}
static future<>
remove_log(db_context ctx, const sstring& ks_name, const sstring& table_name) {
try {
return ctx._migration_manager.announce_column_family_drop(
ks_name, log_name(table_name), false);
} catch (exceptions::configuration_exception& e) {
// It's fine if the table does not exist.
return make_ready_future<>();
} catch (...) {
return make_exception_future<>(std::current_exception());
}
}
static future<>
remove_desc(db_context ctx, const sstring& ks_name, const sstring& table_name) {
try {
return ctx._migration_manager.announce_column_family_drop(
ks_name, desc_name(table_name), false);
} catch (exceptions::configuration_exception& e) {
// It's fine if the table does not exist.
return make_ready_future<>();
} catch (...) {
return make_exception_future<>(std::current_exception());
}
}
future<>
remove(db_context ctx, const sstring& ks_name, const sstring& table_name) {
return when_all(remove_log(ctx, ks_name, table_name),
remove_desc(ctx, ks_name, table_name)).discard_result();
}
static future<> setup_log(db_context ctx, const schema& s) {
schema_builder b(s.ks_name(), log_name(s.cf_name()));
b.set_default_time_to_live(gc_clock::duration{s.cdc_options().ttl()});
b.set_comment(sprint("CDC log for %s.%s", s.ks_name(), s.cf_name()));
b.with_column("stream_id", uuid_type, column_kind::partition_key);
b.with_column("time", timeuuid_type, column_kind::clustering_key);
b.with_column("batch_seq_no", int32_type, column_kind::clustering_key);
b.with_column("operation", data_type_for<operation_native_type>());
b.with_column("ttl", long_type);
auto add_columns = [&] (const schema::const_iterator_range_type& columns, bool is_data_col = false) {
for (const auto& column : columns) {
auto type = column.type;
if (is_data_col) {
type = tuple_type_impl::get_instance({ /* op */ data_type_for<column_op_native_type>(), /* value */ type, /* ttl */long_type});
}
b.with_column("_" + column.name(), type);
}
};
add_columns(s.partition_key_columns());
add_columns(s.clustering_key_columns());
add_columns(s.static_columns(), true);
add_columns(s.regular_columns(), true);
return ctx._migration_manager.announce_new_column_family(b.build(), false);
}
static future<> setup_stream_description_table(db_context ctx, const schema& s) {
schema_builder b(s.ks_name(), desc_name(s.cf_name()));
b.set_comment(sprint("CDC description for %s.%s", s.ks_name(), s.cf_name()));
b.with_column("node_ip", inet_addr_type, column_kind::partition_key);
b.with_column("shard_id", int32_type, column_kind::partition_key);
b.with_column("created_at", timestamp_type, column_kind::clustering_key);
b.with_column("stream_id", uuid_type);
return ctx._migration_manager.announce_new_column_family(b.build(), false);
}
// This function assumes setup_stream_description_table was called on |s| before the call to this
// function.
static future<> populate_desc(db_context ctx, const schema& s) {
auto& db = ctx._proxy.get_db().local();
auto desc_schema =
db.find_schema(s.ks_name(), desc_name(s.cf_name()));
auto log_schema =
db.find_schema(s.ks_name(), log_name(s.cf_name()));
auto belongs_to = [&](const gms::inet_address& endpoint,
const unsigned int shard_id,
const int shard_count,
const unsigned int ignore_msb_bits,
const utils::UUID& stream_id) {
const auto log_pk = partition_key::from_singular(*log_schema,
data_value(stream_id));
const auto token = ctx._partitioner.decorate_key(*log_schema, log_pk).token();
if (ctx._token_metadata.get_endpoint(ctx._token_metadata.first_token(token)) != endpoint) {
return false;
}
const auto owning_shard_id = dht::murmur3_partitioner(shard_count, ignore_msb_bits).shard_of(token);
return owning_shard_id == shard_id;
};
std::vector<mutation> mutations;
const auto ts = api::new_timestamp();
const auto ck = clustering_key::from_single_value(
*desc_schema, timestamp_type->decompose(ts));
auto cdef = desc_schema->get_column_definition(to_bytes("stream_id"));
for (const auto& dc : ctx._token_metadata.get_topology().get_datacenter_endpoints()) {
for (const auto& endpoint : dc.second) {
const auto decomposed_ip = inet_addr_type->decompose(endpoint.addr());
const unsigned int shard_count = ctx._snitch->get_shard_count(endpoint);
const unsigned int ignore_msb_bits = ctx._snitch->get_ignore_msb_bits(endpoint);
for (unsigned int shard_id = 0; shard_id < shard_count; ++shard_id) {
const auto pk = partition_key::from_exploded(
*desc_schema, { decomposed_ip, int32_type->decompose(static_cast<int>(shard_id)) });
mutations.emplace_back(desc_schema, pk);
auto stream_id = utils::make_random_uuid();
while (!belongs_to(endpoint, shard_id, shard_count, ignore_msb_bits, stream_id)) {
stream_id = utils::make_random_uuid();
}
auto value = atomic_cell::make_live(*uuid_type,
ts,
uuid_type->decompose(stream_id));
mutations.back().set_cell(ck, *cdef, std::move(value));
}
}
}
return ctx._proxy.mutate(std::move(mutations),
db::consistency_level::QUORUM,
db::no_timeout,
nullptr,
empty_service_permit());
}
future<> setup(db_context ctx, schema_ptr s) {
return seastar::async([ctx = std::move(ctx), s = std::move(s)] {
setup_log(ctx, *s).get();
auto log_guard = seastar::defer([&] { remove_log(ctx, s->ks_name(), s->cf_name()).get(); });
setup_stream_description_table(ctx, *s).get();
auto desc_guard = seastar::defer([&] { remove_desc(ctx, s->ks_name(), s->cf_name()).get(); });
populate_desc(ctx, *s).get();
desc_guard.cancel();
log_guard.cancel();
});
}
db_context db_context::builder::build() {
return db_context{
_proxy,
_migration_manager ? _migration_manager->get() : service::get_local_migration_manager(),
_token_metadata ? _token_metadata->get() : service::get_local_storage_service().get_token_metadata(),
_snitch ? _snitch->get() : locator::i_endpoint_snitch::get_local_snitch_ptr(),
_partitioner ? _partitioner->get() : dht::global_partitioner()
};
}
class transformer final {
public:
using streams_type = std::unordered_map<std::pair<net::inet_address, unsigned int>, utils::UUID>;
private:
db_context _ctx;
schema_ptr _schema;
schema_ptr _log_schema;
utils::UUID _time;
bytes _decomposed_time;
::shared_ptr<const transformer::streams_type> _streams;
const column_definition& _op_col;
clustering_key set_pk_columns(const partition_key& pk, int batch_no, mutation& m) const {
const auto log_ck = clustering_key::from_exploded(
*m.schema(), { _decomposed_time, int32_type->decompose(batch_no) });
auto pk_value = pk.explode(*_schema);
size_t pos = 0;
for (const auto& column : _schema->partition_key_columns()) {
assert (pos < pk_value.size());
auto cdef = m.schema()->get_column_definition(to_bytes("_" + column.name()));
auto value = atomic_cell::make_live(*column.type,
_time.timestamp(),
bytes_view(pk_value[pos]));
m.set_cell(log_ck, *cdef, std::move(value));
++pos;
}
return log_ck;
}
void set_operation(const clustering_key& ck, operation op, mutation& m) const {
m.set_cell(ck, _op_col, atomic_cell::make_live(*_op_col.type, _time.timestamp(), _op_col.type->decompose(operation_native_type(op))));
}
partition_key stream_id(const net::inet_address& ip, unsigned int shard_id) const {
auto it = _streams->find(std::make_pair(ip, shard_id));
if (it == std::end(*_streams)) {
throw std::runtime_error(format("No stream found for node {} and shard {}", ip, shard_id));
}
return partition_key::from_exploded(*_log_schema, { uuid_type->decompose(it->second) });
}
public:
transformer(db_context ctx, schema_ptr s, ::shared_ptr<const transformer::streams_type> streams)
: _ctx(ctx)
, _schema(std::move(s))
, _log_schema(ctx._proxy.get_db().local().find_schema(_schema->ks_name(), log_name(_schema->cf_name())))
, _time(utils::UUID_gen::get_time_UUID())
, _decomposed_time(timeuuid_type->decompose(_time))
, _streams(std::move(streams))
, _op_col(*_log_schema->get_column_definition(to_bytes("operation")))
{}
// TODO: is pre-image data based on a query enough? We only have actual column data. Do we need
// more details like tombstones/TTL? Probably not, but keep it in mind.
mutation transform(const mutation& m, const cql3::untyped_result_set* rs = nullptr) const {
auto& t = m.token();
auto&& ep = _ctx._token_metadata.get_endpoint(
_ctx._token_metadata.first_token(t));
if (!ep) {
throw std::runtime_error(format("No owner found for key {}", m.decorated_key()));
}
auto shard_id = dht::murmur3_partitioner(_ctx._snitch->get_shard_count(*ep), _ctx._snitch->get_ignore_msb_bits(*ep)).shard_of(t);
mutation res(_log_schema, stream_id(ep->addr(), shard_id));
auto& p = m.partition();
if (p.partition_tombstone()) {
// Partition deletion
auto log_ck = set_pk_columns(m.key(), 0, res);
set_operation(log_ck, operation::partition_delete, res);
} else if (!p.row_tombstones().empty()) {
// range deletion
int batch_no = 0;
for (auto& rt : p.row_tombstones()) {
auto set_bound = [&] (const clustering_key& log_ck, const clustering_key_prefix& ckp) {
auto exploded = ckp.explode(*_schema);
size_t pos = 0;
for (const auto& column : _schema->clustering_key_columns()) {
if (pos >= exploded.size()) {
break;
}
auto cdef = _log_schema->get_column_definition(to_bytes("_" + column.name()));
auto value = atomic_cell::make_live(*column.type,
_time.timestamp(),
bytes_view(exploded[pos]));
res.set_cell(log_ck, *cdef, std::move(value));
++pos;
}
};
{
auto log_ck = set_pk_columns(m.key(), batch_no, res);
set_bound(log_ck, rt.start);
// TODO: separate inclusive/exclusive range
set_operation(log_ck, operation::range_delete_start, res);
++batch_no;
}
{
auto log_ck = set_pk_columns(m.key(), batch_no, res);
set_bound(log_ck, rt.end);
// TODO: separate inclusive/exclusive range
set_operation(log_ck, operation::range_delete_end, res);
++batch_no;
}
}
} else {
// should be update or deletion
int batch_no = 0;
for (const rows_entry& r : p.clustered_rows()) {
auto ck_value = r.key().explode(*_schema);
std::optional<clustering_key> pikey;
const cql3::untyped_result_set_row * pirow = nullptr;
if (rs) {
for (auto& utr : *rs) {
bool match = true;
for (auto& c : _schema->clustering_key_columns()) {
auto rv = utr.get_view(c.name_as_text());
auto cv = r.key().get_component(*_schema, c.component_index());
if (rv != cv) {
match = false;
break;
}
}
if (match) {
pikey = set_pk_columns(m.key(), batch_no, res);
set_operation(*pikey, operation::pre_image, res);
pirow = &utr;
++batch_no;
break;
}
}
}
auto log_ck = set_pk_columns(m.key(), batch_no, res);
size_t pos = 0;
for (const auto& column : _schema->clustering_key_columns()) {
assert (pos < ck_value.size());
auto cdef = _log_schema->get_column_definition(to_bytes("_" + column.name()));
res.set_cell(log_ck, *cdef, atomic_cell::make_live(*column.type, _time.timestamp(), bytes_view(ck_value[pos])));
if (pirow) {
assert(pirow->has(column.name_as_text()));
res.set_cell(*pikey, *cdef, atomic_cell::make_live(*column.type, _time.timestamp(), bytes_view(ck_value[pos])));
}
++pos;
}
std::vector<bytes_opt> values(3);
auto process_cells = [&](const row& r, column_kind ckind) {
r.for_each_cell([&](column_id id, const atomic_cell_or_collection& cell) {
auto& cdef = _schema->column_at(ckind, id);
auto* dst = _log_schema->get_column_definition(to_bytes("_" + cdef.name()));
// todo: collections.
if (cdef.is_atomic()) {
column_op op;
values[1] = values[2] = std::nullopt;
auto view = cell.as_atomic_cell(cdef);
if (view.is_live()) {
op = column_op::set;
values[1] = view.value().linearize();
if (view.is_live_and_has_ttl()) {
values[2] = long_type->decompose(data_value(view.ttl().count()));
}
} else {
op = column_op::del;
}
values[0] = data_type_for<column_op_native_type>()->decompose(data_value(static_cast<column_op_native_type>(op)));
res.set_cell(log_ck, *dst, atomic_cell::make_live(*dst->type, _time.timestamp(), tuple_type_impl::build_value(values)));
if (pirow && pirow->has(cdef.name_as_text())) {
values[0] = data_type_for<column_op_native_type>()->decompose(data_value(static_cast<column_op_native_type>(column_op::set)));
values[1] = pirow->get_blob(cdef.name_as_text());
values[2] = std::nullopt;
assert(std::addressof(res.partition().clustered_row(*_log_schema, *pikey)) != std::addressof(res.partition().clustered_row(*_log_schema, log_ck)));
assert(pikey->explode() != log_ck.explode());
res.set_cell(*pikey, *dst, atomic_cell::make_live(*dst->type, _time.timestamp(), tuple_type_impl::build_value(values)));
}
} else {
cdc_log.warn("Non-atomic cell ignored {}.{}:{}", _schema->ks_name(), _schema->cf_name(), cdef.name_as_text());
}
});
};
process_cells(r.row().cells(), column_kind::regular_column);
process_cells(p.static_row().get(), column_kind::static_column);
set_operation(log_ck, operation::update, res);
++batch_no;
}
}
return res;
}
static db::timeout_clock::time_point default_timeout() {
return db::timeout_clock::now() + 10s;
}
future<lw_shared_ptr<cql3::untyped_result_set>> pre_image_select(
service::storage_proxy& proxy,
service::client_state& client_state,
db::consistency_level cl,
const mutation& m)
{
auto& p = m.partition();
if (p.partition_tombstone() || !p.row_tombstones().empty() || p.clustered_rows().empty()) {
return make_ready_future<lw_shared_ptr<cql3::untyped_result_set>>();
}
dht::partition_range_vector partition_ranges{dht::partition_range(m.decorated_key())};
auto&& pc = _schema->partition_key_columns();
auto&& cc = _schema->clustering_key_columns();
std::vector<query::clustering_range> bounds;
if (cc.empty()) {
bounds.push_back(query::clustering_range::make_open_ended_both_sides());
} else {
for (const rows_entry& r : p.clustered_rows()) {
auto& ck = r.key();
bounds.push_back(query::clustering_range::make_singular(ck));
}
}
std::vector<const column_definition*> columns;
columns.reserve(_schema->all_columns().size());
std::transform(pc.begin(), pc.end(), std::back_inserter(columns), [](auto& c) { return &c; });
std::transform(cc.begin(), cc.end(), std::back_inserter(columns), [](auto& c) { return &c; });
query::column_id_vector static_columns, regular_columns;
auto sk = column_kind::static_column;
auto rk = column_kind::regular_column;
// TODO: this assumes all mutations touch the same set of columns. This might not be true, and we may need to do more horrible set operation here.
for (auto& [r, cids, kind] : { std::tie(p.static_row().get(), static_columns, sk), std::tie(p.clustered_rows().begin()->row().cells(), regular_columns, rk) }) {
r.for_each_cell([&](column_id id, const atomic_cell_or_collection&) {
auto& cdef =_schema->column_at(kind, id);
cids.emplace_back(id);
columns.emplace_back(&cdef);
});
}
auto selection = cql3::selection::selection::for_columns(_schema, std::move(columns));
auto partition_slice = query::partition_slice(std::move(bounds), std::move(static_columns), std::move(regular_columns), selection->get_query_options());
auto command = ::make_lw_shared<query::read_command>(_schema->id(), _schema->version(), partition_slice, query::max_partitions);
return proxy.query(_schema, std::move(command), std::move(partition_ranges), cl, service::storage_proxy::coordinator_query_options(default_timeout(), empty_service_permit(), client_state)).then(
[this, partition_slice = std::move(partition_slice), selection = std::move(selection)] (service::storage_proxy::coordinator_query_result qr) -> lw_shared_ptr<cql3::untyped_result_set> {
cql3::selection::result_set_builder builder(*selection, gc_clock::now(), cql_serialization_format::latest());
query::result_view::consume(*qr.query_result, partition_slice, cql3::selection::result_set_builder::visitor(builder, *_schema, *selection));
auto result_set = builder.build();
if (!result_set || result_set->empty()) {
return {};
}
return make_lw_shared<cql3::untyped_result_set>(*result_set);
});
}
};
// This class is used to build a mapping from <node ip, shard id> to stream_id
// It is used as a consumer for rows returned by the query to CDC Description Table
class streams_builder {
const schema& _schema;
transformer::streams_type _streams;
net::inet_address _node_ip = net::inet_address();
unsigned int _shard_id = 0;
api::timestamp_type _latest_row_timestamp = api::min_timestamp;
utils::UUID _latest_row_stream_id = utils::UUID();
public:
streams_builder(const schema& s) : _schema(s) {}
void accept_new_partition(const partition_key& key, uint32_t row_count) {
auto exploded = key.explode(_schema);
_node_ip = value_cast<net::inet_address>(inet_addr_type->deserialize(exploded[0]));
_shard_id = static_cast<unsigned int>(value_cast<int>(int32_type->deserialize(exploded[1])));
_latest_row_timestamp = api::min_timestamp;
_latest_row_stream_id = utils::UUID();
}
void accept_new_partition(uint32_t row_count) {
assert(false);
}
void accept_new_row(
const clustering_key& key,
const query::result_row_view& static_row,
const query::result_row_view& row) {
auto row_iterator = row.iterator();
api::timestamp_type timestamp = value_cast<db_clock::time_point>(
timestamp_type->deserialize(key.explode(_schema)[0])).time_since_epoch().count();
if (timestamp <= _latest_row_timestamp) {
return;
}
_latest_row_timestamp = timestamp;
for (auto&& cdef : _schema.regular_columns()) {
if (cdef.name_as_text() != "stream_id") {
row_iterator.skip(cdef);
continue;
}
auto val_opt = row_iterator.next_atomic_cell();
assert(val_opt);
val_opt->value().with_linearized([&] (bytes_view bv) {
_latest_row_stream_id = value_cast<utils::UUID>(uuid_type->deserialize(bv));
});
}
}
void accept_new_row(const query::result_row_view& static_row, const query::result_row_view& row) {
assert(false);
}
void accept_partition_end(const query::result_row_view& static_row) {
_streams.emplace(std::make_pair(_node_ip, _shard_id), _latest_row_stream_id);
}
transformer::streams_type build() {
return std::move(_streams);
}
};
static future<::shared_ptr<transformer::streams_type>> get_streams(
db_context ctx,
const sstring& ks_name,
const sstring& cf_name,
lowres_clock::time_point timeout,
service::query_state& qs) {
auto s =
ctx._proxy.get_db().local().find_schema(ks_name, desc_name(cf_name));
query::read_command cmd(
s->id(),
s->version(),
partition_slice_builder(*s).with_no_static_columns().build());
return ctx._proxy.query(
s,
make_lw_shared(std::move(cmd)),
{dht::partition_range::make_open_ended_both_sides()},
db::consistency_level::QUORUM,
{timeout, qs.get_permit(), qs.get_client_state()}).then([s = std::move(s)] (auto qr) mutable {
return query::result_view::do_with(*qr.query_result,
[s = std::move(s)] (query::result_view v) {
auto slice = partition_slice_builder(*s)
.with_no_static_columns()
.build();
streams_builder builder{ *s };
v.consume(slice, builder);
return ::make_shared<transformer::streams_type>(builder.build());
});
});
}
future<std::vector<mutation>> append_log_mutations(
db_context ctx,
schema_ptr s,
service::storage_proxy::clock_type::time_point timeout,
service::query_state& qs,
std::vector<mutation> muts) {
auto mp = ::make_lw_shared<std::vector<mutation>>(std::move(muts));
return get_streams(ctx, s->ks_name(), s->cf_name(), timeout, qs).then([ctx, s = std::move(s), mp, &qs](::shared_ptr<transformer::streams_type> streams) mutable {
mp->reserve(2 * mp->size());
auto trans = make_lw_shared<transformer>(ctx, s, std::move(streams));
auto i = mp->begin();
auto e = mp->end();
return parallel_for_each(i, e, [ctx, &qs, trans, mp](mutation& m) {
return trans->pre_image_select(ctx._proxy, qs.get_client_state(), db::consistency_level::LOCAL_QUORUM, m).then([trans, mp, &m](lw_shared_ptr<cql3::untyped_result_set> rs) {
mp->push_back(trans->transform(m, rs.get()));
});
}).then([mp] {
return std::move(*mp);
});
});
}
} // namespace cdc

@@ -1,233 +0,0 @@
/*
* Copyright (C) 2019 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <functional>
#include <optional>
#include <map>
#include <string>
#include <vector>
#include <seastar/core/future.hh>
#include <seastar/core/lowres_clock.hh>
#include <seastar/core/shared_ptr.hh>
#include <seastar/core/sstring.hh>
#include "exceptions/exceptions.hh"
#include "json.hh"
#include "timestamp.hh"
class schema;
using schema_ptr = seastar::lw_shared_ptr<const schema>;
namespace locator {
class snitch_ptr;
class token_metadata;
} // namespace locator
namespace service {
class migration_manager;
class storage_proxy;
class query_state;
} // namespace service
namespace dht {
class i_partitioner;
} // namespace dht
class mutation;
class partition_key;
namespace cdc {
class options final {
bool _enabled = false;
bool _preimage = false;
bool _postimage = false;
int _ttl = 86400; // 24h in seconds
public:
options() = default;
options(const std::map<sstring, sstring>& map) {
if (map.find("enabled") == std::end(map)) {
return;
}
for (auto& p : map) {
if (p.first == "enabled") {
_enabled = p.second == "true";
} else if (p.first == "preimage") {
_preimage = p.second == "true";
} else if (p.first == "postimage") {
_postimage = p.second == "true";
} else if (p.first == "ttl") {
_ttl = std::stoi(p.second);
} else {
throw exceptions::configuration_exception("Invalid CDC option: " + p.first);
}
}
}
std::map<sstring, sstring> to_map() const {
if (!_enabled) {
return {};
}
return {
{ "enabled", _enabled ? "true" : "false" },
{ "preimage", _preimage ? "true" : "false" },
{ "postimage", _postimage ? "true" : "false" },
{ "ttl", std::to_string(_ttl) },
};
}
sstring to_sstring() const {
return json::to_json(to_map());
}
bool enabled() const { return _enabled; }
bool preimage() const { return _preimage; }
bool postimage() const { return _postimage; }
int ttl() const { return _ttl; }
bool operator==(const options& o) const {
return _enabled == o._enabled && _preimage == o._preimage && _postimage == o._postimage && _ttl == o._ttl;
}
bool operator!=(const options& o) const {
return !(*this == o);
}
};
struct db_context final {
service::storage_proxy& _proxy;
service::migration_manager& _migration_manager;
locator::token_metadata& _token_metadata;
locator::snitch_ptr& _snitch;
dht::i_partitioner& _partitioner;
class builder final {
service::storage_proxy& _proxy;
std::optional<std::reference_wrapper<service::migration_manager>> _migration_manager;
std::optional<std::reference_wrapper<locator::token_metadata>> _token_metadata;
std::optional<std::reference_wrapper<locator::snitch_ptr>> _snitch;
std::optional<std::reference_wrapper<dht::i_partitioner>> _partitioner;
public:
builder(service::storage_proxy& proxy) : _proxy(proxy) { }
builder& with_migration_manager(service::migration_manager& migration_manager) {
_migration_manager = migration_manager;
return *this;
}
builder& with_token_metadata(locator::token_metadata& token_metadata) {
_token_metadata = token_metadata;
return *this;
}
builder& with_snitch(locator::snitch_ptr& snitch) {
_snitch = snitch;
return *this;
}
builder& with_partitioner(dht::i_partitioner& partitioner) {
_partitioner = partitioner;
return *this;
}
db_context build();
};
};
/// \brief Sets up CDC related tables for a given table
///
/// This function not only creates CDC Log and CDC Description for a given table
/// but also populates CDC Description with a list of change streams.
///
/// \param[in] ctx object with references to database components
/// \param[in] schema schema of a table for which CDC tables are being created
seastar::future<> setup(db_context ctx, schema_ptr schema);
// cdc log table operation
enum class operation : int8_t {
// note: these values will eventually be read by a third party, probably not privy to this
// enum decl, so don't change the constant values (or the datatype).
pre_image = 0, update = 1, row_delete = 2, range_delete_start = 3, range_delete_end = 4, partition_delete = 5
};
// cdc log data column operation
enum class column_op : int8_t {
// same as "operation". Do not edit values or type unless you _really_ want to.
set = 0, del = 1, add = 2,
};
/// \brief Deletes CDC Log and CDC Description tables for a given table
///
/// This function cleans up all CDC related tables created for a given table.
/// At the moment, CDC Log and CDC Description are the only affected tables.
/// It's ok if some/all of them don't exist.
///
/// \param[in] ctx object with references to database components
/// \param[in] ks_name keyspace name of a table for which CDC tables are removed
/// \param[in] table_name name of a table for which CDC tables are removed
///
/// \pre This function works correctly whether or not CDC Log and/or CDC Description
/// exist.
seastar::future<>
remove(db_context ctx, const seastar::sstring& ks_name, const seastar::sstring& table_name);
seastar::sstring log_name(const seastar::sstring& table_name);
seastar::sstring desc_name(const seastar::sstring& table_name);
/// \brief For each mutation in the set appends related CDC Log mutation
///
/// This function should be called with a set of mutations of a table
/// with CDC enabled. Returned set of mutations contains all original mutations
/// and for each original mutation appends a mutation to CDC Log that reflects
/// the change.
///
/// \param[in] ctx object with references to database components
/// \param[in] s schema of a CDC enabled table which is being modified
/// \param[in] timeout period of time after which a request is considered timed out
/// \param[in] qs the state of the query that's being executed
/// \param[in] mutations set of changes of a CDC enabled table
///
/// \return set of mutations from input parameter with relevant CDC Log mutations appended
///
/// \pre CDC Log and CDC Description have to exist
/// \pre CDC Description has to be in sync with cluster topology
///
/// \note At the moment, cluster topology changes are not supported,
/// so the assumption that CDC Description is in sync with cluster topology
/// is easy to enforce. When support for cluster topology changes is added
/// it has to make sure the assumption holds.
seastar::future<std::vector<mutation>> append_log_mutations(
db_context ctx,
schema_ptr s,
lowres_clock::time_point timeout,
service::query_state& qs,
std::vector<mutation> mutations);
} // namespace cdc
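
A minimal sketch of how the `options` class in the removed header above parses and serializes its settings. It is not the Scylla implementation: `cdc_options_sketch` is a hypothetical stand-in that uses `std::string` instead of `sstring` and `std::invalid_argument` instead of `exceptions::configuration_exception`, so that it compiles on its own.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for cdc::options (std::string replaces sstring).
class cdc_options_sketch {
    bool _enabled = false;
    bool _preimage = false;
    bool _postimage = false;
    int _ttl = 86400; // 24h in seconds
public:
    cdc_options_sketch() = default;

    explicit cdc_options_sketch(const std::map<std::string, std::string>& map) {
        // As in the original: without an "enabled" key the map is ignored
        // entirely and all defaults are kept.
        if (map.find("enabled") == map.end()) {
            return;
        }
        for (const auto& [key, value] : map) {
            if (key == "enabled") {
                _enabled = value == "true";
            } else if (key == "preimage") {
                _preimage = value == "true";
            } else if (key == "postimage") {
                _postimage = value == "true";
            } else if (key == "ttl") {
                _ttl = std::stoi(value);
            } else {
                // Assumption: a standard exception replaces the Scylla one.
                throw std::invalid_argument("Invalid CDC option: " + key);
            }
        }
    }

    // Disabled options serialize to an empty map, mirroring the original.
    std::map<std::string, std::string> to_map() const {
        if (!_enabled) {
            return {};
        }
        return {
            { "enabled", "true" },
            { "preimage", _preimage ? "true" : "false" },
            { "postimage", _postimage ? "true" : "false" },
            { "ttl", std::to_string(_ttl) },
        };
    }

    bool enabled() const { return _enabled; }
    bool preimage() const { return _preimage; }
    bool postimage() const { return _postimage; }
    int ttl() const { return _ttl; }
};
```

One consequence worth noting: a map that sets `preimage` but omits `enabled` yields default (disabled) options, and `to_map()` on disabled options returns an empty map, so the round trip drops every other setting.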

52
cdc/cdc_extension.hh Normal file

@@ -0,0 +1,52 @@
/*
* Copyright 2020 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU Affero General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include "serializer.hh"
#include "db/extensions.hh"
#include "cdc/cdc_options.hh"
#include "schema.hh"
namespace cdc {
class cdc_extension : public schema_extension {
cdc::options _cdc_options;
public:
static constexpr auto NAME = "cdc";
cdc_extension() = default;
explicit cdc_extension(std::map<sstring, sstring> tags) : _cdc_options(std::move(tags)) {}
explicit cdc_extension(const bytes& b) : _cdc_options(cdc_extension::deserialize(b)) {}
explicit cdc_extension(const sstring& s) {
throw std::logic_error("Cannot create cdc info from string");
}
bytes serialize() const override {
return ser::serialize_to_buffer<bytes>(_cdc_options.to_map());
}
static std::map<sstring, sstring> deserialize(const bytes_view& buffer) {
return ser::deserialize_from_buffer(buffer, boost::type<std::map<sstring, sstring>>());
}
const options& get_options() const {
return _cdc_options;
}
};
}

51
cdc/cdc_options.hh Normal file

@@ -0,0 +1,51 @@
/*
* Copyright (C) 2019 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <map>
#include <seastar/core/sstring.hh>
#include "seastarx.hh"
namespace cdc {
class options final {
bool _enabled = false;
bool _preimage = false;
bool _postimage = false;
int _ttl = 86400; // 24h in seconds
public:
options() = default;
options(const std::map<sstring, sstring>& map);
std::map<sstring, sstring> to_map() const;
sstring to_sstring() const;
bool enabled() const { return _enabled; }
bool preimage() const { return _preimage; }
bool postimage() const { return _postimage; }
int ttl() const { return _ttl; }
bool operator==(const options& o) const;
bool operator!=(const options& o) const;
};
} // namespace cdc

405
cdc/generation.cc Normal file

@@ -0,0 +1,405 @@
/*
* Copyright (C) 2019 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include <boost/type.hpp>
#include <random>
#include <unordered_set>
#include <seastar/core/sleep.hh>
#include "keys.hh"
#include "schema_builder.hh"
#include "db/config.hh"
#include "db/system_keyspace.hh"
#include "db/system_distributed_keyspace.hh"
#include "dht/token-sharding.hh"
#include "locator/token_metadata.hh"
#include "gms/application_state.hh"
#include "gms/inet_address.hh"
#include "gms/gossiper.hh"
#include "cdc/generation.hh"
extern logging::logger cdc_log;
static int get_shard_count(const gms::inet_address& endpoint, const gms::gossiper& g) {
auto ep_state = g.get_application_state_ptr(endpoint, gms::application_state::SHARD_COUNT);
return ep_state ? std::stoi(ep_state->value) : -1;
}
static unsigned get_sharding_ignore_msb(const gms::inet_address& endpoint, const gms::gossiper& g) {
auto ep_state = g.get_application_state_ptr(endpoint, gms::application_state::IGNORE_MSB_BITS);
return ep_state ? std::stoi(ep_state->value) : 0;
}
namespace cdc {
extern const api::timestamp_clock::duration generation_leeway =
std::chrono::duration_cast<api::timestamp_clock::duration>(std::chrono::seconds(5));
static void copy_int_to_bytes(int64_t i, size_t offset, bytes& b) {
i = net::hton(i);
std::copy_n(reinterpret_cast<int8_t*>(&i), sizeof(int64_t), b.begin() + offset);
}
stream_id::stream_id(int64_t first, int64_t second)
: _value(bytes::initialized_later(), 2 * sizeof(int64_t))
{
copy_int_to_bytes(first, 0, _value);
copy_int_to_bytes(second, sizeof(int64_t), _value);
}
stream_id::stream_id(bytes b) : _value(std::move(b)) { }
bool stream_id::is_set() const {
return !_value.empty();
}
bool stream_id::operator==(const stream_id& o) const {
return _value == o._value;
}
bool stream_id::operator<(const stream_id& o) const {
return _value < o._value;
}
static int64_t bytes_to_int64(const bytes& b, size_t offset) {
assert(b.size() >= offset + sizeof(int64_t));
int64_t res;
std::copy_n(b.begin() + offset, sizeof(int64_t), reinterpret_cast<int8_t *>(&res));
return net::ntoh(res);
}
int64_t stream_id::first() const {
return bytes_to_int64(_value, 0);
}
int64_t stream_id::second() const {
return bytes_to_int64(_value, sizeof(int64_t));
}
const bytes& stream_id::to_bytes() const {
return _value;
}
partition_key stream_id::to_partition_key(const schema& log_schema) const {
return partition_key::from_single_value(log_schema, _value);
}
bool token_range_description::operator==(const token_range_description& o) const {
return token_range_end == o.token_range_end && streams == o.streams
&& sharding_ignore_msb == o.sharding_ignore_msb;
}
topology_description::topology_description(std::vector<token_range_description> entries)
: _entries(std::move(entries)) {}
bool topology_description::operator==(const topology_description& o) const {
return _entries == o._entries;
}
const std::vector<token_range_description>& topology_description::entries() const {
return _entries;
}
static stream_id make_random_stream_id() {
static thread_local std::mt19937_64 rand_gen(std::random_device().operator()());
static thread_local std::uniform_int_distribution<int64_t> rand_dist(std::numeric_limits<int64_t>::min());
return {rand_dist(rand_gen), rand_dist(rand_gen)};
}
/* Given:
* 1. a set of tokens which split the token ring into token ranges (vnodes),
* 2. information on how each token range is distributed among its owning node's shards
* this function tries to generate a set of CDC stream identifiers such that for each
* shard and vnode pair there exists a stream whose token falls into this
* vnode and is owned by this shard.
*
* It then builds a cdc::topology_description which maps tokens to these
* found stream identifiers, such that if token T is owned by shard S in vnode V,
* it gets mapped to the stream identifier generated for (S, V).
*/
// Run in seastar::async context.
topology_description generate_topology_description(
const db::config& cfg,
const std::unordered_set<dht::token>& bootstrap_tokens,
const locator::token_metadata& token_metadata,
const gms::gossiper& gossiper) {
if (bootstrap_tokens.empty()) {
throw std::runtime_error(
"cdc: bootstrap tokens is empty in generate_topology_description");
}
auto tokens = token_metadata.sorted_tokens();
tokens.insert(tokens.end(), bootstrap_tokens.begin(), bootstrap_tokens.end());
std::sort(tokens.begin(), tokens.end());
tokens.erase(std::unique(tokens.begin(), tokens.end()), tokens.end());
std::vector<token_range_description> entries(tokens.size());
int spots_to_fill = 0;
for (size_t i = 0; i < tokens.size(); ++i) {
auto& entry = entries[i];
entry.token_range_end = tokens[i];
if (bootstrap_tokens.count(entry.token_range_end) > 0) {
entry.streams.resize(smp::count);
entry.sharding_ignore_msb = cfg.murmur3_partitioner_ignore_msb_bits();
} else {
auto endpoint = token_metadata.get_endpoint(entry.token_range_end);
if (!endpoint) {
throw std::runtime_error(format("Can't find endpoint for token {}", entry.token_range_end));
}
auto sc = get_shard_count(*endpoint, gossiper);
entry.streams.resize(sc > 0 ? sc : 1);
entry.sharding_ignore_msb = get_sharding_ignore_msb(*endpoint, gossiper);
}
spots_to_fill += entry.streams.size();
}
auto schema = schema_builder("fake_ks", "fake_table")
.with_column("stream_id", bytes_type, column_kind::partition_key)
.build();
auto quota = std::chrono::seconds(spots_to_fill / 2000 + 1);
auto start_time = std::chrono::system_clock::now();
// For each pair (i, j), 0 <= i < streams.size(), 0 <= j < streams[i].size(),
// try to find a stream (stream[i][j]) such that the token of this stream will get mapped to this stream
// (refer to the comments above topology_description's definition to understand how it describes the mapping).
// We find the streams by randomly generating them and checking into which pairs they get mapped.
// NOTE: this algorithm is temporary and will be replaced after per-table-partitioner feature gets merged in.
repeat([&] {
for (int i = 0; i < 500; ++i) {
auto stream_id = make_random_stream_id();
auto token = dht::get_token(*schema, stream_id.to_partition_key(*schema));
// Find the token range into which our stream_id's token landed.
auto it = std::lower_bound(tokens.begin(), tokens.end(), token);
auto& entry = entries[it != tokens.end() ? std::distance(tokens.begin(), it) : 0];
auto shard_id = dht::shard_of(entry.streams.size(), entry.sharding_ignore_msb, token);
assert(shard_id < entry.streams.size());
if (!entry.streams[shard_id].is_set()) {
--spots_to_fill;
entry.streams[shard_id] = stream_id;
}
}
if (!spots_to_fill) {
return stop_iteration::yes;
}
auto now = std::chrono::system_clock::now();
auto passed = std::chrono::duration_cast<std::chrono::seconds>(now - start_time);
if (passed > quota) {
return stop_iteration::yes;
}
return stop_iteration::no;
}).get();
if (spots_to_fill) {
// We were not able to generate stream ids for each (token range, shard) pair.
// For each range that has a stream, for each shard for this range that doesn't have a stream,
// use the stream id of the next shard for this range.
// For each range that doesn't have any stream,
// use streams of the first range to the left which does have a stream.
cdc_log.warn("Generation of CDC streams failed to create streams for some (vnode, shard) pair."
" This can lead to worse performance.");
stream_id some_stream;
size_t idx = 0;
for (; idx < entries.size(); ++idx) {
for (auto s: entries[idx].streams) {
if (s.is_set()) {
some_stream = s;
break;
}
}
if (some_stream.is_set()) {
break;
}
}
assert(idx != entries.size() && some_stream.is_set());
// Iterate over all ranges in the clockwise direction, starting with the one we found a stream for.
for (size_t off = 0; off < entries.size(); ++off) {
auto& ss = entries[(idx + off) % entries.size()].streams;
int last_set_stream_idx = ss.size() - 1;
while (last_set_stream_idx > -1 && !ss[last_set_stream_idx].is_set()) {
--last_set_stream_idx;
}
if (last_set_stream_idx == -1) {
cdc_log.warn(
"CDC wasn't able to generate any stream for vnode ({}, {}]. We'll use another vnode's streams"
" instead. This might lead to inconsistencies.",
tokens[(idx + off + entries.size() - 1) % entries.size()], tokens[(idx + off) % entries.size()]);
ss[0] = some_stream;
last_set_stream_idx = 0;
}
some_stream = ss[last_set_stream_idx];
// Replace 'unset' stream ids with indexes below last_set_stream_idx
for (int s_idx = last_set_stream_idx - 1; s_idx > -1; --s_idx) {
if (ss[s_idx].is_set()) {
some_stream = ss[s_idx];
} else {
ss[s_idx] = some_stream;
}
}
// Replace 'unset' stream ids with indexes above last_set_stream_idx
for (int s_idx = ss.size() - 1; s_idx > last_set_stream_idx; --s_idx) {
if (ss[s_idx].is_set()) {
some_stream = ss[s_idx];
} else {
ss[s_idx] = some_stream;
}
}
}
}
return {std::move(entries)};
}
bool should_propose_first_generation(const gms::inet_address& me, const gms::gossiper& g) {
auto my_host_id = g.get_host_id(me);
auto& eps = g.get_endpoint_states();
return std::none_of(eps.begin(), eps.end(),
[&] (const std::pair<gms::inet_address, gms::endpoint_state>& ep) {
return my_host_id < g.get_host_id(ep.first);
});
}
future<db_clock::time_point> get_local_streams_timestamp() {
return db::system_keyspace::get_saved_cdc_streams_timestamp().then([] (std::optional<db_clock::time_point> ts) {
if (!ts) {
auto err = format("get_local_streams_timestamp: tried to retrieve streams timestamp after bootstrapping, but it's not present");
cdc_log.error("{}", err);
throw std::runtime_error(err);
}
return *ts;
});
}
// Run inside seastar::async context.
db_clock::time_point make_new_cdc_generation(
const db::config& cfg,
const std::unordered_set<dht::token>& bootstrap_tokens,
const locator::token_metadata& tm,
const gms::gossiper& g,
db::system_distributed_keyspace& sys_dist_ks,
std::chrono::milliseconds ring_delay,
bool for_testing) {
assert(!bootstrap_tokens.empty());
auto gen = generate_topology_description(cfg, bootstrap_tokens, tm, g);
// Begin the race.
auto ts = db_clock::now() + (
for_testing ? std::chrono::milliseconds(0) : (
2 * ring_delay + std::chrono::duration_cast<std::chrono::milliseconds>(generation_leeway)));
sys_dist_ks.insert_cdc_topology_description(ts, std::move(gen), { tm.count_normal_token_owners() }).get();
return ts;
}
std::optional<db_clock::time_point> get_streams_timestamp_for(const gms::inet_address& endpoint, const gms::gossiper& g) {
auto streams_ts_string = g.get_application_state_value(endpoint, gms::application_state::CDC_STREAMS_TIMESTAMP);
cdc_log.trace("endpoint={}, streams_ts_string={}", endpoint, streams_ts_string);
if (streams_ts_string.empty()) {
return {};
}
return db_clock::time_point(db_clock::duration(std::stoll(streams_ts_string)));
}
// Run inside seastar::async context.
static void do_update_streams_description(
db_clock::time_point streams_ts,
db::system_distributed_keyspace& sys_dist_ks,
db::system_distributed_keyspace::context ctx) {
if (sys_dist_ks.cdc_desc_exists(streams_ts, ctx).get0()) {
cdc_log.debug("update_streams_description: description of generation {} already inserted", streams_ts);
return;
}
// We might race with another node also inserting the description, but that's ok. It's an idempotent operation.
auto topo = sys_dist_ks.read_cdc_topology_description(streams_ts, ctx).get0();
if (!topo) {
throw std::runtime_error(format("could not find streams data for timestamp {}", streams_ts));
}
std::set<cdc::stream_id> streams_set;
for (auto& entry: topo->entries()) {
streams_set.insert(entry.streams.begin(), entry.streams.end());
}
std::vector<cdc::stream_id> streams_vec(streams_set.begin(), streams_set.end());
sys_dist_ks.create_cdc_desc(streams_ts, streams_vec, ctx).get();
cdc_log.info("CDC description table successfully updated with generation {}.", streams_ts);
}
void update_streams_description(
db_clock::time_point streams_ts,
shared_ptr<db::system_distributed_keyspace> sys_dist_ks,
noncopyable_function<unsigned()> get_num_token_owners,
abort_source& abort_src) {
try {
do_update_streams_description(streams_ts, *sys_dist_ks, { get_num_token_owners() });
} catch(...) {
cdc_log.warn(
"Could not update CDC description table with generation {}: {}. Will retry in the background.",
streams_ts, std::current_exception());
// It is safe to discard this future: we keep system distributed keyspace alive.
(void)seastar::async([
streams_ts, sys_dist_ks, get_num_token_owners = std::move(get_num_token_owners), &abort_src
] {
while (true) {
sleep_abortable(std::chrono::seconds(60), abort_src).get();
try {
do_update_streams_description(streams_ts, *sys_dist_ks, { get_num_token_owners() });
return;
} catch (...) {
cdc_log.warn(
"Could not update CDC description table with generation {}: {}. Will try again.",
streams_ts, std::current_exception());
}
}
});
}
}
} // namespace cdc
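
A minimal sketch of the byte layout that `stream_id` in the file above appears to use: two `int64_t` values stored back to back in network (big-endian) byte order. The helper names `put_be64`, `get_be64`, and `make_stream_bytes` are hypothetical, and explicit shifts stand in for `net::hton`/`net::ntoh` so the example has no Seastar dependency.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical fixed-size stand-in for the 16-byte `bytes` value of a stream_id.
using stream_bytes = std::array<uint8_t, 16>;

// Write `v` at `offset` in big-endian order (most significant byte first).
inline void put_be64(int64_t v, std::size_t offset, stream_bytes& b) {
    assert(offset + 8 <= b.size());
    const auto u = static_cast<uint64_t>(v);
    for (std::size_t i = 0; i < 8; ++i) {
        b[offset + i] = static_cast<uint8_t>(u >> (56 - 8 * i));
    }
}

// Read a big-endian 64-bit value starting at `offset`.
inline int64_t get_be64(const stream_bytes& b, std::size_t offset) {
    assert(offset + 8 <= b.size());
    uint64_t u = 0;
    for (std::size_t i = 0; i < 8; ++i) {
        u = (u << 8) | b[offset + i];
    }
    return static_cast<int64_t>(u);
}

// Pack (first, second) the way the two-argument stream_id constructor does:
// `first` occupies bytes 0..7, `second` occupies bytes 8..15.
inline stream_bytes make_stream_bytes(int64_t first, int64_t second) {
    stream_bytes b{};
    put_be64(first, 0, b);
    put_be64(second, 8, b);
    return b;
}
```

A fixed byte order keeps the serialized identifier the same regardless of the host's endianness, which matters because the bytes are also used as a partition key.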

176
cdc/generation.hh Normal file

@@ -0,0 +1,176 @@
/*
* Copyright (C) 2019 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
/* This module contains classes and functions used to manage CDC generations:
* sets of CDC stream identifiers used by the cluster to choose partition keys for CDC log writes.
* Each CDC generation begins operating at a specific time point, called the generation's timestamp
* (`cdc_streams_timestamp` or `streams_timestamp` in the code).
* The generation is used by all nodes in the cluster to pick CDC streams until superseded by a new generation.
*
* Functions from this module are used by the node joining procedure to introduce new CDC generations to the cluster
* (which is necessary due to new tokens being inserted into the token ring), or during rolling upgrade
* if CDC is enabled for the first time.
*/
#pragma once
#include <vector>
#include <unordered_set>
#include <seastar/util/noncopyable_function.hh>
#include "database_fwd.hh"
#include "db_clock.hh"
#include "dht/token.hh"
namespace seastar {
class abort_source;
} // namespace seastar
namespace db {
class config;
class system_distributed_keyspace;
} // namespace db
namespace gms {
class inet_address;
class gossiper;
} // namespace gms
namespace locator {
class token_metadata;
} // namespace locator
namespace cdc {
class stream_id final {
bytes _value;
public:
stream_id() = default;
stream_id(int64_t, int64_t);
stream_id(bytes);
bool is_set() const;
bool operator==(const stream_id&) const;
bool operator<(const stream_id&) const;
int64_t first() const;
int64_t second() const;
const bytes& to_bytes() const;
partition_key to_partition_key(const schema& log_schema) const;
};
/* Describes a mapping of tokens to CDC streams in a token range.
*
* The range ends with `token_range_end`. A vector of `token_range_description`s defines the ranges entirely
* (the end of the `i`th range is the beginning of the `(i+1) % size()`th range). Ranges are left-open, right-closed.
*
* Tokens in the range ending with `token_range_end` are mapped to streams in the `streams` vector as follows:
* token `T` is mapped to `streams[j]` if and only if the used partitioner maps `T` to the `j`th shard,
* assuming that the partitioner is configured for `streams.size()` shards and (partitioner's) `sharding_ignore_msb`
* equals the given `sharding_ignore_msb`.
*/
struct token_range_description {
dht::token token_range_end;
std::vector<stream_id> streams;
uint8_t sharding_ignore_msb;
bool operator==(const token_range_description&) const;
};
/* Describes a mapping of tokens to CDC streams in a whole token ring.
*
* Division of the ring to token ranges is defined in terms of `token_range_end`s
* in the `_entries` vector. See the comment above `token_range_description` for explanation.
*/
class topology_description {
std::vector<token_range_description> _entries;
public:
topology_description(std::vector<token_range_description> entries);
bool operator==(const topology_description&) const;
const std::vector<token_range_description>& entries() const;
};
/* Should be called when we're restarting and we noticed that we didn't save any streams timestamp in our local tables,
* which means that we're probably upgrading from a non-CDC/old CDC version (another reason could be
* that there's a bug, or the user messed with our local tables).
*
* It checks whether we should be the node to propose the first generation of CDC streams.
* The chosen condition is arbitrary; it only tries to make sure that no two nodes propose a generation of streams
* when upgrading, and nothing bad happens if they for some reason do (it's mostly an optimization).
*/
bool should_propose_first_generation(const gms::inet_address& me, const gms::gossiper&);
/*
* Read this node's streams generation timestamp stored in the LOCAL table.
* Assumes that the node has successfully bootstrapped, and we're not upgrading from a non-CDC version,
* so the timestamp is present.
*/
future<db_clock::time_point> get_local_streams_timestamp();
/* Generate a new set of CDC streams and insert it into the distributed cdc_topology_description table.
* Returns the timestamp of this new generation.
*
* Should be called when starting the node for the first time (i.e., joining the ring).
*
* Assumes that the system_distributed keyspace is initialized.
*
* The caller of this function is expected to insert this timestamp into the gossiper as fast as possible,
* so that other nodes learn about the generation before their clocks cross the timestamp
* (not guaranteed in the current implementation, but expected to be the common case;
* we assume that `ring_delay` is enough for other nodes to learn about the new generation).
*/
db_clock::time_point make_new_cdc_generation(
const db::config& cfg,
const std::unordered_set<dht::token>& bootstrap_tokens,
const locator::token_metadata& tm,
const gms::gossiper& g,
db::system_distributed_keyspace& sys_dist_ks,
std::chrono::milliseconds ring_delay,
bool for_testing);
/* Retrieves CDC streams generation timestamp from the given endpoint's application state (broadcast through gossip).
* We might be in the middle of a rolling upgrade, so the timestamp might not be there (if the other node didn't upgrade yet),
* but if the cluster already supports CDC, then every newly joining node will propose a new CDC generation,
* which means it will gossip the generation's timestamp.
*/
std::optional<db_clock::time_point> get_streams_timestamp_for(const gms::inet_address& endpoint, const gms::gossiper&);
/* Inform CDC users about a generation of streams (identified by the given timestamp)
* by inserting it into the cdc_description table.
*
* Assumes that the cdc_topology_description table contains this generation.
*
* Returning from this function does not mean that the table update was successful: the function
* might run an asynchronous task in the background.
*
* Run inside seastar::async context.
*/
void update_streams_description(
db_clock::time_point,
shared_ptr<db::system_distributed_keyspace>,
noncopyable_function<unsigned()> get_num_token_owners,
abort_source&);
} // namespace cdc

1394
cdc/log.cc Normal file

File diff suppressed because it is too large

145
cdc/log.hh Normal file

@@ -0,0 +1,145 @@
/*
* Copyright (C) 2019 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
/*
* This module manages CDC log tables. It contains facilities used to:
* - perform schema changes to CDC log tables correspondingly when base tables are changed,
* - perform writes to CDC log tables correspondingly when writes to base tables are made.
*/
#pragma once
#include <functional>
#include <optional>
#include <map>
#include <string>
#include <vector>
#include <seastar/core/future.hh>
#include <seastar/core/lowres_clock.hh>
#include <seastar/core/shared_ptr.hh>
#include <seastar/core/sstring.hh>
#include "exceptions/exceptions.hh"
#include "timestamp.hh"
#include "tracing/trace_state.hh"
#include "cdc_options.hh"
#include "utils/UUID.hh"
class schema;
using schema_ptr = seastar::lw_shared_ptr<const schema>;
namespace locator {
class token_metadata;
} // namespace locator
namespace service {
class migration_notifier;
class storage_proxy;
class query_state;
} // namespace service
class mutation;
class partition_key;
namespace cdc {
struct operation_result_tracker;
class db_context;
class metadata;
/// \brief CDC service, responsible for schema listeners
///
/// CDC service will listen for schema changes and, iff CDC is enabled/changed,
/// create/modify/delete the corresponding log tables etc. as part of the schema change.
///
class cdc_service {
class impl;
std::unique_ptr<impl> _impl;
public:
future<> stop();
cdc_service(service::storage_proxy&);
cdc_service(db_context);
~cdc_service();
// If any of the mutations are cdc enabled, optionally selects preimage, and adds the
// appropriate augments to set the log entries.
// Iff post-image is enabled for any of these, a non-empty callback is also
// returned to be invoked post the mutation query.
future<std::tuple<std::vector<mutation>, lw_shared_ptr<operation_result_tracker>>> augment_mutation_call(
lowres_clock::time_point timeout,
std::vector<mutation>&& mutations,
tracing::trace_state_ptr tr_state
);
bool needs_cdc_augmentation(const std::vector<mutation>&) const;
};
struct db_context final {
service::storage_proxy& _proxy;
service::migration_notifier& _migration_notifier;
locator::token_metadata& _token_metadata;
cdc::metadata& _cdc_metadata;
class builder final {
service::storage_proxy& _proxy;
std::optional<std::reference_wrapper<service::migration_notifier>> _migration_notifier;
std::optional<std::reference_wrapper<locator::token_metadata>> _token_metadata;
std::optional<std::reference_wrapper<cdc::metadata>> _cdc_metadata;
public:
builder(service::storage_proxy& proxy);
builder& with_migration_notifier(service::migration_notifier& migration_notifier);
builder& with_token_metadata(locator::token_metadata& token_metadata);
builder& with_cdc_metadata(cdc::metadata&);
db_context build();
};
};
// cdc log table operation
enum class operation : int8_t {
// note: these values will eventually be read by a third party, probably not privy to this
// enum decl, so don't change the constant values (or the datatype).
pre_image = 0, update = 1, insert = 2, row_delete = 3, partition_delete = 4,
range_delete_start_inclusive = 5, range_delete_start_exclusive = 6, range_delete_end_inclusive = 7, range_delete_end_exclusive = 8,
post_image = 9,
};
bool is_log_for_some_table(const sstring& ks_name, const std::string_view& table_name);
seastar::sstring log_name(const seastar::sstring& table_name);
seastar::sstring log_data_column_name(std::string_view column_name);
seastar::sstring log_meta_column_name(std::string_view column_name);
bytes log_data_column_name_bytes(const bytes& column_name);
bytes log_meta_column_name_bytes(const bytes& column_name);
seastar::sstring log_data_column_deleted_name(std::string_view column_name);
bytes log_data_column_deleted_name_bytes(const bytes& column_name);
seastar::sstring log_data_column_deleted_elements_name(std::string_view column_name);
bytes log_data_column_deleted_elements_name_bytes(const bytes& column_name);
utils::UUID generate_timeuuid(api::timestamp_type t);
} // namespace cdc
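
The comment on `enum class operation` says the numeric values will be read by third parties that do not see the declaration. As a minimal sketch of what such a reader might do, the hypothetical helper below (`operation_name` is not part of Scylla) maps the raw `int8_t` value back to a name. It assumes only the constants listed in the enum above.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

// Hypothetical consumer-side decoder for the raw operation value stored in a
// CDC log row. The numeric constants mirror `enum class operation` above.
inline std::string_view operation_name(int8_t raw) {
    switch (raw) {
    case 0: return "pre_image";
    case 1: return "update";
    case 2: return "insert";
    case 3: return "row_delete";
    case 4: return "partition_delete";
    case 5: return "range_delete_start_inclusive";
    case 6: return "range_delete_start_exclusive";
    case 7: return "range_delete_end_inclusive";
    case 8: return "range_delete_end_exclusive";
    case 9: return "post_image";
    default: return "unknown"; // value added by a newer version, or corrupt data
    }
}
```

A decoder like this is the reason the constants must stay stable: renumbering the enum would silently change what existing consumers report.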

200
cdc/metadata.cc Normal file

@@ -0,0 +1,200 @@
/*
* Copyright (C) 2019 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "dht/token-sharding.hh"
#include "utils/exceptions.hh"
#include "exceptions/exceptions.hh"
#include "cdc/generation.hh"
#include "cdc/metadata.hh"
extern logging::logger cdc_log;
namespace cdc {
extern const api::timestamp_clock::duration generation_leeway;
} // namespace cdc
static api::timestamp_type to_ts(db_clock::time_point tp) {
// This assumes that timestamp_clock and db_clock have the same epochs.
return std::chrono::duration_cast<api::timestamp_clock::duration>(tp.time_since_epoch()).count();
}
static cdc::stream_id get_stream(
const cdc::token_range_description& entry,
dht::token tok) {
// The ith stream is the stream for the ith shard.
auto shard_cnt = entry.streams.size();
auto shard_id = dht::shard_of(shard_cnt, entry.sharding_ignore_msb, tok);
if (shard_id >= shard_cnt) {
on_internal_error(cdc_log, "get_stream: shard_id out of bounds");
}
return entry.streams[shard_id];
}
static cdc::stream_id get_stream(
const std::vector<cdc::token_range_description>& entries,
dht::token tok) {
if (entries.empty()) {
on_internal_error(cdc_log, "get_stream: entries empty");
}
auto it = std::lower_bound(entries.begin(), entries.end(), tok,
[] (const cdc::token_range_description& e, dht::token t) { return e.token_range_end < t; });
if (it == entries.end()) {
it = entries.begin();
}
return get_stream(*it, tok);
}
cdc::metadata::container_t::const_iterator cdc::metadata::gen_used_at(api::timestamp_type ts) const {
auto it = _gens.upper_bound(ts);
if (it == _gens.begin()) {
// All known generations have higher timestamps than `ts`.
return _gens.end();
}
return std::prev(it);
}
cdc::stream_id cdc::metadata::get_stream(api::timestamp_type ts, dht::token tok) {
auto now = api::new_timestamp();
if (ts > now + generation_leeway.count()) {
throw exceptions::invalid_request_exception(format(
"cdc: attempted to get a stream \"from the future\" ({}; current server time: {})."
" With CDC you cannot send writes with timestamps arbitrarily into the future, because we don't"
" know what streams will be used at that time.\n"
"We *do* allow sending writes into the near future, but our ability to do that is limited."
" If you really must use your own timestamps, then make sure your clocks are well-synchronized"
" with the database's clocks.", format_timestamp(ts), format_timestamp(now)));
// Note that we might still send a write to a wrong generation, if we learn about the current
// generation too late (we might think that an earlier generation is the current one).
// Nothing protects us from that until we start using transactions for generation switching.
}
auto it = gen_used_at(now);
if (it == _gens.end()) {
throw std::runtime_error(format(
"cdc::metadata::get_stream: could not find any CDC stream (current time: {})."
" Are we in the middle of a cluster upgrade?", format_timestamp(now)));
}
// Garbage-collect generations that will no longer be used.
it = _gens.erase(_gens.begin(), it);
if (it->first > ts) {
throw exceptions::invalid_request_exception(format(
"cdc: attempted to get a stream from an earlier generation than the currently used one."
" With CDC you cannot send writes with timestamps too far into the past, because that would break"
" consistency properties (write timestamp: {}, current generation started at: {})",
format_timestamp(ts), format_timestamp(it->first)));
}
// With `generation_leeway` we allow sending writes to the near future. It might happen
// that `ts` doesn't belong to the current generation ("current" according to our clock),
// but to the next generation. Adjust for this case:
{
auto next_it = std::next(it);
while (next_it != _gens.end() && next_it->first <= ts) {
it = next_it++;
}
}
// Note: if there is a next generation that `ts` belongs to, but we don't know about it,
// then too bad. This is no different from the situation in which we didn't manage to learn
// about the current generation in time. We won't be able to prevent it until we introduce transactions.
if (!it->second) {
throw std::runtime_error(format(
"cdc: attempted to get a stream from a generation that we know about, but weren't able to retrieve"
" (generation timestamp: {}, write timestamp: {}). Make sure that the replicas which contain"
" this generation's data are alive and reachable from this node.", format_timestamp(it->first), format_timestamp(ts)));
}
auto& gen = *it->second;
auto ret = ::get_stream(gen.entries(), tok);
_last_stream_timestamp = ts;
return ret;
}
bool cdc::metadata::known_or_obsolete(db_clock::time_point tp) const {
auto ts = to_ts(tp);
auto it = _gens.lower_bound(ts);
if (it == _gens.end()) {
// No known generations with timestamp >= ts.
return false;
}
if (it->first == ts) {
if (it->second) {
// We already inserted this particular generation.
return true;
}
++it;
}
// Check if some new generation has already superseded this one.
return it != _gens.end() && it->first <= api::new_timestamp();
}
bool cdc::metadata::insert(db_clock::time_point tp, topology_description&& gen) {
if (known_or_obsolete(tp)) {
return false;
}
auto now = api::new_timestamp();
auto it = gen_used_at(now);
if (it != _gens.end()) {
// Garbage-collect generations that will no longer be used.
it = _gens.erase(_gens.begin(), it);
}
_gens.insert_or_assign(to_ts(tp), std::move(gen));
return true;
}
bool cdc::metadata::prepare(db_clock::time_point tp) {
if (known_or_obsolete(tp)) {
return false;
}
auto ts = to_ts(tp);
auto emplaced = _gens.emplace(to_ts(tp), std::nullopt).second;
if (_last_stream_timestamp != api::missing_timestamp) {
auto last_correct_gen = gen_used_at(_last_stream_timestamp);
if (emplaced && last_correct_gen != _gens.end() && last_correct_gen->first == ts) {
cdc_log.error(
"just learned about a CDC generation newer than the one used the last time"
" streams were retrieved. This generation, or some newer one, should have"
" been used instead (new generation's timestamp: {}, last time streams were retrieved: {})."
" The new generation probably arrived too late due to a network partition"
" and we've made a write using the wrong set of streams.",
format_timestamp(ts), format_timestamp(_last_stream_timestamp));
}
}
return emplaced;
}
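
The token lookup in the static `get_stream` overload above is a ring search: find the first range whose end is not less than the token, and wrap to the first range if none is found. The sketch below isolates that step with hypothetical stand-in types (`range_entry`, `stream_for_token`). It deliberately omits the per-shard stream selection done by the real code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for cdc::token_range_description: the entry covers
// tokens up to and including `range_end`.
struct range_entry {
    int64_t range_end;
    int stream; // stand-in for the stream(s) attached to this range
};

// `entries` must be non-empty and sorted by `range_end`.
inline int stream_for_token(const std::vector<range_entry>& entries, int64_t token) {
    assert(!entries.empty());
    // First entry whose range_end is >= token.
    auto it = std::lower_bound(entries.begin(), entries.end(), token,
            [] (const range_entry& e, int64_t t) { return e.range_end < t; });
    if (it == entries.end()) {
        // The token lies past the last range end, so it wraps around the ring.
        it = entries.begin();
    }
    return it->stream;
}
```

With range ends -100, 0 and 100, a token of 101 falls past the last end and is served by the first entry, which is the wrap-around case handled by `it = entries.begin()` in the original.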

92
cdc/metadata.hh Normal file

@@ -0,0 +1,92 @@
/*
* Copyright (C) 2019 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <map>
#include "db_clock.hh"
#include "timestamp.hh"
namespace dht {
class token;
}
namespace cdc {
class stream_id;
class topology_description;
/* Represents the node's knowledge about CDC generations used in the cluster.
* Used during writes to pick streams to which CDC log writes should be sent
* (i.e., to pick partition keys for these writes).
*/
class metadata final {
// Note: we use db_clock (1ms resolution) for generation timestamps
// (because we need to insert them into tables using columns of timestamp types,
// and the native type of our columns' timestamp_type is db_clock::time_point).
// On the other hand, timestamp_clock (1us resolution) is used for mutation timestamps,
// and api::timestamp_type represents the number of ticks of a timestamp_clock::time_point since epoch.
using container_t = std::map<api::timestamp_type, std::optional<topology_description>>;
container_t _gens;
/* The timestamp used in the last successful `get_stream` call. */
api::timestamp_type _last_stream_timestamp = api::missing_timestamp;
container_t::const_iterator gen_used_at(api::timestamp_type ts) const;
public:
/* Is a generation with the given timestamp already known or superseded by a newer generation? */
bool known_or_obsolete(db_clock::time_point) const;
/* Return the stream for the base partition whose token is `tok` to which a corresponding log write should go
* according to the generation used at time `ts` (i.e., the latest generation whose timestamp is less than or equal to `ts`).
*
* If the provided timestamp is too far away "into the future" (where "now" is defined according to our local clock),
* we reject the get_stream query. This is because the resulting stream might belong to a generation which we don't
* yet know about. The amount of leeway (how much "into the future" we allow `ts` to be) is defined
* by the `cdc::generation_leeway` constant.
*/
stream_id get_stream(api::timestamp_type ts, dht::token tok);
/* Insert the generation given by `gen` with timestamp `ts` to be used by the `get_stream` function,
* unless the generation is already known or has been superseded by a newer known generation.
*
* Returns true if the generation was inserted,
* meaning that `get_stream` might return a stream from this generation (at some time points).
*/
bool insert(db_clock::time_point ts, topology_description&& gen);
/* Prepare for inserting a new generation whose timestamp is `ts`.
* This method is not required to be called before `insert`, but it's here
* to increase safety of `get_stream` calls in some situations. Use it if you:
* 1. know that there is a new generation, but
* 2. you didn't yet retrieve the generation's topology_description.
*
* After preparing a generation, if `get_stream` is supposed to return a stream from this generation
* but we don't yet have the generation's data, it will reject the query to maintain consistency of streams.
*
* Returns true iff this generation is not obsolete and was neither prepared nor inserted before.
*/
bool prepare(db_clock::time_point ts);
};
} // namespace cdc
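
The phrase "the latest generation whose timestamp is less than or equal to `ts`" maps onto an ordered-map lookup. Below is a minimal sketch of that lookup using hypothetical simplified types: a `std::string` payload stands in for `topology_description`, and `std::nullopt` models a generation that was prepared but not yet retrieved.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>
#include <optional>
#include <string>

using timestamp = int64_t;
// Key: generation timestamp. Value: generation data, or nullopt if only "prepared".
using generations = std::map<timestamp, std::optional<std::string>>;

// Returns the generation in use at `ts`, or gens.end() if every known
// generation starts after `ts`.
inline generations::const_iterator generation_used_at(const generations& gens, timestamp ts) {
    auto it = gens.upper_bound(ts); // first generation with timestamp > ts
    if (it == gens.begin()) {
        return gens.end();
    }
    return std::prev(it); // last generation with timestamp <= ts
}
```

Using `upper_bound` followed by `std::prev` gives the "floor" entry in logarithmic time. A caller still has to check whether the mapped value is engaged before using the generation's data.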

463
cdc/split.cc Normal file

@@ -0,0 +1,463 @@
/*
* Copyright (C) 2020 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "mutation.hh"
#include "schema.hh"
#include "split.hh"
#include "log.hh"
struct atomic_column_update {
column_id id;
atomic_cell cell;
};
// see the comment inside `clustered_row_insert` for motivation for separating
// nonatomic deletions from nonatomic updates
struct nonatomic_column_deletion {
column_id id;
tombstone t;
};
struct nonatomic_column_update {
column_id id;
utils::chunked_vector<std::pair<bytes, atomic_cell>> cells;
};
struct static_row_update {
gc_clock::duration ttl;
std::vector<atomic_column_update> atomic_entries;
std::vector<nonatomic_column_deletion> nonatomic_deletions;
std::vector<nonatomic_column_update> nonatomic_updates;
};
struct clustered_row_insert {
gc_clock::duration ttl;
clustering_key key;
row_marker marker;
std::vector<atomic_column_update> atomic_entries;
std::vector<nonatomic_column_deletion> nonatomic_deletions;
// INSERTs can't express updates of individual cells inside a non-atomic column
// (without deleting the entire field first), so there is no `nonatomic_updates` field.
// Overwriting a nonatomic column inside an INSERT will be split into two changes:
// one with a nonatomic deletion, and one with a nonatomic update.
};
struct clustered_row_update {
gc_clock::duration ttl;
clustering_key key;
std::vector<atomic_column_update> atomic_entries;
std::vector<nonatomic_column_deletion> nonatomic_deletions;
std::vector<nonatomic_column_update> nonatomic_updates;
};
struct clustered_row_deletion {
clustering_key key;
tombstone t;
};
struct clustered_range_deletion {
range_tombstone rt;
};
struct partition_deletion {
tombstone t;
};
struct batch {
std::vector<static_row_update> static_updates;
std::vector<clustered_row_insert> clustered_inserts;
std::vector<clustered_row_update> clustered_updates;
std::vector<clustered_row_deletion> clustered_row_deletions;
std::vector<clustered_range_deletion> clustered_range_deletions;
std::optional<partition_deletion> partition_deletions;
};
using set_of_changes = std::map<api::timestamp_type, batch>;
struct row_update {
std::vector<atomic_column_update> atomic_entries;
std::vector<nonatomic_column_deletion> nonatomic_deletions;
std::vector<nonatomic_column_update> nonatomic_updates;
};
static
std::map<std::pair<api::timestamp_type, gc_clock::duration>, row_update>
extract_row_updates(const row& r, column_kind ckind, const schema& schema) {
std::map<std::pair<api::timestamp_type, gc_clock::duration>, row_update> result;
r.for_each_cell([&] (column_id id, const atomic_cell_or_collection& cell) {
auto& cdef = schema.column_at(ckind, id);
if (cdef.is_atomic()) {
auto view = cell.as_atomic_cell(cdef);
auto timestamp_and_ttl = std::pair(
view.timestamp(),
view.is_live_and_has_ttl() ? view.ttl() : gc_clock::duration(0)
);
result[timestamp_and_ttl].atomic_entries.push_back({id, atomic_cell(*cdef.type, view)});
return;
}
cell.as_collection_mutation().with_deserialized(*cdef.type, [&] (collection_mutation_view_description mview) {
auto desc = mview.materialize(*cdef.type);
for (auto& [k, v]: desc.cells) {
auto timestamp_and_ttl = std::pair(
v.timestamp(),
v.is_live_and_has_ttl() ? v.ttl() : gc_clock::duration(0)
);
auto& updates = result[timestamp_and_ttl].nonatomic_updates;
if (updates.empty() || updates.back().id != id) {
updates.push_back({id, {}});
}
updates.back().cells.push_back({std::move(k), std::move(v)});
}
if (desc.tomb) {
auto timestamp_and_ttl = std::pair(desc.tomb.timestamp, gc_clock::duration(0));
result[timestamp_and_ttl].nonatomic_deletions.push_back({id, desc.tomb});
}
});
});
return result;
}
set_of_changes extract_changes(const mutation& base_mutation, const schema& base_schema) {
set_of_changes res;
auto& p = base_mutation.partition();
auto sr_updates = extract_row_updates(p.static_row().get(), column_kind::static_column, base_schema);
for (auto& [k, up]: sr_updates) {
auto [timestamp, ttl] = k;
res[timestamp].static_updates.push_back({
ttl,
std::move(up.atomic_entries),
std::move(up.nonatomic_deletions),
std::move(up.nonatomic_updates)
});
}
for (const rows_entry& cr : p.clustered_rows()) {
auto cr_updates = extract_row_updates(cr.row().cells(), column_kind::regular_column, base_schema);
const auto& marker = cr.row().marker();
auto marker_timestamp = marker.timestamp();
auto marker_ttl = marker.is_expiring() ? marker.ttl() : gc_clock::duration(0);
if (marker.is_live()) {
// make sure that an entry corresponding to the row marker's timestamp and ttl is in the map
(void)cr_updates[std::pair(marker_timestamp, marker_ttl)];
}
auto is_insert = [&] (api::timestamp_type timestamp, gc_clock::duration ttl) {
if (!marker.is_live()) {
return false;
}
return timestamp == marker_timestamp && ttl == marker_ttl;
};
for (auto& [k, up]: cr_updates) {
auto [timestamp, ttl] = k;
if (is_insert(timestamp, ttl)) {
res[timestamp].clustered_inserts.push_back({
ttl,
cr.key(),
marker,
std::move(up.atomic_entries),
std::move(up.nonatomic_deletions)
});
if (!up.nonatomic_updates.empty()) {
// nonatomic updates cannot be expressed with an INSERT.
res[timestamp].clustered_updates.push_back({
ttl,
cr.key(),
{},
{},
std::move(up.nonatomic_updates)
});
}
} else {
res[timestamp].clustered_updates.push_back({
ttl,
cr.key(),
std::move(up.atomic_entries),
std::move(up.nonatomic_deletions),
std::move(up.nonatomic_updates)
});
}
}
auto row_tomb = cr.row().deleted_at().regular();
if (row_tomb) {
res[row_tomb.timestamp].clustered_row_deletions.push_back({cr.key(), row_tomb});
}
}
for (const auto& rt: p.row_tombstones()) {
if (rt.tomb.timestamp != api::missing_timestamp) {
res[rt.tomb.timestamp].clustered_range_deletions.push_back({rt});
}
}
auto partition_tomb_timestamp = p.partition_tombstone().timestamp;
if (partition_tomb_timestamp != api::missing_timestamp) {
res[partition_tomb_timestamp].partition_deletions = {p.partition_tombstone()};
}
return res;
}
namespace cdc {
bool should_split(const mutation& base_mutation, const schema& base_schema) {
auto& p = base_mutation.partition();
api::timestamp_type found_ts = api::missing_timestamp;
std::optional<gc_clock::duration> found_ttl; // 0 = "no ttl"
auto check_or_set = [&] (api::timestamp_type ts, gc_clock::duration ttl) {
if (found_ts != api::missing_timestamp && found_ts != ts) {
return true;
}
found_ts = ts;
if (found_ttl && *found_ttl != ttl) {
return true;
}
found_ttl = ttl;
return false;
};
bool had_static_row = false;
bool should_split = false;
p.static_row().get().for_each_cell([&] (column_id id, const atomic_cell_or_collection& cell) {
had_static_row = true;
auto& cdef = base_schema.column_at(column_kind::static_column, id);
if (cdef.is_atomic()) {
auto view = cell.as_atomic_cell(cdef);
if (check_or_set(view.timestamp(), view.is_live_and_has_ttl() ? view.ttl() : gc_clock::duration(0))) {
should_split = true;
}
return;
}
cell.as_collection_mutation().with_deserialized(*cdef.type, [&] (collection_mutation_view_description mview) {
auto desc = mview.materialize(*cdef.type);
for (auto& [k, v]: desc.cells) {
if (check_or_set(v.timestamp(), v.is_live_and_has_ttl() ? v.ttl() : gc_clock::duration(0))) {
should_split = true;
return;
}
}
if (desc.tomb) {
if (check_or_set(desc.tomb.timestamp, gc_clock::duration(0))) {
should_split = true;
return;
}
}
});
});
if (should_split) {
return true;
}
bool had_clustered_row = false;
if (!p.clustered_rows().empty() && had_static_row) {
return true;
}
for (const rows_entry& cr : p.clustered_rows()) {
had_clustered_row = true;
const auto& marker = cr.row().marker();
if (marker.is_live() && check_or_set(marker.timestamp(), marker.is_expiring() ? marker.ttl() : gc_clock::duration(0))) {
return true;
}
bool is_insert = marker.is_live();
bool had_cells = false;
cr.row().cells().for_each_cell([&] (column_id id, const atomic_cell_or_collection& cell) {
had_cells = true;
auto& cdef = base_schema.column_at(column_kind::regular_column, id);
if (cdef.is_atomic()) {
auto view = cell.as_atomic_cell(cdef);
if (check_or_set(view.timestamp(), view.is_live_and_has_ttl() ? view.ttl() : gc_clock::duration(0))) {
should_split = true;
}
return;
}
cell.as_collection_mutation().with_deserialized(*cdef.type, [&] (collection_mutation_view_description mview) {
for (auto& [k, v]: mview.cells) {
if (check_or_set(v.timestamp(), v.is_live_and_has_ttl() ? v.ttl() : gc_clock::duration(0))) {
should_split = true;
return;
}
if (is_insert) {
// nonatomic updates cannot be expressed with an INSERT.
should_split = true;
return;
}
}
if (mview.tomb) {
if (check_or_set(mview.tomb.timestamp, gc_clock::duration(0))) {
should_split = true;
return;
}
}
});
});
if (should_split) {
return true;
}
auto row_tomb = cr.row().deleted_at().regular();
if (row_tomb) {
if (had_cells) {
return true;
}
// there were no cells, so no ttl
assert(!found_ttl);
if (found_ts != api::missing_timestamp && found_ts != row_tomb.timestamp) {
return true;
}
found_ts = row_tomb.timestamp;
}
}
if (!p.row_tombstones().empty() && (had_static_row || had_clustered_row)) {
return true;
}
for (const auto& rt: p.row_tombstones()) {
if (rt.tomb) {
if (found_ts != api::missing_timestamp && found_ts != rt.tomb.timestamp) {
return true;
}
found_ts = rt.tomb.timestamp;
}
}
if (p.partition_tombstone().timestamp != api::missing_timestamp
&& (!p.row_tombstones().empty() || had_static_row || had_clustered_row)) {
return true;
}
// A mutation with no timestamp will be split into 0 mutations
return found_ts == api::missing_timestamp;
}
void for_each_change(const mutation& base_mutation, const schema_ptr& base_schema,
seastar::noncopyable_function<void(mutation, api::timestamp_type, bytes, int&)> f) {
auto changes = extract_changes(base_mutation, *base_schema);
auto pk = base_mutation.key();
for (auto& [change_ts, btch] : changes) {
auto tuuid = timeuuid_type->decompose(generate_timeuuid(change_ts));
int batch_no = 0;
for (auto& sr_update : btch.static_updates) {
mutation m(base_schema, pk);
for (auto& atomic_update : sr_update.atomic_entries) {
auto& cdef = base_schema->column_at(column_kind::static_column, atomic_update.id);
m.set_static_cell(cdef, std::move(atomic_update.cell));
}
for (auto& nonatomic_delete : sr_update.nonatomic_deletions) {
auto& cdef = base_schema->column_at(column_kind::static_column, nonatomic_delete.id);
m.set_static_cell(cdef, collection_mutation_description{nonatomic_delete.t, {}}.serialize(*cdef.type));
}
for (auto& nonatomic_update : sr_update.nonatomic_updates) {
auto& cdef = base_schema->column_at(column_kind::static_column, nonatomic_update.id);
m.set_static_cell(cdef, collection_mutation_description{{}, std::move(nonatomic_update.cells)}.serialize(*cdef.type));
}
f(std::move(m), change_ts, tuuid, batch_no);
}
for (auto& cr_insert : btch.clustered_inserts) {
mutation m(base_schema, pk);
auto& row = m.partition().clustered_row(*base_schema, cr_insert.key);
for (auto& atomic_update : cr_insert.atomic_entries) {
auto& cdef = base_schema->column_at(column_kind::regular_column, atomic_update.id);
row.cells().apply(cdef, std::move(atomic_update.cell));
}
for (auto& nonatomic_delete : cr_insert.nonatomic_deletions) {
auto& cdef = base_schema->column_at(column_kind::regular_column, nonatomic_delete.id);
row.cells().apply(cdef, collection_mutation_description{nonatomic_delete.t, {}}.serialize(*cdef.type));
}
row.apply(cr_insert.marker);
f(std::move(m), change_ts, tuuid, batch_no);
}
for (auto& cr_update : btch.clustered_updates) {
mutation m(base_schema, pk);
auto& row = m.partition().clustered_row(*base_schema, cr_update.key).cells();
for (auto& atomic_update : cr_update.atomic_entries) {
auto& cdef = base_schema->column_at(column_kind::regular_column, atomic_update.id);
row.apply(cdef, std::move(atomic_update.cell));
}
for (auto& nonatomic_delete : cr_update.nonatomic_deletions) {
auto& cdef = base_schema->column_at(column_kind::regular_column, nonatomic_delete.id);
row.apply(cdef, collection_mutation_description{nonatomic_delete.t, {}}.serialize(*cdef.type));
}
for (auto& nonatomic_update : cr_update.nonatomic_updates) {
auto& cdef = base_schema->column_at(column_kind::regular_column, nonatomic_update.id);
row.apply(cdef, collection_mutation_description{{}, std::move(nonatomic_update.cells)}.serialize(*cdef.type));
}
f(std::move(m), change_ts, tuuid, batch_no);
}
for (auto& cr_delete : btch.clustered_row_deletions) {
mutation m(base_schema, pk);
m.partition().apply_delete(*base_schema, cr_delete.key, cr_delete.t);
f(std::move(m), change_ts, tuuid, batch_no);
}
for (auto& crange_delete : btch.clustered_range_deletions) {
mutation m(base_schema, pk);
m.partition().apply_delete(*base_schema, crange_delete.rt);
f(std::move(m), change_ts, tuuid, batch_no);
}
if (btch.partition_deletions) {
mutation m(base_schema, pk);
m.partition().apply(btch.partition_deletions->t);
f(std::move(m), change_ts, tuuid, batch_no);
}
}
}
} // namespace cdc
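
The core of `extract_row_updates` is a grouping step: every cell is filed under its `(timestamp, ttl)` pair, so that each group can later become one log change. The sketch below shows only that grouping, with a hypothetical flattened `cell` type in place of Scylla's `atomic_cell` and collection types.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified cell: column name, write timestamp, TTL (0 = no TTL).
struct cell {
    std::string column;
    int64_t timestamp;
    int64_t ttl_seconds;
};

using group_key = std::pair<int64_t, int64_t>; // (timestamp, ttl)

// Groups column names by (timestamp, ttl). std::map keeps groups ordered by
// timestamp first, then by TTL.
inline std::map<group_key, std::vector<std::string>> group_cells(const std::vector<cell>& cells) {
    std::map<group_key, std::vector<std::string>> result;
    for (const auto& c : cells) {
        result[group_key{c.timestamp, c.ttl_seconds}].push_back(c.column);
    }
    return result;
}
```

Because the map is ordered, iterating the result visits changes in timestamp order, which matches how `for_each_change` walks `set_of_changes` above.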

38
cdc/split.hh Normal file

@@ -0,0 +1,38 @@
/*
* Copyright (C) 2020 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <vector>
#include "schema_fwd.hh"
#include "timestamp.hh"
#include "bytes.hh"
#include <seastar/util/noncopyable_function.hh>
class mutation;
namespace cdc {
bool should_split(const mutation& base_mutation, const schema& base_schema);
void for_each_change(const mutation& base_mutation, const schema_ptr& base_schema,
seastar::noncopyable_function<void(mutation, api::timestamp_type, bytes, int&)>);
}
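
`should_split` answers a simpler question than `for_each_change`: does the mutation contain more than one distinct `(timestamp, ttl)` pair, or none at all? The sketch below reduces the `check_or_set` idea from `cdc/split.cc` to a flat list of pairs. The name `needs_split` and the flat input are assumptions made for illustration; the real function also forces a split when different kinds of changes (static row, clustered rows, tombstones) are mixed.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

using ts_and_ttl = std::pair<int64_t, int64_t>; // (timestamp, ttl; 0 = no TTL)

// True if the writes do not all share one (timestamp, ttl) pair.
// An empty input also returns true, mirroring the original's
// "a mutation with no timestamp will be split into 0 mutations".
inline bool needs_split(const std::vector<ts_and_ttl>& writes) {
    std::optional<ts_and_ttl> seen;
    for (const auto& w : writes) {
        if (seen && *seen != w) {
            return true; // a second distinct pair was found
        }
        seen = w;
    }
    return !seen.has_value();
}
```

Returning early on the first mismatch keeps the common single-timestamp case cheap, since no grouping structure has to be built.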

120
cdc/stats.hh Normal file

@@ -0,0 +1,120 @@
/*
* Copyright (C) 2020 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <array>
#include <cstdint>
#include <string>
#include <seastar/core/metrics_registration.hh>
#include "enum_set.hh"
#include "utils/histogram.hh"
#include "utils/estimated_histogram.hh"
namespace cdc {
class stats final {
seastar::metrics::metric_groups _metrics;
public:
enum class part_type {
STATIC_ROW,
CLUSTERING_ROW,
MAP,
SET,
LIST,
UDT,
RANGE_TOMBSTONE,
PARTITION_DELETE,
ROW_DELETE,
MAX
};
using part_type_set = enum_set<super_enum<part_type,
part_type::STATIC_ROW,
part_type::CLUSTERING_ROW,
part_type::MAP,
part_type::SET,
part_type::LIST,
part_type::UDT,
part_type::RANGE_TOMBSTONE,
part_type::PARTITION_DELETE,
part_type::ROW_DELETE
>>;
struct parts_touched_stats final {
std::array<uint64_t, (size_t)part_type::MAX> count = {};
inline void apply(part_type_set parts_set) {
for (part_type idx : parts_set) {
count[(size_t)idx]++;
}
}
void register_metrics(seastar::metrics::metric_groups& metrics, std::string_view suffix);
};
struct counters final {
uint64_t unsplit_count = 0;
uint64_t split_count = 0;
uint64_t preimage_selects = 0;
uint64_t with_preimage_count = 0;
uint64_t with_postimage_count = 0;
parts_touched_stats touches;
};
counters counters_total;
counters counters_failed;
stats();
};
// Contains the details on what happened during a CDC operation.
struct operation_details final {
stats::part_type_set touched_parts;
bool was_split = false;
bool had_preimage = false;
bool had_postimage = false;
};
// This object tracks the lifetime of write handlers related to one CDC operation. After all
// write handlers for the operation finish, CDC metrics are updated.
class operation_result_tracker final {
stats& _stats;
operation_details _details;
bool _failed;
public:
operation_result_tracker(stats& stats, operation_details details)
: _stats(stats)
, _details(details)
, _failed(false)
{}
~operation_result_tracker();
void on_mutation_failed() {
_failed = true;
}
};
}
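
The comment on `operation_result_tracker` says that metrics are updated once all write handlers for an operation finish, but the destructor body is not part of this header. The sketch below illustrates one way such a destructor-driven update could work. All `*_sketch` names are hypothetical simplified stand-ins, and the update rule (always bump the totals, additionally bump the failure counters when `on_mutation_failed()` was called) is an assumption, not the code from this commit.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-in for cdc::stats::counters.
struct counters_sketch {
    uint64_t unsplit_count = 0;
    uint64_t split_count = 0;
    uint64_t with_preimage_count = 0;
    uint64_t with_postimage_count = 0;
};

// Hypothetical stand-in for cdc::stats (metric registration omitted).
struct stats_sketch {
    counters_sketch counters_total;
    counters_sketch counters_failed;
};

// Hypothetical stand-in for cdc::operation_details (touched_parts omitted).
struct details_sketch {
    bool was_split = false;
    bool had_preimage = false;
    bool had_postimage = false;
};

class tracker_sketch {
    stats_sketch& _stats;
    details_sketch _details;
    bool _failed = false;

    // Folds the details of this one operation into a set of counters.
    void update(counters_sketch& c) const {
        if (_details.was_split) {
            ++c.split_count;
        } else {
            ++c.unsplit_count;
        }
        if (_details.had_preimage) {
            ++c.with_preimage_count;
        }
        if (_details.had_postimage) {
            ++c.with_postimage_count;
        }
    }
public:
    tracker_sketch(stats_sketch& s, details_sketch d) : _stats(s), _details(d) {}
    tracker_sketch(const tracker_sketch&) = delete;
    tracker_sketch& operator=(const tracker_sketch&) = delete;

    // Assumed behaviour: counters are updated exactly once, when the last
    // owner lets go of the tracker and it is destroyed.
    ~tracker_sketch() {
        update(_stats.counters_total);
        if (_failed) {
            update(_stats.counters_failed);
        }
    }

    void on_mutation_failed() { _failed = true; }
};

// Runs one successful unsplit operation and one failed split operation.
inline stats_sketch run_example() {
    stats_sketch s;
    {
        tracker_sketch ok(s, details_sketch{false, true, false});
    }
    {
        tracker_sketch bad(s, details_sketch{true, false, true});
        bad.on_mutation_failed();
    }
    return s;
}
```

In a real deployment the tracker would presumably be shared between several write handlers (for example through a shared pointer), so that the destructor runs only after the last handler completes; that sharing is left out of the sketch.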


@@ -22,7 +22,10 @@
#pragma once
#include "seastar/core/file.hh"
#include "disk-error-handler.hh"
#include "seastar/core/reactor.hh"
#include "utils/disk-error-handler.hh"
#include "seastarx.hh"
class checked_file_impl : public file_impl {
public:


@@ -19,6 +19,23 @@
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include <seastar/core/print.hh>
#include "db_clock.hh"
#include "timestamp.hh"
#include "clocks-impl.hh"
std::atomic<int64_t> clocks_offset;
std::ostream& operator<<(std::ostream& os, db_clock::time_point tp) {
auto t = db_clock::to_time_t(tp);
::tm t_buf;
return os << std::put_time(::gmtime_r(&t, &t_buf), "%Y/%m/%d %T");
}
std::string format_timestamp(api::timestamp_type ts) {
auto t = std::time_t(std::chrono::duration_cast<std::chrono::seconds>(api::timestamp_clock::duration(ts)).count());
::tm t_buf;
return format("{}", std::put_time(::gmtime_r(&t, &t_buf), "%Y/%m/%d %T"));
}
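
As a rough illustration of what `format_timestamp` does, the standalone sketch below converts a timestamp to whole seconds and prints it as a UTC date using the same `"%Y/%m/%d %T"` pattern. It assumes that the timestamp counts microseconds since the Unix epoch and that POSIX `gmtime_r` is available. `format_timestamp_sketch` is a hypothetical name, and it writes to a `std::ostringstream` rather than calling the `format()` helper used in the hunk above.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <ctime>
#include <iomanip>
#include <sstream>
#include <string>
#include <time.h>

// Assumption: ts_us is a count of microseconds since 1970-01-01 00:00:00 UTC.
inline std::string format_timestamp_sketch(int64_t ts_us) {
    // Truncate microseconds to whole seconds.
    auto secs = std::chrono::duration_cast<std::chrono::seconds>(
            std::chrono::microseconds(ts_us));
    time_t t = static_cast<time_t>(secs.count());

    // gmtime_r is the reentrant POSIX variant; it fills the caller's buffer.
    struct tm buf = {};
    ::gmtime_r(&t, &buf);

    std::ostringstream os;
    os << std::put_time(&buf, "%Y/%m/%d %T");
    return os.str();
}
```

For instance, under these assumptions `format_timestamp_sketch(0)` yields `1970/01/01 00:00:00`, and sub-second precision is dropped rather than rounded.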


@@ -24,7 +24,7 @@
#include <functional>
#include "keys.hh"
#include "schema.hh"
#include "schema_fwd.hh"
#include "range.hh"
/**

clustering_interval_set.hh Normal file

@@ -0,0 +1,134 @@
/*
* Copyright (C) 2020 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include "schema_fwd.hh"
#include "position_in_partition.hh"
#include <boost/icl/interval_set.hpp>
// Represents a non-contiguous subset of clustering_key domain of a particular schema.
// Can be treated like an ordered and non-overlapping sequence of position_range:s.
class clustering_interval_set {
// Needed to make position_in_partition comparable, required by boost::icl::interval_set.
class position_in_partition_with_schema {
schema_ptr _schema;
position_in_partition _pos;
public:
position_in_partition_with_schema()
: _pos(position_in_partition::for_static_row())
{ }
position_in_partition_with_schema(schema_ptr s, position_in_partition pos)
: _schema(std::move(s))
, _pos(std::move(pos))
{ }
bool operator<(const position_in_partition_with_schema& other) const {
return position_in_partition::less_compare(*_schema)(_pos, other._pos);
}
bool operator==(const position_in_partition_with_schema& other) const {
return position_in_partition::equal_compare(*_schema)(_pos, other._pos);
}
const position_in_partition& position() const { return _pos; }
};
private:
// We want to represent intervals of clustering keys, not position_in_partitions,
// but clustering_key domain is not enough to represent all kinds of clustering ranges.
// All intervals in this set are of the form [x, y).
using set_type = boost::icl::interval_set<position_in_partition_with_schema>;
using interval = boost::icl::interval<position_in_partition_with_schema>;
set_type _set;
public:
clustering_interval_set() = default;
// Constructs from legacy clustering_row_ranges
clustering_interval_set(const schema& s, const query::clustering_row_ranges& ranges) {
for (auto&& r : ranges) {
add(s, position_range::from_range(r));
}
}
query::clustering_row_ranges to_clustering_row_ranges() const {
query::clustering_row_ranges result;
for (position_range r : *this) {
result.push_back(query::clustering_range::make(
{r.start().key(), r.start()._bound_weight != bound_weight::after_all_prefixed},
{r.end().key(), r.end()._bound_weight == bound_weight::after_all_prefixed}));
}
return result;
}
class position_range_iterator : public std::iterator<std::input_iterator_tag, const position_range> {
set_type::iterator _i;
public:
position_range_iterator(set_type::iterator i) : _i(i) {}
position_range operator*() const {
// FIXME: Produce position_range view. Not performance critical yet.
const interval::interval_type& iv = *_i;
return position_range{iv.lower().position(), iv.upper().position()};
}
bool operator==(const position_range_iterator& other) const { return _i == other._i; }
bool operator!=(const position_range_iterator& other) const { return _i != other._i; }
position_range_iterator& operator++() {
++_i;
return *this;
}
position_range_iterator operator++(int) {
auto tmp = *this;
++_i;
return tmp;
}
};
static interval::type make_interval(const schema& s, const position_range& r) {
assert(r.start().has_clustering_key());
assert(r.end().has_clustering_key());
return interval::right_open(
position_in_partition_with_schema(s.shared_from_this(), r.start()),
position_in_partition_with_schema(s.shared_from_this(), r.end()));
}
public:
bool equals(const schema& s, const clustering_interval_set& other) const {
return boost::equal(_set, other._set);
}
bool contains(const schema& s, position_in_partition_view pos) const {
// FIXME: Avoid copy
return _set.find(position_in_partition_with_schema(s.shared_from_this(), position_in_partition(pos))) != _set.end();
}
// Returns true iff this set is fully contained in the other set.
bool contained_in(clustering_interval_set& other) const {
return boost::icl::within(_set, other._set);
}
bool overlaps(const schema& s, const position_range& range) const {
// FIXME: Avoid copy
auto r = _set.equal_range(make_interval(s, range));
return r.first != r.second;
}
// Adds given clustering range to this interval set.
// The range may overlap with this set.
void add(const schema& s, const position_range& r) {
_set += make_interval(s, r);
}
void add(const schema& s, const clustering_interval_set& other) {
for (auto&& r : other) {
add(s, r);
}
}
position_range_iterator begin() const { return {_set.begin()}; }
position_range_iterator end() const { return {_set.end()}; }
friend std::ostream& operator<<(std::ostream&, const clustering_interval_set&);
};
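
The class comments describe the set as an ordered, non-overlapping sequence of half-open intervals `[x, y)`. The sketch below illustrates that invariant over plain `int` positions with a `std::map`, as a stand-in for `boost::icl::interval_set`. It is not the implementation from this commit: `interval_set_sketch` and its members are hypothetical names, and merging intervals that merely touch is a choice made for this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>

// Hypothetical sketch: a set of disjoint half-open intervals [start, end).
class interval_set_sketch {
    std::map<int, int> _ivs; // start -> end, kept sorted and non-overlapping
public:
    // Adds [lo, hi); merges with any stored interval it overlaps or touches.
    void add(int lo, int hi) {
        if (lo >= hi) {
            return; // empty interval, nothing to add
        }
        auto it = _ivs.lower_bound(lo);
        if (it != _ivs.begin()) {
            auto prev = std::prev(it);
            if (prev->second >= lo) {
                it = prev; // predecessor reaches into (or touches) [lo, hi)
            }
        }
        while (it != _ivs.end() && it->first <= hi) {
            lo = std::min(lo, it->first);
            hi = std::max(hi, it->second);
            it = _ivs.erase(it);
        }
        _ivs.emplace(lo, hi);
    }

    // True iff x lies inside some stored interval; the upper bound is excluded.
    bool contains(int x) const {
        auto it = _ivs.upper_bound(x);
        if (it == _ivs.begin()) {
            return false;
        }
        --it; // last interval starting at or before x
        return x < it->second;
    }

    // True iff [lo, hi) shares at least one position with the set.
    bool overlaps(int lo, int hi) const {
        if (lo >= hi) {
            return false;
        }
        auto it = _ivs.lower_bound(hi);
        if (it == _ivs.begin()) {
            return false;
        }
        --it; // last interval starting before hi
        return it->second > lo;
    }

    std::size_t size() const { return _ivs.size(); }
};
```

Adding `[1, 3)` and `[5, 8)` and then `[2, 6)` leaves a single interval `[1, 8)` in this sketch. The real class needs the `position_in_partition_with_schema` wrapper because its positions are only comparable with the help of a schema, which plain integers sidestep.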


@@ -23,7 +23,7 @@
#pragma once
#include "schema.hh"
#include "schema_fwd.hh"
#include "query-request.hh"
namespace query {

Some files were not shown because too many files have changed in this diff