Commit Graph

2200 Commits

Avi Kivity
1fd701e709 test: cql-pytest: skip tests depending on timeuuid monotonicity
timeuuid is not monotonic when now() is called on different connections,
so when running tests that depend on that property, we get failures if
using the Scylla driver (which became standard in 729d0fe).

Skip the tests for now, until we figure out what to do. We probably
can't make now() globally monotonic, and there isn't much to gain
by making it monotonic only per connection, since clients are allowed
to switch connections (and even nodes) at will.

Ref #9300

Closes #9323

[avi: committing my own patch to unblock master]
2021-09-12 19:30:40 +03:00
Nadav Har'El
1d4474d543 test/alternator/run: don't run Scylla if "--aws" option
The test/alternator/run script runs Scylla and then runs pytest against
it. But when passing the "--aws" option, the intention is that these
tests be run against AWS DynamoDB, not a local Scylla, so there is no
point in starting Scylla at all - and this is what this patch does.

This doesn't really add a new feature - "test/alternator/run --aws"
will now be nothing more than "cd test/alternator; pytest --aws".
But it adds the convenience that you can run the same tests on Scylla
and AWS with exactly the same "run" command, just adding the "--aws"
option, and don't need to sometimes use "run" and sometimes "pytest".

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210912133239.75463-1-nyh@scylladb.com>
2021-09-12 16:50:38 +03:00
Avi Kivity
c5f52f9d97 schema_tables: don't flush in tests
Flushing schema tables is important for crash recovery (without a flush,
we might have sstables using a new schema before the commitlog entry
noting the schema change has been replayed), but not important for tests
that do not test crash recovery. Avoiding those flushes reduces system,
user, and real time on tests running on a consumer-level SSD.

before:
real	8m51.347s
user	7m5.743s
sys	5m11.185s

after:
real	7m4.249s
user	5m14.085s
sys	2m11.197s

Note real time is higher than user+sys time divided by the number
of hardware threads, indicating that there is still idle time due
to the disk flushing, so more work is needed.

Closes #9319
2021-09-12 11:32:13 +03:00
Raphael S. Carvalho
acba3bd3c4 sstables: give a more descriptive name to compaction_options
The name compaction_options is confusing, as it overlaps in meaning
with compaction_descriptor. It is hard to tell the exact difference
between them without digging into the implementation.

compaction_options is intended to only carry options specific to
a given compaction type, like a mode for scrub, so let's rename
it to compaction_type_options to make it clearer for the
readers.

[avi: adjust for scrub changes]
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210908003934.152054-1-raphaelsc@scylladb.com>
2021-09-12 11:21:33 +03:00
Tomasz Grabiec
83113d8661 Merge "raft: new schema for storing raft snapshots" from Pavel Solodovnikov
Previously, the layout for storing raft snapshot
descriptors contained a `config` field, which had `blob`
data type.

That means `raft::configuration` for the snapshot was serialized
as a whole in binary form. It's convenient to implement and
is the most compact form of representing the data, but:

1. Hard to debug due to the need to de-serialize the data.
2. Plants a time bomb w.r.t. future changes to the data layout and
   to the documentation.

Remove the `config` field from `system.raft_snapshots` and
extract it to a separate `system.raft_config` table to store
the data in exploded form.

Also, modify the schema of `system.raft_snapshots` table in
the following way: add a `server_id` field as a part of
composite partition key ((group_id, server_id)) to
be able to start multiple raft servers belonging to one raft
group on the same scylla node.

Rename `id` field in `raft_snapshots` to `snapshot_id` so
it's self-documenting.

Remove `snapshot_id` from the clustering key since a given server
can have only one snapshot installed at a time.

Note that the `raft::server_address` structure contains an opaque
`info` member, which is `bytes`, but in the `raft_config` table
we use an `ip_addr inet` field instead. We always know that the
corresponding member field is going to contain an IP address (either v4
or v6) of a given raft server.

So, now the snapshots schema looks like this:

    CREATE TABLE raft_snapshots (
        group_id timeuuid,
        server_id uuid,
        snapshot_id uuid,
        idx int,
        term int,
        -- no `config` field here, moved to `raft_config` table
        PRIMARY KEY ((group_id, server_id))
    )

    CREATE TABLE raft_config (
         group_id timeuuid,
         my_server_id uuid,
         server_id uuid,
         disposition text, -- can be either 'CURRENT' or 'PREVIOUS'
         can_vote bool,
         ip_addr inet,
         PRIMARY KEY ((group_id, my_server_id), server_id, disposition)
    );

This way it's much easier to extend the schema with new fields,
very easy to debug and inspect via CQL, and it's much more descriptive
in terms of self-documentation.

Tests: unit(dev)

* manmanson/raft_snapshots_new_schema_v2:
  test: adjust `schema_change_test` to include new `system.raft_config` table
  raft: new schema for storing raft snapshots
  raft: pass server id to `raft_sys_table_storage` instance
2021-09-10 20:41:59 +02:00
Avi Kivity
7a798b44a2 cql3: expr: replace column_value_tuple by a composition of tuple_constructor and column_value
column_value_tuple overlaps both column_value and tuple_constructor
(in different respects) and can be replaced by a combination: a
tuple_constructor of column_value. The replacement is more expressive
(we can have a tuple of column_value and other expression types), though
the code (especially the grammar) does not allow it yet.

So remove column_value_tuple and replace it everywhere with
tuple_constructor. Visitors get the merged behavior of the existing
tuple_constructor and column_value_tuple, which is usually trivial
since tuple_constructor and column_value_tuple came from different
hierarchies (term::raw and relation), so usually one of the types
just calls on_internal_error().

The change results in awkward casts in two areas: WHERE clause
filtering (equal() and related), and clustering key range evaluation
(limits() and related). When equal() is replaced by recursive
evaluate(), the casts will go away (to be replaced by the evaluate()
visitor). Clustering key range extraction will remain limited
to tuples of column_value, so the prepare phase will have to vet
the expressions to ensure the casts don't fail (and use the
filtering path if they will).

Tests: unit (dev)

Closes #9274
2021-09-10 10:43:29 +02:00
Avi Kivity
219fdcd8da Merge 'tools: introduce scylla-sstable' from Botond Dénes
A tool which can be used to examine the content of sstable(s) and
execute various operations on them. The currently supported operations
are:
* dump - dumps the content of the sstable(s), similar to sstabledump;
* dump-index - dumps the content of the sstable index(es), similar to scylla-sstable-index;
* writetime-histogram - generates a histogram of all the timestamps in
  the sstable(s);
* custom - a hackable operation for the expert user (until scripting
  support is implemented);
* validate - validate the content of the sstable(s) with the mutation
  fragment stream validator, same as scrub in validate mode;

The sstables to-be-examined are passed as positional command line
arguments. Sstables will be processed by the selected operation
one-by-one (can be changed with `--merge`). Any number of sstables can
be passed but mind the open file limits. Pass the full path to the data
component of the sstables (*-Data.db). For now it is required that the
sstable is found at a valid data path:

    /path/to/datadir/{keyspace_name}/{table_name}-{table_id}/

The schema to read the sstables is read from a `schema.cql` file. This
should contain the keyspace and table definitions, as well as any UDTs
used.
Filtering the sstable(s) to process only certain partition(s) is supported
via the `--partition` and `--partitions-file` command line flags.
Partition keys are expected to be in the hexdump format used by scylla
(hex representation of the raw buffer).
Operations write their output to stdout, or file(s). The tool logs to
stderr, with a logger called `scylla-sstable-crawler`.

Examples:

    # dump the content of the sstable
    $ scylla-sstable-crawler --dump /path/to/md-123456-big-Data.db

    # dump the content of the two sstable(s) as a unified stream
    $ scylla-sstable-crawler --dump --merge /path/to/md-123456-big-Data.db /path/to/md-123457-big-Data.db

    # generate a joint histogram for the specified partition
    $ scylla-sstable-crawler --writetime-histogram --partition={{myhexpartitionkey}} /path/to/md-123456-big-Data.db

    # validate the specified sstables
    $ scylla-sstable-crawler --validate /path/to/md-123456-big-Data.db /path/to/md-123457-big-Data.db

Future plans:
* JSON output for dump.
* A simple way of generating `schema.cql` for any schema, other than copying it from snapshots, or copying from `cqlsh`. None of these generate a complete output.
* Relax sstable path checks, so sstables can be loaded from any path.
* Add scripting support (Lua), allowing custom operations to be written
  in a scripting language.

Refs: #9241

Closes #9271

* github.com:scylladb/scylla:
  tools: remove scylla-sstable-index
  tools: introduce scylla-sstable
  tools: extract finding selected operation (handler) into function
  tools: add schema_loader
  cql3: query_processor: add parse_statements()
  cql3: statements/create_type: expose create_type()
  cql3: statements/create_keyspace: add get_keyspace_metadata()
2021-09-09 19:24:06 +03:00
Avi Kivity
c1028de22a Merge 'Introduce native reversed format' from Botond Dénes
We define the native reverse format as a reversed mutation fragment
stream that is identical to one that would be emitted by a table with
the same schema but with reversed clustering order. The main difference
to the current format is how range tombstones are handled: instead of
looking at their start or end bound depending on the order, we always
use them as usual and the reversing reader swaps their bounds to
facilitate this. This allows us to treat reversed streams completely
transparently: just pass a reversed schema along with them and all the
reader, compacting and result building code is happily ignorant about
the fact that it is a reversed stream.

This series is the first step towards implementing efficient reverse
reads. It allows us to remove all the special casing we have in various
places for reverse reads and thus treating reverse streams transparently
in all the middle layers. The only layers that have to know about the
actual reversing are mutation sources proper. The plan is that when
reading in reverse we create a reversed schema in the top layer then
pass this down as the schema for the read. There are two layers that
will need to act on this reversed schema:
* The layer sitting on top of the first layer which still can't handle
  reversed streams, this layer will create a reversed reader to handle
  the transition.
* The mutation source proper: which will obtain the underlying schema
  and will emit the data in reverse order.

Once all the mutation sources are able to handle reverse reads, we can
get rid of the reverse reader entirely.

Refs: #1413

Tests: unit(dev)

TODO:
* v2
* more testing

Also on: https://github.com/denesb/scylla.git reverse-reads/v3

Changelog

v3:
* Drop the entire schema transformation mechanism;
* Drop reversing from `schema_builder()`;
* Don't keep any information about whether the schema is reversed or not
  in the schema itself, instead make reversing deterministic w.r.t.
  schema version, such that:
  `s.version() == s.make_reversed().make_reversed().version()`;
* Re-reverse range tombstones in `streaming_mutation_freezer`, so
  `reconcilable_results` sent to the coordinator during read repair
  still use the old reverse format;

v2:
* Add `data_type reversed(data_type)`;
* Add `bound_kind reverse_kind(bound_kind)`;
* Make new API safer to use:
    - `schema::underlying_type()`: return this when unengaged;
    - `schema::make_transformed()`: noop when applying the same
      transformation again;
* Generalize reversed into transformation. Add support to transferring
  to remote nodes and shards by way of making `schema_tables` aware of
  the transformation;
* Use reverse schema everywhere in reverse reader;

Closes #9184

* github.com:scylladb/scylla:
  range_tombstone_accumulator: drop _reversed flag
  test/boost/mutation_test: add test for mutation::consume() monotonicity
  test/boost/flat_mutation_reader_test: more reversed reader tests
  flat_mutation_reader: make_reversing_reader(): implement fast_forward_to(partition_range)
  flat_mutation_reader: make_reversing_reader(): take ownership of the reader
  test/lib/mutation_source_test: add consistent log to all methods
  mutation: introduce reverse()
  mutation_rebuilder: make it standalone
  mutation: make copy constructor compatible with mutation_opt
  treewide: switch to native reversed format for reverse reads
  mutation: consume(): add native reverse order
  mutation: consume(): don't include dummy rows
  query: add slice reversing functions
  partition_slice_builder: add range mutating methods
  partition_slice_builder: add constructor with slice
  query: specific_ranges: add non-const ranges accessor
  range_tombstone: add reverse()
  clustering_bounds_comparator: add reverse_kind()
  schema: introduce make_reversed()
  schema: add a transforming copy constructor
  utils: UUID_gen: introduce negate()
  types: add reversed(data_type)
  docs: design-notes: add reverse-reads.md
2021-09-09 15:50:22 +03:00
Botond Dénes
f02632aeb0 range_tombstone_accumulator: drop _reversed flag 2021-09-09 15:42:15 +03:00
Botond Dénes
f07805c3ef test/boost/mutation_test: add test for mutation::consume() monotonicity
In both forward and reverse modes.
2021-09-09 15:42:15 +03:00
Botond Dénes
3cc882f6a8 test/boost/flat_mutation_reader_test: more reversed reader tests
Check that the reverse reader emits a stream identical to that emitted
by a reader reading in native order from a table with reversed
clustering order.
2021-09-09 15:42:15 +03:00
Botond Dénes
350440b418 flat_mutation_reader: make_reversing_reader(): take ownership of the reader
Makes for much simpler client code.
2021-09-09 15:42:15 +03:00
Botond Dénes
c71a281e6b test/lib/mutation_source_test: add consistent log to all methods
Most test methods log their own name either via testlog.info() or
BOOST_TEST_MESSAGE() so failures can be more easily located. Not all do
however. This commit fixes this and also converts all those using
BOOST_TEST_MESSAGE() for this to testlog.info(), for consistency.
2021-09-09 15:42:15 +03:00
Botond Dénes
74a22a706b mutation_rebuilder: make it standalone
It no longer requires a wrapper object to become usable.
2021-09-09 15:42:15 +03:00
Botond Dénes
502a45ad58 treewide: switch to native reversed format for reverse reads
We define the native reverse format as a reversed mutation fragment
stream that is identical to one that would be emitted by a table with
the same schema but with reversed clustering order. The main difference
to the current format is how range tombstones are handled: instead of
looking at their start or end bound depending on the order, we always
use them as usual and the reversing reader swaps their bounds to
facilitate this. This allows us to treat reversed streams completely
transparently: just pass a reversed schema along with them and all the
reader, compacting and result building code is happily ignorant about
the fact that it is a reversed stream.
2021-09-09 15:42:15 +03:00
Botond Dénes
f200c8104a schema: introduce make_reversed()
`make_reversed()` creates a schema identical to the schema instance it is
called on, with clustering order reversed. To distinguish the reverse
schema from the original one, the node-id part of its version UUID is
bit-flipped. This ensures that reversing a schema twice will result in
the identical schema to the original one (although a different C++
object).

This reversed schema will be used in reversed reads, so intermediate
layers can be ignorant of the fact that the read happens in reverse.
2021-09-09 11:49:05 +03:00
Botond Dénes
65913f4cfa utils: UUID_gen: introduce negate() 2021-09-09 11:49:05 +03:00
Dejan Mircevski
6afdc6004c cql3/modification_statement: Replace empty-range check with null check
The empty-range check causes more bugs than it fixes.  Replace it with
an explicit check for =NULL (see #7852).

Fixes #9311.
Fixes #9290.

Tests: unit (dev), cql-pytest on Cassandra 4.0

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #9314
2021-09-09 10:56:13 +03:00
Dejan Mircevski
58a9a24ff0 cql3: Allow indexed query to select static columns
We previously forbade selecting a static column when an index is
used.  But Cassandra allows it, so we should, too -- see #8869.

After removing the static-column check, the existing code gets the
correct result without any further changes (though it may read
multiple rows from the same partition).

Fixes #8869.

Tests: unit (dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #9307
2021-09-08 08:22:59 +02:00
Tomasz Grabiec
9a77a03ea1 Merge "Remove most uses of gms::get_gossiper(), gms::get_local_gossiper()" from Avi
In the quest to have explicit dependencies and the ability to run
multiple nodes in one process, remove some uses of get_gossiper() and
get_local_gossiper() and replace them with parameters passed from main()
or its equivalents.

Some uses still remain, mostly in snitch, but this series removes a
majority.

* https://github.com/avikivity/scylla.git gossiper-deglobal-1/v1
  alternator: remove uses of get_local_gossiper()
  storage_service: remove stray get_gossiper(), get_local_gossiper() calls
  migration_manager: remove use of get_gossiper() from passive_announce()
  storage_proxy: start_hints_manager(): don't require caller to provide gossiper
  migration_manager: remove uses of get_local_gossiper()
  storage_proxy: remove uses of get_local_gossiper()
  gossiper: remove get_local_gossiper() from some inline helpers
  gossiper: remove get_gossiper() from stop_gossiping()
  gossiper: remove uses of get_local_gossiper for its rpc server
  api: remove use of get_local_gossiper()
  gossiper: remove calls to global get_gossiper from within the gossiper itself
2021-09-07 20:02:30 +02:00
Avi Kivity
d8f7903f60 migration_manager: remove uses of get_local_gossiper()
Pass gossiper as a constructor parameter instead. cql_test_env
gains a use of get_gossiper() instead, but at least these uses
are concentrated in one place.
2021-09-07 20:08:11 +03:00
Avi Kivity
71081be99c storage_proxy: remove uses of get_local_gossiper()
Pass the gossiper as a constructor parameter instead.
2021-09-07 17:14:09 +03:00
Avi Kivity
aa68927873 gossiper: remove get_local_gossiper() from some inline helpers
Some state accessors called get_local_gossiper(); this is removed
and replaced with a parameter. Some callers (redis, alternators)
now have the gossiper passed as a parameter during initialization
so they can use the adjusted API.
2021-09-07 17:03:37 +03:00
Avi Kivity
9ce1af9fcb gossiper: remove get_gossiper() from stop_gossiping()
Have the callers pass it instead, and they all have a reference
already except for cql_test_env (which will be fixed later).

The checks for initialization it does are likely unnecessary, but
we'll only be able to prove it when get_gossiper() is completely
removed.
2021-09-07 16:20:04 +03:00
Botond Dénes
23a56beccc tools: add schema_loader
A utility which can load a schema from a schema.cql file. The file has
to contain all the "dependencies" of the table: keyspace, UDTs, etc.
This will be used by the scylla-sstable-crawler in the next patch.
2021-09-07 15:47:22 +03:00
Benny Halevy
b7eaa22ce6 abstract_replication_strategy: create_replication_strategy: drop keyspace name parameter
It is not used.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210906133840.3307279-1-bhalevy@scylladb.com>
2021-09-06 16:51:21 +03:00
Avi Kivity
dfc135dbd1 Merge "Keep range_tombstone apart from list linkage" from Pavel E
"
There's a landmine buried in range_tombstone's move constructor.
Whoever tries to use it risks grabbing the tombstone out of the
containing list, thus leaking it or invalidating an iterator
pointing at it. There's a safer without_link move constructor
out there, but still.

To keep this place safe it's better to separate range_tombstone
from its linkage into any container. In particular, to keep the
range tombstones in a range_tombstone_list, this series introduces
an entry that keeps the tombstone _and_ the list hook (which is a
boost set hook).

The approach resembles the rows_entry::deletable_row pair.

tests: unit(dev, debug, patch from #9207)
fixes: #9243
"

* 'br-range-tombstone-vs-entry' of https://github.com/xemul/scylla:
  range_tombstone: Drop without-link constructor
  range_tombstone: Drop move_assign()
  range_tombstone: Move linkage into range_tombstone_entry
  range_tombstone_list: Prepare to use range_tombstone_entry
  range_tombstone, code: Add range_tombstone& getters
  range_tombstone_list: Factor out tombstone construction
  range_tombstone_list: Simplify (maybe) pop_front_and_lock()
  range_tombstone_list: De-templatize pop_as<>
  range_tombstone_list: Conceptualize erase_where()
  range_tombstone(_list): Mark some bits noexcept
  mutation: Use range_tombstone_list's iterators
  mutation_partition: Shorten memory usage calculation
  mutation_partition: Remove unused local variable
2021-09-05 17:26:13 +03:00
Dejan Mircevski
1fdaeca7d0 cql3: Reject updates with NULL key values
We were silently ignoring INSERTs with NULL values for primary-key
columns, which Cassandra rejects.  Fix it by rejecting any
modification_statement that would operate on empty partition or
clustering range.

This is the most direct fix, because range and slice are calculated in
one place for all modification statements.  It covers not only NULL
cases, but also impossible restrictions like c>0 AND c<0.
Unfortunately, Cassandra doesn't treat all modification statements
consistently, so this fix cannot fully match its behavior.  We err on
the side of tolerance, accepting some DELETE statements that Cassandra
rejects.  We add a TODO for rejecting such DELETEs later.

Fixes #7852.

Tests: unit (dev), cql-pytest against Cassandra 4.0

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #9286
2021-09-05 10:23:28 +03:00
Pavel Emelyanov
d6af441eaa range_tombstone: Move linkage into range_tombstone_entry
Now it's time to remove the boost set's hook from the range_tombstone
and keep it wrapped in another class when the tombstone's location
is the range_tombstone_list.

Also the added previously .tombstone() getters and the _entry alias
can be removed -- all the code can work with the new class.

Two places in the code that made use of without_link{} move-constructor
are patched to get the range_tombstone part from the respective _entry
with the same result.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-03 19:34:45 +03:00
Pavel Emelyanov
5515f7187d range_tombstone, code: Add range_tombstone& getters
Currently all the code operates on the range_tombstone class,
and many of those places get the range tombstone in question
from the range_tombstone_list. Next patches will make that list
carry (and return) a new object called range_tombstone_entry,
so all the code that expects to see the former one there will
need to be patched to get the range_tombstone from the _entry one.

This patch prepares the ground for that by introducing the

    range_tombstone& tombstone() { return *this; }

getter on the range_tombstone itself and patching all future
users of the _entry to call .tombstone() right now.

Next patch will remove those getters together with adding the new
range_tombstone_entry object thus automatically converting all
the patched places into using the entry in a proper way.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-03 19:34:45 +03:00
Nadav Har'El
b3f4a37a75 test/alternator: verify that nulls are valid inside string and bytes
The tests in this patch verify that null characters are valid characters
inside string and bytes (blob) attributes in Alternator. The tests
verify this for both key attributes and non-key attributes (since those
are serialized differently, it's important to check both cases).

The tests pass on both DynamoDB and Alternator - confirming that we
don't have a bug in this area.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210824163442.186881-1-nyh@scylladb.com>
2021-09-03 08:49:06 +02:00
Nadav Har'El
068c4283b7 test/cql-pytest: add tests for some undocumented cases of string types
This patch adds tests for two undocumented (as far as I can tell) corner
cases of CQL's string types:

1. The types "text" and "varchar" are not just similar - they are in
   fact exactly the same type.

2. All CQL string and blob types ("ascii", "text" or "varchar", "blob")
   allow the null character as a valid character inside them. They are
   *not* C strings that get terminated by the first null.

These tests pass on both Cassandra and Scylla, so did not expose any
bug, but having such tests is useful to understand these (so-far)
undocumented behaviors - so we can later document them.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210824225641.194146-1-nyh@scylladb.com>
2021-09-02 15:45:47 +03:00
Avi Kivity
403645f58c Merge "raft: miscellaneous fixes" from Gleb
* 'raft-misc-v3' of github.com:scylladb/scylla-dev:
  raft: rename snapshot into snapshot_descriptor
  raft: drop snapshot if its application failed
  raft: fix local snapshot detection
  raft: replication_test: store multiple snapshots in a state machine
  raft: do not wait for entry to become stable before replicating it
2021-09-02 11:25:06 +03:00
Liu Lan
a5c54867f8 alternator: Exclusive start key must lie within the segment
...when using Segment/TotalSegment option.

The requirement is not specified in DynamoDB documents, but found
in DynamoDB Local:

{"__type":"com.amazon.coral.validate#ValidationException",
"message":"Exclusive start key must lie within the segment"}

Fixes #9272

Signed-off-by: Liu Lan <liulan_yewu@cmss.chinamobile.com>

Closes #9270
2021-09-01 11:05:45 +03:00
Avi Kivity
8b59e3a0b1 Merge ' cql3: Demand ALLOW FILTERING for unlimited, sliced partitions ' from Dejan Mircevski
Return the pre-6773563d3 behavior of demanding ALLOW FILTERING when a partition slice is requested on a potentially unlimited number of partitions.  Put it behind a flag defaulting to "off" for now.

Fixes #7608; see comments there for justification.

Tests: unit (debug, dev), dtest (cql_additional_test, paging_test)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #9126

* github.com:scylladb/scylla:
  cql3: Demand ALLOW FILTERING for unlimited, sliced partitions
  cql3: Track warnings in prepared_statement
  test: Use ALLOW FILTERING more strictly
  cql3: Add statement_restrictions::to_string
2021-08-31 18:05:26 +03:00
Dejan Mircevski
2f28f68e84 cql3: Demand ALLOW FILTERING for unlimited, sliced partitions
When a query requests a partition slice but doesn't limit the number
of partitions, require that it also says ALLOW FILTERING.  Although
do_filter() isn't invoked for such queries, the performance can still
be unexpectedly slow, and we want to signal that to the user by
demanding they explicitly say ALLOW FILTERING.

Because we now reject queries that worked fine before, existing
applications can break.  Therefore, the behavior is controlled by a
flag currently defaulting to off.  We will default to "on" in the next
Scylla version.

Fixes #7608; see comments there for justification.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-08-31 10:45:41 -04:00
Pavel Emelyanov
e26a6c1acc btree, test: Test exception safety and non-leakness of btree::clone_from
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-08-31 12:23:49 +03:00
Pavel Emelyanov
da38038222 btree, test: Test key copy constructor may throw
It calls the tree_test_key_base copy constructor which
is throwing.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-08-31 12:23:49 +03:00
Gleb Natapov
ce40b01b07 raft: rename snapshot into snapshot_descriptor
The snapshot structure does not contain the snapshot itself but only
refers to it through its id. Rename it to snapshot_descriptor for clarity.
2021-08-29 12:53:03 +03:00
Gleb Natapov
80a392a444 raft: replication_test: store multiple snapshots in a state machine
State machine should be able to store more than one snapshot at a time
(one may be the currently used one and another is transferred from a
leader but not applied yet).
2021-08-29 12:53:03 +03:00
Gleb Natapov
5e1d589872 raft: do not wait for entry to become stable before replicating it
Since io_fiber persists entries before sending out messages, even
non-stable entries will become stable before being observed by other nodes.

This patch also moves generation of append messages into the
get_output() call because without that change we would lose batching,
since each advance of last_idx would generate a new append message.
2021-08-29 12:48:15 +03:00
Pavel Emelyanov
60a7ca62f2 storage_service: Drop .enable_all_features()
This method has nothing to do with storage service and
is only needed to move feature service options from one
method to another. This can be done by the only caller
of it.

tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210827133954.29535-1-xemul@scylladb.com>
2021-08-29 11:27:05 +03:00
Pavel Solodovnikov
b00443ab87 test: adjust schema_change_test to include new system.raft_config table
Check that the new table uses null sharder.

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-08-27 09:30:17 +03:00
Pavel Solodovnikov
8d3c0ee9b6 raft: new schema for storing raft snapshots
Previously, the layout for storing raft snapshot
descriptors contained a `config` field, which had `blob`
data type.

That means `raft::configuration` for the snapshot was serialized
as a whole in binary form. It's convenient to implement and
is the most compact form of representing the data, but:

1. Hard to debug due to the need to de-serialize the data.
2. Plants a time bomb w.r.t. future changes to the data layout and
   to the documentation.

Remove the `config` field from `system.raft_snapshots` and
extract it to a separate `system.raft_config` table to store
the data in exploded form.

Also, modify the schema of `system.raft_snapshots` table in
the following way: add a `server_id` field as a part of
composite partition key ((group_id, server_id)) to
be able to start multiple raft servers belonging to one raft
group on the same scylla node.

Rename `id` field in `raft_snapshots` to `snapshot_id` so
it's self-documenting.

Remove `snapshot_id` from the clustering key since a given server
can have only one snapshot installed at a time.

Note that the `raft::server_address` structure contains an opaque
`info` member, which is `bytes`, but in the `raft_config` table
we use an `ip_addr inet` field instead. We always know that the
corresponding member field is going to contain an IP address (either v4
or v6) of a given raft server.

So, now the snapshots schema looks like this:

    CREATE TABLE raft_snapshots (
        group_id timeuuid,
        server_id uuid,
        snapshot_id uuid,
        idx int,
        term int,
        -- no `config` field here, moved to `raft_config` table
        PRIMARY KEY ((group_id, server_id))
    )

    CREATE TABLE raft_config (
         group_id timeuuid,
         my_server_id uuid,
         server_id uuid,
         disposition text, -- can be either 'CURRENT' or 'PREVIOUS'
         can_vote bool,
         ip_addr inet,
         PRIMARY KEY ((group_id, my_server_id), server_id, disposition)
    );

This way it's much easier to extend the schema with new fields,
very easy to debug and inspect via CQL, and it's much more descriptive
in terms of self-documentation.

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-08-27 09:24:46 +03:00
Pavel Solodovnikov
0a8faee660 raft: pass server id to raft_sys_table_storage instance
Preparations for changing raft snapshots schema.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-08-27 09:24:20 +03:00
Pavel Solodovnikov
c0854a0f62 raft: create system tables only when raft experimental feature is set
Also introduce a tiny function to return raft-enabled db config
for cql testing.

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210826091432.279532-1-pa.solodovnikov@scylladb.com>
2021-08-26 12:21:12 +03:00
Avi Kivity
acf8da2bce Merge "flat_mutation_reader: keep timeout in permit" from Benny
"
This series moves the timeout parameter, that is passed to most
f_m_r methods, into the reader_permit.  This eliminates
the need to pass the timeout around, as it's taken
from the permit when needed.

The permit timeout is updated in certain cases
when the permit/reader is paused and retrieved
later on for reuse.

Following are perf_simple_query results showing ~1%
reduction in insns/op and corresponding increase in tps.

$ build/release/test/perf/perf_simple_query -c 1 --operations-per-shard 1000000 --task-quota-ms 10

Before:
102500.38 tps ( 75.1 allocs/op,  12.1 tasks/op,   45620 insns/op)

After:
103957.53 tps ( 75.1 allocs/op,  12.1 tasks/op,   45372 insns/op)

Test: unit(dev)
DTest:
    repair_additional_test.py:RepairAdditionalTest.repair_abort_test (release)
    materialized_views_test.py:TestMaterializedViews.remove_node_during_mv_insert_3_nodes_test (release)
    materialized_views_test.py:InterruptBuildProcess.interrupt_build_process_with_resharding_half_to_max_test (release)
    migration_test.py:TTLWithMigrate.big_table_with_ttls_test (release)
"

* tag 'reader_permit-timeout-v6' of github.com:bhalevy/scylla:
  flat_mutation_reader: get rid of timeout parameter
  reader_concurrency_semaphore: use permit timeout for admission
  reader_concurrency_semaphore: adjust reactivated reader timeout
  multishard_mutation_query: create_reader: validate saved reader permit
  repair: row_level: read_mutation_fragment: set reader timeout
  flat_mutation_reader: maybe_timed_out: use permit timeout
  test: sstable_datafile_test: add sstable_reader_with_timeout
  reader_permit: add timeout member
2021-08-25 17:51:10 +03:00
Avi Kivity
993f824cfd Merge "raft: implement linearisable reads on a follower" from Gleb and Kostja
"
This series implements section 6.4 of the Raft PhD. It allows doing
linearisable reads on a follower, bypassing the raft log entirely. After
this series server::read_barrier can be executed on a follower as well as
on a leader, and after it completes the local user's state machine state
can be accessed directly.
"

* 'raft-read-v9' of github.com:scylladb/scylla-dev:
  raft: test: add read_barrier test to replication_test
  raft: test: add read_barrier tests to fsm_test
  raft: make read_barrier work on a follower as well as on a leader
  raft: add a function to wait for an index to be applied
  raft: (server) add a helper to wait through uncertainty period
  raft: make fsm::current_leader() public
  raft: add hasher for raft::internal::tagged_uint64
  serialize: add serialized for std::monostate
  raft: fix indentation in applier_fiber
2021-08-25 13:11:35 +03:00
Gleb Natapov
3ff6f76cef raft: test: add read_barrier test to replication_test 2021-08-25 08:57:13 +03:00
Gleb Natapov
ad2c2abcb8 raft: test: add read_barrier tests to fsm_test 2021-08-25 08:57:13 +03:00