2172 Commits

Author SHA1 Message Date
Botond Dénes
23a56beccc tools: add schema_loader
A utility which can load a schema from a schema.cql file. The file has
to contain all the "dependencies" of the table: keyspace, UDTs, etc.
This will be used by the scylla-sstable-crawler in the next patch.
2021-09-07 15:47:22 +03:00
Avi Kivity
dfc135dbd1 Merge "Keep range_tombstone apart from list linkage" from Pavel E
"
There's a landmine buried in range_tombstone's move constructor.
Whoever tries to use it risks grabbing the tombstone from the
containing list, thus leaking it and possibly invalidating an
iterator pointing at it. There's a safe without_link move
constructor out there, but still.

To keep this place safe it's better to separate range_tombstone
from its linkage into any container. In particular, to keep range
tombstones in a range_tombstone_list, there is now an entry that keeps
the tombstone _and_ the list hook (which is a boost set hook).

The approach resembles the rows_entry::deletable_row pair.

tests: unit(dev, debug, patch from #9207)
fixes: #9243
"

* 'br-range-tombstone-vs-entry' of https://github.com/xemul/scylla:
  range_tombstone: Drop without-link constructor
  range_tombstone: Drop move_assign()
  range_tombstone: Move linkage into range_tombstone_entry
  range_tombstone_list: Prepare to use range_tombstone_entry
  range_tombstone, code: Add range_tombstone& getters
  range_tombstone_list: Factor out tombstone construction
  range_tombstone_list: Simplify (maybe) pop_front_and_lock()
  range_tombstone_list: De-templatize pop_as<>
  range_tombstone_list: Conceptualize erase_where()
  range_tombstone(_list): Mark some bits noexcept
  mutation: Use range_tombstone_list's iterators
  mutation_partition: Shorten memory usage calculation
  mutation_partition: Remove unused local variable
2021-09-05 17:26:13 +03:00
Dejan Mircevski
1fdaeca7d0 cql3: Reject updates with NULL key values
We were silently ignoring INSERTs with NULL values for primary-key
columns, which Cassandra rejects.  Fix it by rejecting any
modification_statement that would operate on empty partition or
clustering range.

This is the most direct fix, because range and slice are calculated in
one place for all modification statements.  It covers not only NULL
cases, but also impossible restrictions like c>0 AND c<0.
Unfortunately, Cassandra doesn't treat all modification statements
consistently, so this fix cannot fully match its behavior.  We err on
the side of tolerance, accepting some DELETE statements that Cassandra
rejects.  We add a TODO for rejecting such DELETEs later.
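
A minimal sketch of the idea above, not the actual cql3 code: the
open_range type and the validate_modification()/is_rejected() helpers
are hypothetical stand-ins, and the key column is assumed to be an
integer. It only illustrates rejecting a modification whose
restrictions (such as c>0 AND c<0) leave an empty range.

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>

// Hypothetical, simplified restriction on an integer key column:
// "c > lo AND c < hi", where either bound may be absent.
// This is not Scylla's real range type.
struct open_range {
    std::optional<int> lo;
    std::optional<int> hi;

    // True when no integer can satisfy both bounds, e.g. c>0 AND c<0.
    bool empty() const {
        return lo && hi && static_cast<long long>(*hi) - *lo <= 1;
    }
};

// Reject, rather than silently ignore, a statement that can match nothing.
inline void validate_modification(const open_range& r) {
    if (r.empty()) {
        throw std::invalid_argument("modification operates on an empty key range");
    }
}

// Usage helper: returns true if the statement would be rejected.
inline bool is_rejected(const open_range& r) {
    try {
        validate_modification(r);
        return false;
    } catch (const std::invalid_argument&) {
        return true;
    }
}
```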

Fixes #7852.

Tests: unit (dev), cql-pytest against Cassandra 4.0

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #9286
2021-09-05 10:23:28 +03:00
Pavel Emelyanov
d6af441eaa range_tombstone: Move linkage into range_tombstone_entry
Now it's time to remove the boost set hook from range_tombstone
and keep it in a wrapper class for tombstones that live in a
range_tombstone_list.

Also, the previously added .tombstone() getters and the _entry alias
can be removed -- all the code can work with the new class.

Two places in the code that made use of without_link{} move-constructor
are patched to get the range_tombstone part from the respective _entry
with the same result.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-03 19:34:45 +03:00
Pavel Emelyanov
5515f7187d range_tombstone, code: Add range_tombstone& getters
Currently all the code operates on the range_tombstone class,
and many of those places get the range tombstone in question
from the range_tombstone_list. Next patches will make that list
carry (and return) some new object called range_tombstone_entry,
so all the code that expects to see the former one there will
need to be patched to get the range_tombstone from the _entry one.

This patch prepares the ground for that by introducing the

    range_tombstone& tombstone() { return *this; }

getter on the range_tombstone itself and patching all future
users of the _entry to call .tombstone() right now.

Next patch will remove those getters together with adding the new
range_tombstone_entry object thus automatically converting all
the patched places into using the entry in a proper way.
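
A rough sketch of the transition described above, under simplifying
assumptions: list_hook is a hypothetical placeholder for the boost set
hook, the tombstone fields are invented for illustration, and
timestamp_of() is a made-up caller. It shows why code written against
.tombstone() keeps working once the list starts returning an entry.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical stand-in for the boost intrusive set hook.
struct list_hook {};

// Plain value type with no linkage, so moving it cannot disturb a container.
struct range_tombstone {
    std::string start;
    std::string end;
    long timestamp = 0;

    // Transitional getter: callers written as x.tombstone() keep compiling
    // once x becomes an entry instead of a bare tombstone.
    range_tombstone& tombstone() { return *this; }
};

// The entry owns the tombstone _and_ the hook used by the list.
class range_tombstone_entry {
    range_tombstone _tombstone;
    [[maybe_unused]] list_hook _link;
public:
    explicit range_tombstone_entry(range_tombstone rt)
        : _tombstone(std::move(rt)) {}
    range_tombstone& tombstone() { return _tombstone; }
};

// Generic caller: works with either type because both expose tombstone().
template <typename T>
long timestamp_of(T& x) {
    return x.tombstone().timestamp;
}
```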

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-03 19:34:45 +03:00
Nadav Har'El
b3f4a37a75 test/alternator: verify that nulls are valid inside string and bytes
The tests in this patch verify that null characters are valid characters
inside string and bytes (blob) attributes in Alternator. The tests
verify this for both key attributes and non-key attributes (since those
are serialized differently, it's important to check both cases).

The tests pass on both DynamoDB and Alternator - confirming that we
don't have a bug in this area.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210824163442.186881-1-nyh@scylladb.com>
2021-09-03 08:49:06 +02:00
Nadav Har'El
068c4283b7 test/cql-pytest: add tests for some undocumented cases of string types
This patch adds tests for two undocumented (as far as I can tell) corner
cases of CQL's string types:

1. The types "text" and "varchar" are not just similar - they are in
   fact exactly the same type.

2. All CQL string and blob types ("ascii", "text" or "varchar", "blob")
   allow the null character as a valid character inside them. They are
   *not* C strings that get terminated by the first null.

These tests pass on both Cassandra and Scylla, so did not expose any
bug, but having such tests is useful to understand these (so-far)
undocumented behaviors - so we can later document them.
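
To illustrate point 2 outside of CQL, here is a small C++ sketch (the
helper names are made up for this example) contrasting a
length-carrying string, which keeps an embedded null, with a C-string
view of the same bytes, which stops at the first null.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// A length-carrying string keeps every byte, including embedded nulls.
inline std::string with_embedded_null() {
    return std::string("a\0b", 3); // explicit length: 3 bytes
}

// What a C-string view of the same bytes would see: it stops at the first null.
inline std::size_t c_string_length(const std::string& s) {
    return std::strlen(s.c_str());
}
```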

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210824225641.194146-1-nyh@scylladb.com>
2021-09-02 15:45:47 +03:00
Avi Kivity
403645f58c Merge "raft: miscellaneous fixes" from Gleb
* 'raft-misc-v3' of github.com:scylladb/scylla-dev:
  raft: rename snapshot into snapshot_descriptor
  raft: drop snapshot if is application failed
  raft: fix local snapshot detection
  raft: replication_test: store multiple snapshots in a state machine
  raft: do not wait for entry to become stable before replicate it
2021-09-02 11:25:06 +03:00
Liu Lan
a5c54867f8 alternator: Exclusive start key must lie within the segment
...when using the Segment/TotalSegments options.

The requirement is not specified in the DynamoDB documentation, but
was found in DynamoDB Local:

{"__type":"com.amazon.coral.validate#ValidationException",
"message":"Exclusive start key must lie within the segment"}

Fixes #9272
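
A sketch of the kind of check this implies. The mapping from key to
segment here is purely an assumption for illustration (a 32-bit hash
space split into equal-width segments); it is not how Alternator or
DynamoDB actually assign keys to segments, and both function names are
hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Assumed mapping: a 32-bit hash space split into equal-width segments.
// Precondition: total_segments > 0.
inline std::uint32_t segment_of(std::uint32_t hash, std::uint32_t total_segments) {
    const std::uint64_t space = std::uint64_t{1} << 32;
    // Segment width rounded up, so every hash maps to a valid segment.
    const std::uint64_t width = (space + total_segments - 1) / total_segments;
    return static_cast<std::uint32_t>(hash / width);
}

// Throws if the start key does not belong to the requested segment.
inline void validate_exclusive_start_key(std::uint32_t key_hash,
                                         std::uint32_t segment,
                                         std::uint32_t total_segments) {
    if (total_segments == 0 || segment >= total_segments) {
        throw std::invalid_argument("invalid Segment/TotalSegments");
    }
    if (segment_of(key_hash, total_segments) != segment) {
        throw std::invalid_argument(
            "Exclusive start key must lie within the segment");
    }
}
```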

Signed-off-by: Liu Lan <liulan_yewu@cmss.chinamobile.com>

Closes #9270
2021-09-01 11:05:45 +03:00
Avi Kivity
8b59e3a0b1 Merge ' cql3: Demand ALLOW FILTERING for unlimited, sliced partitions ' from Dejan Mircevski
Restore the pre-6773563d3 behavior of demanding ALLOW FILTERING when a partition slice is requested on a potentially unlimited number of partitions.  Put it behind a flag defaulting to "off" for now.

Fixes #7608; see comments there for justification.

Tests: unit (debug, dev), dtest (cql_additional_test, paging_test)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #9126

* github.com:scylladb/scylla:
  cql3: Demand ALLOW FILTERING for unlimited, sliced partitions
  cql3: Track warnings in prepared_statement
  test: Use ALLOW FILTERING more strictly
  cql3: Add statement_restrictions::to_string
2021-08-31 18:05:26 +03:00
Dejan Mircevski
2f28f68e84 cql3: Demand ALLOW FILTERING for unlimited, sliced partitions
When a query requests a partition slice but doesn't limit the number
of partitions, require that it also says ALLOW FILTERING.  Although
do_filter() isn't invoked for such queries, the performance can still
be unexpectedly slow, and we want to signal that to the user by
demanding they explicitly say ALLOW FILTERING.

Because we now reject queries that worked fine before, existing
applications can break.  Therefore, the behavior is controlled by a
flag currently defaulting to off.  We will default to "on" in the next
Scylla version.
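
The decision described above can be sketched roughly as follows. The
query_shape struct, its fields, and the strict flag parameter are
hypothetical names chosen for this illustration, not the real cql3
types or the real configuration option.

```cpp
#include <cassert>

// Hypothetical summary of a SELECT's restrictions.
struct query_shape {
    bool limits_partitions;    // partition key is restricted
    bool has_partition_slice;  // clustering slice is requested
    bool allow_filtering;      // statement says ALLOW FILTERING
};

// Returns true when the query should be rejected.
// With the flag off (the current default) nothing changes for existing queries.
inline bool must_reject(const query_shape& q, bool strict_flag_enabled) {
    const bool unlimited_sliced = !q.limits_partitions && q.has_partition_slice;
    return strict_flag_enabled && unlimited_sliced && !q.allow_filtering;
}
```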

Fixes #7608; see comments there for justification.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-08-31 10:45:41 -04:00
Pavel Emelyanov
e26a6c1acc btree, test: Test exception safety and non-leakness of btree::clone_from
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-08-31 12:23:49 +03:00
Pavel Emelyanov
da38038222 btree, test: Test key copy constructor may throw
It calls the tree_test_key_base copy constructor which
is throwing.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-08-31 12:23:49 +03:00
Gleb Natapov
ce40b01b07 raft: rename snapshot into snapshot_descriptor
The snapshot structure does not contain the snapshot itself but only
refers to it through its id. Rename it to snapshot_descriptor for clarity.
2021-08-29 12:53:03 +03:00
Gleb Natapov
80a392a444 raft: replication_test: store multiple snapshots in a state machine
State machine should be able to store more than one snapshot at a time
(one may be the currently used one and another is transferred from a
leader but not applied yet).
2021-08-29 12:53:03 +03:00
Gleb Natapov
5e1d589872 raft: do not wait for entry to become stable before replicate it
Since io_fiber persists entries before sending out messages, even
non-stable entries will become stable before being observed by other
nodes.

This patch also moves generation of append messages into the
get_output() call, because without the change we would lose batching:
each advance of last_idx would generate a new append message.
2021-08-29 12:48:15 +03:00
Pavel Emelyanov
60a7ca62f2 storage_service: Drop .enable_all_features()
This method has nothing to do with storage service and
is only needed to move feature service options from one
method to another. This can be done by the only caller
of it.

tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210827133954.29535-1-xemul@scylladb.com>
2021-08-29 11:27:05 +03:00
Pavel Solodovnikov
c0854a0f62 raft: create system tables only when raft experimental feature is set
Also introduce a tiny function to return raft-enabled db config
for cql testing.

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210826091432.279532-1-pa.solodovnikov@scylladb.com>
2021-08-26 12:21:12 +03:00
Avi Kivity
acf8da2bce Merge "flat_mutation_reader: keep timeout in permit" from Benny
"
This series moves the timeout parameter, that is passed to most
f_m_r methods, into the reader_permit.  This eliminates
the need to pass the timeout around, as it's taken
from the permit when needed.

The permit timeout is updated in certain cases
when the permit/reader is paused and retrieved
later on for reuse.

Following are perf_simple_query results showing ~1%
reduction in insns/op and corresponding increase in tps.

$ build/release/test/perf/perf_simple_query -c 1 --operations-per-shard 1000000 --task-quota-ms 10

Before:
102500.38 tps ( 75.1 allocs/op,  12.1 tasks/op,   45620 insns/op)

After:
103957.53 tps ( 75.1 allocs/op,  12.1 tasks/op,   45372 insns/op)

Test: unit(dev)
DTest:
    repair_additional_test.py:RepairAdditionalTest.repair_abort_test (release)
    materialized_views_test.py:TestMaterializedViews.remove_node_during_mv_insert_3_nodes_test (release)
    materialized_views_test.py:InterruptBuildProcess.interrupt_build_process_with_resharding_half_to_max_test (release)
    migration_test.py:TTLWithMigrate.big_table_with_ttls_test (release)
"

* tag 'reader_permit-timeout-v6' of github.com:bhalevy/scylla:
  flat_mutation_reader: get rid of timeout parameter
  reader_concurrency_semaphore: use permit timeout for admission
  reader_concurrency_semaphore: adjust reactivated reader timeout
  multishard_mutation_query: create_reader: validate saved reader permit
  repair: row_level: read_mutation_fragment: set reader timeout
  flat_mutation_reader: maybe_timed_out: use permit timeout
  test: sstable_datafile_test: add sstable_reader_with_timeout
  reader_permit: add timeout member
2021-08-25 17:51:10 +03:00
Avi Kivity
993f824cfd Merge "raft: implement linearisable reads on a follower" from Gleb and Kostja
"
This series implements section 6.4 of the Raft PhD. It allows
linearisable reads on a follower, bypassing the raft log entirely.
After this series, server::read_barrier can be executed on a follower
as well as on a leader, and after it completes the local user's state
machine state can be accessed directly.
"

* 'raft-read-v9' of github.com:scylladb/scylla-dev:
  raft: test: add read_barrier test to replication_test
  raft: test: add read_barrier tests to fsm_test
  raft: make read_barrier work on a follower as well as on a leader
  raft: add a function to wait for an index to be applied
  raft: (server) add a helper to wait through uncertainty period
  raft: make fsm::current_leader() public
  raft: add hasher for raft::internal::tagged_uint64
  serialize: add serialized for std::monostate
  raft: fix indentation in applier_fiber
2021-08-25 13:11:35 +03:00
Gleb Natapov
3ff6f76cef raft: test: add read_barrier test to replication_test
2021-08-25 08:57:13 +03:00
Gleb Natapov
ad2c2abcb8 raft: test: add read_barrier tests to fsm_test
2021-08-25 08:57:13 +03:00
Gleb Natapov
03a266d73b raft: make read_barrier work on a follower as well as on a leader
This patch implements a Raft extension that allows performing
linearisable reads by accessing the local state machine. The extension
is described in section 6.4 of the PhD. To sum it up: to perform a read
barrier, a follower needs to ask the leader for the last committed
index that it knows about. The leader must make sure that it is still
the leader before answering, by communicating with a quorum. When the
follower gets the index back, it waits for it to be applied, and by
that completes the read_barrier invocation.

The patch adds three new RPCs: read_barrier, read_barrier_reply and
execute_read_barrier_on_leader. The last one is the one a follower uses
to ask a leader about the safe index it can read. The first two are
used by a leader to communicate with a quorum.
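
The flow above can be sketched, very loosely, as plain synchronous
C++. The leader_view and follower_state types are hypothetical and
stand in for the real RPC exchange and futures; the quorum round is
reduced to a boolean.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical leader-side view after the read_barrier / read_barrier_reply round.
struct leader_view {
    std::uint64_t commit_idx;
    bool confirmed_by_quorum;

    // The leader only answers if a quorum confirmed it is still the leader.
    std::optional<std::uint64_t> safe_read_index() const {
        if (!confirmed_by_quorum) {
            return std::nullopt;
        }
        return commit_idx;
    }
};

// Hypothetical follower-side state.
struct follower_state {
    std::uint64_t applied_idx = 0;

    void apply_up_to(std::uint64_t idx) {
        if (idx > applied_idx) {
            applied_idx = idx;
        }
    }

    // The barrier completes once the index returned by the leader is applied.
    bool barrier_complete(std::uint64_t read_idx) const {
        return applied_idx >= read_idx;
    }
};
```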
2021-08-25 08:57:13 +03:00
Nadav Har'El
cf06b7cd40 test/alternator: correct some typos in comments
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210729125317.1610573-1-nyh@scylladb.com>
2021-08-24 19:43:29 +03:00
Avi Kivity
4a42b69ba8 Merge "raft: testing: many nodes test" from Alejo
"
Factor out replication test, make it work with different clocks, add
some features, and add a many nodes test with steady_clock. Also
refactor common test helper.

The many nodes test passes in release and dev modes with a normal tick
of 100ms for up to 1000 servers. Debug mode handles far fewer servers
due to the lack of optimizations, so it's only tested with smaller
numbers.

Tests: unit ({dev}), unit ({debug}), unit ({release})
"

* 'raft-many-22-v12' of https://github.com/alecco/scylla: (21 commits)
  raft: candidate timeout proportional to cluster size
  raft: testing: many nodes test
  raft: replication test: remove unused tick_all
  raft: replication test: delays
  raft: replication test: packet drop rpc helper
  raft: replication test: connectivity configuration
  raft: replication test: rpc network map in raft_cluster
  raft: replication test: use minimum granularity
  raft: replication test: minor: rename local to int ids
  raft: replication test: fix restart_tickers when partitioning
  raft: replication test: partition ranges
  raft: replication test: isolate one server
  raft: replication test: move objects out of header
  raft: replication test: make dummy command const
  raft: replication test: template clock type
  raft: replication test: tick delta inside raft_cluster
  raft: replication test: style - member initializer
  raft: replication test: move common code out
  raft: testing: refactor helper
  raft: log election stages
  ...
2021-08-24 17:05:05 +03:00
Benny Halevy
4476800493 flat_mutation_reader: get rid of timeout parameter
Now that the timeout is taken from the reader_permit.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-08-24 16:30:51 +03:00
Benny Halevy
4e3dcfd7d6 reader_concurrency_semaphore: use permit timeout for admission
Now that the timeout is stored in the reader
permit use it for admission rather than a timeout
parameter.

Note that evictable_reader::next_partition
currently passes db::no_timeout to
resume_or_create_reader, which is propagated to
maybe_wait_readmission, but this seems to be
an oversight of the f_m_r api, which doesn't
pass a timeout to next_partition().

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-08-24 16:30:51 +03:00
Benny Halevy
9b0b13c450 reader_concurrency_semaphore: adjust reactivated reader timeout
Update the reader's timeout where needed
after unregistering inactive_read.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-08-24 16:30:51 +03:00
Benny Halevy
f25aabf1b2 flat_mutation_reader: maybe_timed_out: use permit timeout
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-08-24 14:29:44 +03:00
Benny Halevy
46fb7fe68e test: sstable_datafile_test: add sstable_reader_with_timeout
Verify that the sstable reader (for the highest supported version)
times out properly.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-08-24 14:29:44 +03:00
Benny Halevy
fe479aca1d reader_permit: add timeout member
To replace the timeout parameter passed
to flat_mutation_reader methods.
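
As a rough illustration of the design (a sketch only; the permit and
reader classes below are simplified, hypothetical stand-ins for
reader_permit and flat_mutation_reader, and std::chrono::steady_clock
is an assumed clock): the deadline lives in the permit, so reader
methods no longer take a timeout argument.

```cpp
#include <cassert>
#include <chrono>

using clock_type = std::chrono::steady_clock;

// Simplified permit: carries the deadline for all operations of its reader.
class permit {
    clock_type::time_point _timeout = clock_type::time_point::max();
public:
    clock_type::time_point timeout() const { return _timeout; }
    // Can be updated, e.g. when a paused reader is picked up for reuse.
    void set_timeout(clock_type::time_point t) { _timeout = t; }
};

// Simplified reader: no timeout parameter on its methods.
class reader {
    permit& _permit;
public:
    explicit reader(permit& p) : _permit(p) {}
    // The deadline is read from the permit when it is needed.
    bool timed_out(clock_type::time_point now) const {
        return now >= _permit.timeout();
    }
};
```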

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-08-24 14:29:44 +03:00
Alejo Sanchez
a5c74a6442 raft: candidate timeout proportional to cluster size
To avoid dueling candidates with large clusters, make the timeout
proportional to the cluster size.
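
One way such scaling could look is sketched below. The formula is an
assumption made for illustration (random spread of at least the base
timeout, growing with the number of voters), and the function name and
parameters are hypothetical; the commit message does not spell out the
exact computation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>

// Returns an election timeout in ticks: the base timeout plus a random
// spread whose upper bound grows with the number of voters.
inline std::size_t randomized_election_timeout(std::size_t base_ticks,
                                               std::size_t voters,
                                               std::mt19937& rng) {
    const std::size_t spread = std::max(base_ticks, voters);
    std::uniform_int_distribution<std::size_t> dist(1, spread);
    return base_ticks + dist(rng);
}
```

With a larger spread, candidates in a big cluster are less likely to
time out in the same tick and split the vote.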

Debug mode is too slow for a test of 1000 nodes so it's disabled, but
the test passes for release and dev modes.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-24 13:09:01 +02:00
Alejo Sanchez
7206eae16e raft: testing: many nodes test
Tests with many nodes and realistic timers and ticks.

Network delays are kept as a fraction of ticks. (e.g. 20/100)

Tests with 600 or more nodes hang in debug mode.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-24 13:09:01 +02:00
Alejo Sanchez
87a03a3485 raft: replication test: remove unused tick_all
Tests now wait for normal ticks for election, so remove the deprecated
tick_all helper.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-24 13:09:01 +02:00
Alejo Sanchez
14c214d73e raft: replication test: delays
Allow test supplied delays for rpc communication.

Allow supplying network delay, local delay (nodes within the same
server), how many nodes are local, and an extra small delay simulating
local load.

Modify rpc class to support delays. If delays are enabled, it no longer
directly calls the other node's server code but it schedules it to be
called later. This makes the test more realistic as in the previous
version the first candidate was always going to get to all followers
first, preventing a dueling candidates scenario.

Previously, tickers were all scheduled at the same time, so there was no
spread of them across the tick time. Now these tickers are scheduled
with a uniform spread across this time (tick delta).
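
A small sketch of spreading ticker start times across one tick delta.
It assumes "uniform spread" means evenly spaced offsets (it could also
mean uniformly random ones), and ticker_offsets() is a hypothetical
helper, not a function from the test.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Returns the start offset of each of n tickers within one tick delta,
// evenly spaced: 0, delta/n, 2*delta/n, ...
inline std::vector<std::chrono::milliseconds>
ticker_offsets(std::size_t n, std::chrono::milliseconds tick_delta) {
    std::vector<std::chrono::milliseconds> offsets;
    offsets.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        const long long ms = tick_delta.count() * static_cast<long long>(i)
                             / static_cast<long long>(n);
        offsets.push_back(std::chrono::milliseconds(ms));
    }
    return offsets;
}
```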

Also, previously, custom free elections used tick_all(), which
traversed _in_configuration sequentially and ticked each server. This,
combined with rpc outbound directly calling methods in the other server
without yielding, made free elections unrealistic: the order was
predetermined and the first candidate always won. This patch changes
this behavior. The free election uses normal tickers (now uniformly
distributed in tick delay time) and its loop waits for tick delay time
(yielding) and checks if there's a new leader. Also note the order might
not be the same in debug mode if more than one tick is scheduled.

As rpc messages are sent delayed, network connectivity needs to be
checked again before calling the function on the remote side.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-24 13:05:53 +02:00
Alejo Sanchez
db23823c77 raft: replication test: packet drop rpc helper
Add a helper to check if a packet should be dropped.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-23 17:50:16 +02:00
Alejo Sanchez
497af3167f raft: replication test: connectivity configuration
Pass packet drops within connectivity configuration struct.
Default to no packet drops.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-23 17:50:16 +02:00
Alejo Sanchez
e4d5428e8a raft: replication test: rpc network map in raft_cluster
Move the rpc network map into raft_cluster; it is no longer a static
member of the rpc class.
2021-08-23 17:50:16 +02:00
Alejo Sanchez
192ac5be4c raft: replication test: use minimum granularity
seastar lowres_clock minimum granularity is 10ms, not 1ms.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-23 17:50:16 +02:00
Alejo Sanchez
5cfe6c1ca2 raft: replication test: minor: rename local to int ids
For clarity, name 0-based integer ids as int ids not local.
This is in contrast with 1-based UUID ids.
2021-08-23 17:50:16 +02:00
Alejo Sanchez
27d90f0165 raft: replication test: fix restart_tickers when partitioning
When partitioning, elect_new_leader restarts tickers, so don't
re-restart them in this case.

When leader is dropped and no new leader is specified, restart tickers
before free election.

If no change of leader, restart tickers.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-23 17:50:16 +02:00
Alejo Sanchez
e4262291f2 raft: replication test: partition ranges
Allow specifying ranges within partition to handle large number of
nodes.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-23 17:50:16 +02:00
Alejo Sanchez
56a110d42f raft: replication test: isolate one server
Support disconnection of one server from the rest.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-23 17:50:16 +02:00
Alejo Sanchez
6b3327c753 raft: replication test: move objects out of header
Use a separate cc file for definitions and objects.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-23 17:50:16 +02:00
Alejo Sanchez
cea18e6830 raft: replication test: make dummy command const
Make dummy command const in header.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-23 17:50:16 +02:00
Alejo Sanchez
2db3192ac3 raft: replication test: template clock type
Templatize the clock type.

Use a struct for run_test to work around
https://bugs.llvm.org/show_bug.cgi?id=50345

With help from @kbr-

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-23 17:50:16 +02:00
Alejo Sanchez
cb35588fb1 raft: replication test: tick delta inside raft_cluster
Store tick delta inside raft_cluster.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-23 17:50:16 +02:00
Alejo Sanchez
49cb040037 raft: replication test: style - member initializer
Fix raft_cluster constructor member initializer list.
2021-08-23 17:50:16 +02:00
Alejo Sanchez
6e2ab657b3 raft: replication test: move common code out
Common replication test code moved to header.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-23 17:50:16 +02:00
Alejo Sanchez
a6cd35c512 raft: testing: refactor helper
Move definitions to helper object file.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-23 17:50:16 +02:00