A short new test to verify that in the TagResource operation, the
Tags parameter - specifying which tags to set - is required.
The test passes on both AWS and Alternator - they both produce a
ValidationException in this case (the specific human-readable error
message is different, though, so we don't check it).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211206140541.1157574-1-nyh@scylladb.com>
For each quarantine mode:
Validate sstables to quarantine one of them
and then scrub with the given quarantine mode
and verify the output whwther the quarantined
sstable was scrubbed or not.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When invalid sstables are detected, move them
to the quarantine subdirectory so they won't be
selected for regular compaction.
Refs #7658
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Test that we load quarantined sstables by
creating a dataset, moving a sstable to the quarantine dir,
and then reload the table and verify the dataset.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Define the "staging", "upload", and "snapshots" subdirectory
names as named const expressions in the sstables namespace
rather than relying on their string representation,
that could lead to typo mistakes.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Scylla doesn't support unset values inside UDT.
The old code used to convert `unset` to `null`, which seems incorrect.
There is an extra space in the error message to retain compatability with Cassandra.
Fixes: #9671Closes#9724
* github.com:scylladb/scylla:
cql-pytest: Enable test for UDT with unset values
cql3: Don't allow unset values inside UDT
The test testUDTWithUnsetValues was marked as xfail,
but now the issue has been fixed and we can enable it.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
"
couple of preparatory changes for coroutinization of manager
"
* 'some_compaction_manager_cleanups_v5' of github.com:raphaelsc/scylla:
compaction_manager: move check_for_cleanup into perform_cleanup()
compaction_manager: replace get_total_size by one liner
compaction_manager: make consistent usage of type and name table
compaction_manager: simplify rewrite_sstables()
compaction_manager: restore indentation
Add the summary index and the bound's address to the error message, so
it can be correlated with other trace level logging when investigating a
problem.
Refs: #9446
Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20211202124955.542293-2-bdenes@scylladb.com>
new code in manager adopted name and type table, whereas historical
code still uses name and type column family. let's make it consistent
for newcomers to not get confused.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
We read the compressed file size from a file that was already closed,
resulting in EBADF on my machine. Not sure why it works for everyone
else.
Fix by reading the size using the path.
Closes#9675
This series introduces a new version of a loading_cache class.
The old implementation was susceptible to a "pollution" phenomena when frequently used entry can get evicted by an intensive burst of "used once" entries pushed into the cache.
The new version is going to have a privileged and unprivileged cache sections and there's a new loading_cache template parameter - SectionHitThreshold. The new cache algorithm goes as follows:
* We define 2 dynamic cache sections which total size should not exceed the maximum cache size.
* New cache entry is always added to the "unprivileged" section.
* After a cache entry is read more than SectionHitThreshold times it moves to the second cache section.
* Both sections' entries obey expiration and reload rules in the same way as before this patch.
* When cache entries need to be evicted due to a size restriction "unprivileged" section's
least recently used entries are evicted first.
More details may be found in #8674.
In addition, during a testing another issue was found in the authorized_prepared_statements_cache: #9590.
There is a patch that fixes it as well.
Closes#9708
* github.com:scylladb/scylla:
loading_cache: account unprivileged section evictions
loading_cache: implement a variation of least frequent recently used (LFRU) eviction policy
authorized_prepared_statements_cache: always "touch" a corresponding cache entry when accessed
loading_cache::timestamped::lru_entry: refactoring
loading_cache.hh: rearrange the code (no functional change)
loading_cache: use std::pmr::polymorphic_allocator
This series extends `compaction_manager::stop_ongoing_compaction` so it can be used from the api layer for:
- table::disable_auto_compaction
- compaction_manager::stop_compaction
Fixes#9313Fixes#9695
Test: unit(dev)
Closes#9699
* github.com:scylladb/scylla:
compaction_manager: stop_compaction: wait for ongoing compactions to stop
compaction_manager: stop_ongoing_compactions: log Stopping 0 tasks at debug level
compaction_manager: unify stop_ongoing_compactions implementations
compaction_manager: stop_ongoing_compactions: add compaction_type option
compaction_manager: get_compactions: get a table* parameter
table: disable_auto_compaction: stop ongoing compactions
compaction_manager: make stop_ongoing_compactions public
table: futurize disable_auto_compactions
This PR finally removes the `term` class and replaces it with `expression`.
* There was some trouble with `lwt_cache_id` in `expr::function_call`.
The current code works the following way:
* for each `function_call` inside a `term` that describes a pk restriction, `prepare_context::add_pk_function_call` is called.
* `add_pk_function_call` takes a `::shared_ptr<cql3::functions::function_call>`, sets its `cache_id` and pushes this shared pointer onto a vector of all collected function calls
* Later when some condiition is met we want to clear cache ids of all those collected function calls. To do this we iterate through shared pointers collected in `prepare_context` and clear cache id for each of them.
This doesn't work with `expr::function_call` because it isn't kept inside a shared pointer.
To solve this I put the `lwt_cache_id` inside a shared pointer and then `prepare_context` collects these shared pointers to cache ids.
I also experimented with doing this without any shared pointers, maybe we could just walk through the expression and clear the cache ids ourselves. But the problem is that expressions are copied all the time, we could clear the cache in one place, but forget about a copy. Doing it using shared pointers more closely matches the original behaviour.
The experiment is on the [term2-pr3-backup-altcache](https://github.com/cvybhu/scylla/tree/term2-pr3-backup-altcache) branch
* `shared_ptr<term>` being `nullptr` could mean:
* It represents a cql value `null`
* That there is no value, like `std::nullopt` (for example in `attributes.hh`)
* That it's a mistake, it shouldn't be possible
A good way to distinguish between optional and mistake is to look for `my_term->bind_and_get()`, we then know that it's not an optional value.
* On the other hand `raw_value` cased to bool means:
* `false` - null or unset
* `true` - some value, maybe empty
I ran a simple benchmark on my laptop to see how performance is affected:
```
build/release/test/perf/perf_simple_query --smp 1 -m 1G --operations-per-shard 1000000 --task-quota-ms 10
```
* On master (a21b1fbb2f) I get:
```
176506.60 tps ( 77.0 allocs/op, 12.0 tasks/op, 45831 insns/op)
median 176506.60 tps ( 77.0 allocs/op, 12.0 tasks/op, 45831 insns/op)
median absolute deviation: 0.00
maximum: 176506.60
minimum: 176506.60
```
* On this branch I get:
```
172225.30 tps ( 75.1 allocs/op, 12.1 tasks/op, 46106 insns/op)
median 172225.30 tps ( 75.1 allocs/op, 12.1 tasks/op, 46106 insns/op)
median absolute deviation: 0.00
maximum: 172225.30
minimum: 172225.30
```
Closes#9481
* github.com:scylladb/scylla:
cql3: Remove remaining mentions of term
cql3: Remove term
cql3: Rename prepare_term to prepare_expression
cql3: Make prepare_term return an expression instead of term
cql3: expr: Add size check to evaluate_set
cql3: expr: Add expr::contains_bind_marker
cql3: expr: Rename find_atom to find_binop
cql3: expr: Add find_in_expression
cql3: Remove term in operations
cql3: Remove term in relations
cql3: Remove term in multi_column_restrictions
cql3: Remove term in term_slice, rename to bounds_slice
cql3: expr: Remove term in expression
cql3: expr: Add evaluate_IN_list(expression, options)
cql3: Remove term in column_condition
cql3: Remove term in select_statement
cql3: Remove term in update_statement
cql3: Use internal cql format in insert_prepared_json_statement cache
types: Add map_type_impl::serialize(range of <bytes, bytes>)
cql3: Remove term in cql3/attributes
cql3: expr: Add constant::view() method
cql3: expr: Implement fill_prepare_context(expression)
cql3: expr: add expr::visit that takes a mutable expression
cql3: expr: Add receiver to expr::bind_variable
"
To ensure consistency of schema and topology changes,
Scylla needs a linearizable storage for this data
available at every member of the database cluster.
The series introduces such storage as a service,
available to all Scylla subsystems. Using this service, any other
internal service such as gossip or migrations (schema) could
persist changes to cluster metadata and expect this to be done in
a consistent, linearizable way.
The series uses the built-in Raft library to implement a
dedicated Raft group, running on shard 0, which includes all
members of the cluster (group 0), adds hooks to topology change
events, such as adding or removing nodes of the cluster, to update
group 0 membership, ensures the group is started when the
server boots.
The state machine for the group, i.e. the actual storage
for cluster-wide information still remains a stub. Extending
it to actually persist changes of schema or token ring
is subject to a subsequent series.
Another Raft related service was implemented earlier: Raft Group
Registry. The purpose of the registry is to allow Scylla have an
arbitrary number of groups, each with its own subset of cluster
members and a relevant state machine, sharing a common transport.
Group 0 is one (the first) group among many.
"
* 'raft-group-0-v12' of github.com:scylladb/scylla-dev:
raft: (server) improve tracing
raft: (metrics) fix spelling of waiters_awaken
raft: make forwarding optional
raft: (service) manage Raft configuration during topology changes
raft: (service) break a dependency loop
raft: (discovery) introduce leader discovery state machine
system_keyspace: mark scylla_local table as always-sync commitlog
system_keyspace: persistence for Raft Group 0 id and Raft Server Id
raft: add a test case for adding entries on follower
raft: (server) allow adding entries/modify config on a follower
raft: (test) replace virtual with override in derived class
raft: (server) fix a typo in exception message
raft: (server) implement id() helper
raft: (server) remove apply_dummy_entry()
raft: (test) fix missing initialization in generator.hh
In order to avoid race condition introduced in 9dce1e4 the
index_reader should be closed prior to it's destruction.
This only exposes 4.4 and earlier releases to this specific race.
However, it is always a good idea to first close the index reader
and only then destroy it since it is most likely to be assumed by
all developers that will change the reader index in the future.
Ref #9704 (because on 4.4 and earlier releases are vulnerable).
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Closes#9705
This patch implements a simple variation of LFRU eviction policy:
* We define 2 dynamic cache sections which total size should not exceed the maximum cache size.
* New cache entry is always added to the "unprivileged" section.
* After a cache entry is read more than SectionHitThreshold times it moves to the second cache section.
* Both sections' entries obey expiration and reload rules in the same way as before this patch.
* When cache entries need to be evicted due to a size restriction "unprivileged" section's
least recently used entries are evicted first.
Note:
With a 2 sections cache it's not enough for a new entry to have the latest timestamp
in order not be evicted right after insertion: e.g. if all all other entries
are from the privileged section.
And obviously we want to allow new cache entries to be added to a cache.
Therefore we can no longer first add a new entry and then shrink the cache.
Switching the order of these two operations resolves the culprit.
Fixes#8674
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
We already have tests for the behavior of the "Select" parameter when
querying a base table, but this patch adds additional tests for its
behavior when querying a GSI or a LSI. There are some differences:
Select=ALL_PROJECTED_ATTRIBUTES is not allowed for base tables, but is
allowed - and in fact is the default - for GSI and LSI. Also, GSI may
not allow ALL_ATTRIBUTES (which is the default for base tables) if
only a subset of the attributes were projected.
The new tests xfail because the Select and Projection features have
not yet been implemented in Alternator. They pass in DynamoDB.
After this patch we have (hopefully) complete test coverage of the
Select feature, which will be helpful when we start implementing it.
Refs #5058 (Select)
Refs #5036 (Projection)
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211125100443.746917-1-nyh@scylladb.com>
Add to the existing tests for the Select parameter of the Query and Scan
operations another check: That when Select is ALL_ATTRIBUTES or COUNT,
specifying AttributesToGet or ProjectionExpression is forbidden -
because the combination doesn't make sense.
The expanded test continues to xfail on Alternator (because the Select
parameter isn't yet implemented), and passes on DynamoDB. Strengthening
the tests for this feature will be helpful when we decide to implement it.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211125074128.741677-1-nyh@scylladb.com>
Most of the machinery was already implemented since it was used when
jumping between clustering ranges of a query slice. We need only perform
one additional thing when performing an index skip during
fast-forwarding: reset the stored range tombstone in the consumer (which
may only be stored in fast-forwarding mode, so it didn't matter that it
wasn't reset earlier). Comments were added to explain the details.
As a preparation for the change, we extend the sstable reversing reader
random schema test with a fast-forwarding test and include some minor
fixes.
Fixes#9427.
Closes#9484
* github.com:scylladb/scylla:
query-request: add comment about clustering ranges with non-full prefix key bounds
sstables: mx: enable position fast-forwarding in reverse mode
test: sstable_conforms_to_mutation_source_test: extend `test_sstable_reversing_reader_random_schema` with fast-forwarding
test: sstable_conforms_to_mutation_source_test: fix `vector::erase` call
test: mutation_source_test: extract `forwardable_reader_to_mutation` function
test: random_schema: fix clustering column printing in `random_schema::cql`
When memtable contains both mutations and tombstones that delete them,
the output flushed to sstables contains both mutations. Inserting a
compacting reader results in writing smaller sstables and saves
compaction work later.
Performance tests of this change have shown a regression in a common
case where there are no deletes. A heuristic is employed to skip
compaction unless there are tombstones in the memtable to minimise
the impact of that issue.
The test would check whether the forward and reverse readers returned
consistent results when created in non-forwarding mode with slicing.
Do the same but using fast-forwarding instead of slicing.
To do this we require a vector of `position_range`s. We also need a
vector of `clustering_range`s for the existing test. We modify the
existing `random_ranges` function to return `position_range`s instead of
`clustering_range`s since `position_range`s are easier to reason about,
especially when we consider non-full clustering key prefixes. A function
is introduced to convert a `position_range` to a `clustering_range` for
the existing test.
This series hardens compaction_manager::remove by:
- add debug logging around task execution and stopping.
- access compaction_state as lw_shared_ptr rather than via a raw pointer.
- with that, detach it from `_compaction_state` in `compaction_manager::remove` right away, to prevent further use of it while compactions are stopped.
- added write_lock in `remove` to make sure the lock is not held by any stray task.
Test: unit(dev), sstable_compaction_test(debug)
Dtest: alternator_tests.py:AlternatorTest.test_slow_query_logging (debug)
Closes#9636
* github.com:scylladb/scylla:
compaction_manager: add compaction_state when table is constructed
compaction_manager: remove: fixup indentation
compaction_manager: remove: detach compaction_state before stopping ongoing compactions
compaction_manager: remove: serialize stop_ongoing_compactions and gate.close
compaction_manager: task: keep a reference on compaction_state
test: sstable_compaction_test: incremental_compaction_data_resurrection_test: stop table before it's destroyed.
test: sstable_utils: compact_sstables: deregister compaction also on error path
test: sstable_compaction_test: partial_sstable_run_filtered_out_test: deregiser_compaction also on error path
test: compaction_manager_test: add debug logging to register/deregister compaction
test: compaction_manager_test: deregister_compaction: erase by iterator
test: compaction_manager_test: move methods out of line
compaction_manager: compaction_state: use counter for compaction_disabled
compaction_manager: task: delete move and copy constructors
compaction_manager: add per-task debug log messages
compaction_manager: stop_ongoing_compactions: log number of tasks to stop
In this patch series we add an implementation of an
expiration service to Alternator, which periodically scans the data in
the table, looking for expired items and deleting them.
We also continue to improve the TTL test suite to cover additional
corner cases discovered during the development of the code.
This implementation is good enough to make all existing tests but one,
plus a few new ones, pass, but is still a very partial and inefficient
implementation littered with FIXMEs throughout the code. Among other
things, this initial implementation doesn't do anything reasonable about pacing of
the scan or about multiple tables, it scans entire items instead of only the
needed parts, and because each shard "owns" a different subset of the
token ranges, if a node goes down, partitions which it "owns" will not
get expired.
The current tests cannot expose these problems, so we will need to develop
additional tests for them.
Because this implementation is very partial, the Alternator TTL continues
to remain "experimental", cannot be used without explicitly enabling this
experimental feature, and must not be used for any important deployment.
Refs #5060 but doesn't close the issue (let's not close it until we have a
reasonably complete implementation - not this partial one).
Closes#9624
* github.com:scylladb/scylla:
alternator: fix TTL expiration scanner's handling of floating point
test/alternator: add TTL test for more data
test/alternator: remove "xfail" tag from passing tests in test_ttl.py
test/alternator: make test_ttl.py tests fast on Alternator
alternator: initial implmentation of TTL expiration service
alternator: add another unwrap_number() variant
alternator: add find_tag() function
test/alternator: test another corner case of TTL setting
test/alternator: test TTL expiration for table with sort key
test/alternator: improve basic test for TTL expiration
test/alternator: extract is_aws() function
"
Reverse queries has to use the reverse schema (query schema) for the
read itself but the table schema for the result building, according to
the established interface with the coordinator (half-reverse format).
Range scans were using the query schema for both, which produced
un-parseable reconcilable results for mutation range scans.
This series fixes this and adds unit tests to cover this previously
uncovered area.
"
Fixes#9673.
* 'reverse-range-scan-test/v1' of https://github.com/denesb/scylla:
test/boost/multishard_mutation_query_test: add reverse read test
test/boost/multishard_mutation_query_test: add test for combinations of limits, paging and stateful
test/boost/multishard_mutation_query_test: generalize read_partitions_with_paged_scan()
test/boost/multishard_mutation_query_test: add read_all_partitions_one_by_one() overload with slice
multishard_mutation_query: fix reverse scans
partition_slice: init all fields in copy ctor
partition_slice: operator<<: print the entire partition row limit
partition_slice_builder: add with_partition_row_limit()
"
This set covers simple but diverse cases:
- cache hitrace calculator
- repair
- system keyspace (virtual table)
- dht code
- transport event notifier
All the places just require straightforward arguments passing.
And a reparation in transport -- event notifier needs a backref
to the owning server.
Remaining after this set is the snitch<->gossiper interaction
and the cache hitrate app state update from table code.
tests: unit(dev)
"
* 'br-unglobal-gossiper-cont' of https://github.com/xemul/scylla:
transport: Use server gossiper in event notifier
transport: Keep backreference from event_notifier
transport: Keep gossiper on server
dht: Pass gossiper to range_streamer::add_ranges
dht: Pass gossiper argument to bootstrap
system_keyspace: Keep gossiper on cluster_status_table
code: Carry gossiper down to virtual tables creation
repair: Use local gossiper reference
cache_hitrate_calculator: Keep reference on gossiper
The expiration-time attribute used by Alternator's TTL feature has a
numeric type, meaning that it may be a floating point number - not just
an integer, and implemented as big_decimal which has a separate integer
mantissa and exponent. Our code which checked expiration incorrectly
looked only at the mantissa - resulting in incorrect handling of
expiration times which have a fractional part - 123.4 was treated as
1234 instead of 123.
This patch fixes the big_decimal handling in the expiration checking,
and also adds to the test test_ttl.py::test_ttl_expiration check also
for non-integer floating point as well as one with an exponent. The
new tests pass on DynamoDB, and failed on Alternator before this patch -
and pass with it.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The existing TTL tests use only tiny tables, so don't exercise the
expiration-time scanner's use of paging. So in this patch we add
another test with a much larger table (with 40,000 items).
To verify that this test indeed checks paging, I stopped the scanner's
iteration after one page, and saw that this test starts failing (but
the smaller tests all pass).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Most tests in test_ttl.py now pass, so remove their "xfail" tag.
The only remaining failing test is test_ttl_expiration_streams -
which cannot yet pass because the expiration event is not yet marked.
Note that the fact that almost all tests for Alternator's TTL feature
now pass does not mean the feature is complete. The current implementation
is very partial and inefficient, and only works reasonably in tests on
a single node. The current tests cannot expose these problems, so
we will need to develop additional tests for them. The tests will of
course remain useful to see that as the implementation continues to
improve, none of the tests that already work will break.
The Alternator TTL continues to remain "experimental", cannot be used
without explicitly enabling this experimental feature, and must not be
used for any important deployment.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The tests for the TTL feature in test/alternator/test_ttl.py takes huge
amount of time on DynamoDB - 10 to 30 minutes (!) - because it delays
expiration of items a long time after their intended expiration times.
We intend Scylla's implementation to have a configurable delay for the
expiration scanner, which we will be able to configure to very short
delays for tests. So These tests can be made much faster on Scylla.
So in this patch we change all of the tests to finish much more quickly
on Scylla.
Many of the tests still fail, because the TTL feature is not implemented
yet.
Although after this change all the tests in test_ttl.py complete in
a reasonable amount of time (around 3 seconds each), we still mark
them as "veryslow" and the "--runveryslow" flag is needed to run them.
We should consider changing this in the future, so that these tests will
run as part of our default test suite.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Although it isn't terribly useful, an Alternator user can enable TTL
with an expiration-time attribute set to a *key* attribute. Because
expiration times should be numeric - not other types like strings -
DynamoDB could warn the user when a chosen key attribute hs a non-
numeric type (since key attributes do have fixed types!). But DynamoDB
doesn't warn about this - it simply expires nothing. This test
verifies this that it indeed does this.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The basic test for TTL expiration, test_ttl.py::test_ttl_expiration,
uses a table with only a partition key. Most of the item expiration
logic is exactly the same for tables that also have a sort key, but
the step of *deleting* the item is different, so let's add a test
that verifies that also in this case, the expired item is properly
deleted.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch improves test_ttl.py::test_ttl_expiration in two ways:
First, it checks yet another case - that items that have the wrong type
for the expiration-time column (e.g., a string) never get expired - even
if that string happens to contain a number that looks like an expiration
time.
Second, instead of the huge 15-minute duration for this test, the
test now has a configurable duration; We still need to use a very long
duration on AWS, but in Scylla we expect to be able to configure the
TTL scan frequency, and can finish this test in just a few seconds!
We already have experimental code which makes this test pass in just
3 seconds.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Extract a boolean function is_aws() out of the "scylla_only" fixture, so
it can be used in tests for other purposes.
For example, in the next patch the TTL tests will use them to pick
different timeouts on AWS (where TTL expiration have huge many-minute
delays) and on Scylla (which can be configured to have very short delays).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In absence of abort_source or timeouts in Raft API, automatic
bouncing can create too much noise during testing, especially
during network failures. Add an option to disable follower
bouncing feature, since randomized_nemesis_test has its own
bouncing which handles timeouts correctly.
Optionally disable forwarding in basic_generator_test.
Operations of adding or removing a node to Raft configuration
are made idempotent: they do nothing if already done, and
they are safe to resume after a failure.
However, since topology changes are not transactional, if a
bootstrap or removal procedure fails midway, Raft group 0
configuration may go out of sync with topology state as seen by
gossip.
In future we must change gossip to avoid making any persistent
changes to the cluster: all changes to persistent topology state
will be done exclusively through Raft Group 0.
Specifically, instead of persisting the tokens by advertising
them through gossip, the bootstrap will commit a change to a system
table using Raft group 0. nodetool will switch from looking at
gossip-managed tables to consulting with Raft Group 0 configuration
or Raft-managed tables.
Once this transformation is done, naturally, adding a node to Raft
configuration (perhaps as a non-voting member at first) will become the
first persistent change to ring state applied when a node joins;
removing a node from the Raft Group 0 configuration will become the last
action when removing a node.
Until this is done, do our best to avoid a cluster state when
a removed node or a node which addition failed is stuck in Raft
configuration, but the node is no longer present in gossip-managed
system tables. In other words, keep the gossip the primary source of
truth. For this purpose, carefully chose the timing when we
join and leave Raft group 0:
Join the Raft group 0 only after we've advertised our tokens, so the
cluster is aware of this node, it's visible in nodetool status,
but before node state jumps to "normal", i.e. before it accepts
queries. Since the operation is idempotent, invoke it on each
restart.
Remove the node from Group 0 *before* its tokens are removed
from gossip-managed system tables. This guarantees
that if removal from Raft group 0 fails for whatever reason,
the node stays in the ring, so nodetool removenode and
friends are re-tried.
Add tracing.
Introduce a special state machine used to to find
a leader of an existing Raft cluster or create
a new cluster.
This state machine should be used when a new
Scylla node has no persisted Raft Group 0 configuration.
The algorithm is initialized with a list of seed
IP addresses, IP address of this server, and,
this server's Raft server id.
The IP addresses are used to construct an initial list of peers.
Then, the algorithm tries to contact each peer (excluding self) from
its peer list and share the peer list with this peer, as well as
get the peer's peer list. If this peer is already part of
some Raft cluster, this information is also shared. On a response
from a peer, the current peer's peer list is updated. The
algorithm stops when all peers have exchanged peer information or
one of the peers responds with id of a Raft group and Raft
server address of the group leader.
(If any of the peers fails to respond, the algorithm re-tries
ad infinitum with a timeout).
More formally, the algorithm stops when one of the following is true:
- it finds an instance with initialized Raft Group 0, with a leader
- all the peers have been contacted, and this server's
Raft server id is the smallest among all contacted peers.
Implement an RPC to forward add_entry calls from the follower
to leader. Bounce & retry in case of not_a_leader.
Do not retry in case of uncertainty - this can lead to adding
duplicate entries.
The feature is added to core Raft since it's needed by
all current clients - both topology and schema changes.
When forwarding an entry to a remote leader we may get back
a term/index pair that conflicts (has the same index, but is with
a higher term) with a local entry we're still waiting on.
This can happen, e.g. because there was a leader change and the
log was truncated, but we still haven't got the append_entries
RPC from the new leader, still haven't truncated the log locally,
still haven't aborted all the local waits for truncated entries.
Only remove the offending entry from the wait list and abort it.
There may be entries labeled with an older term to the right (with
higher commit index) of the conflicting entry. However, finding them,
would require a linear scan. If we allow it, we may end up doing this
linear scan for *every* conflicting entry during the transition
period, which brings us to N^2 complexity of this step. At the
same time, as soon as append_entries that commits a higher-term
entry with the same index reaches the follower, the waits
for the respective truncated entry will be aborted anyway (see
notify_waiters() which sets dropped_entry exception), so the scan
is unnecessary.
Similarly to being able to add entries, allow to modify
Raft group configuration on a follower. The implementation
works the same way as adding entries - forwards the command
to the leader.
Now that add_entry() or modify_config never throws not_a_leader,
it's more likely to throw timed_out_error, e.g. in case the
network is partitioned. Previously it was only possible due to a
semaphore wait timeout, and this scenario was not tested.
Handle timed_out_error on RPC level to let the existing tests
(specifically the randomized nemesis test) pass.
A missing initialization in poll_timeout of class interpreter
could manifest itself as a sporadically failing
randomized_nemesis_test.
The test would prematurely run out of allowed limit of virtual
clock ticks.
One of the tables needs gossiper and uses global one. This patch
prepares the fix by patching the main -> register_virtual_tables
stack with the gossiper reference.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This patch adds two more tests for the unimplemented Select=COUNT
feature (which asks to only count queried items and not return the
actual items). Because this feature has not yet been implemented in
Alternator (Refs #5058), the new tests xfail. They pass on DynamoDB.
The two tests added here are for the interaction of the Select=COUNT
feature with filters - in one of the two supported syntaxes (QueryFilter
and FilterExpression). We want to verify that even though the user
doesn't need the content of the items (since only the counts were
requested), they are still retrieved from disk as needed for doing
proper filtering - but not returned.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211124225429.739744-1-nyh@scylladb.com>
Query time must be fetched after populate. If compaction is executed
during populate it may be executed with timestamp later than query_time.
This would cause the test expected compaction and compaction during
populate to be executed at different time points producing different
results. The result would be sporadic test failures depending on relative
timing of those operations. If no other mutations happen after populate,
and query_time is later than the compaction time during population, we're
guaranteed to have the same results.
Message-Id: <20211123134808.105068-1-mikolaj.sieluzycki@scylladb.com>