Currently, mutation query on replica side will not respond with a result which doesn't have at least one live row. This causes problems if there is a lot of dead rows or partitions before we reach a live row, which stem from the fact that resulting reconcilable_result will be large:
1. Large allocations. Serialization of reconcilable_result causes large allocations for storing result rows in std::deque
2. Reactor stalls. Serialization of reconcilable_result on the replica side and on the coordinator side causes reactor stalls. This impacts not only the query at hand. For 1M dead rows, freezing takes 130ms, unfreezing takes 500ms. Coordinator does multiple freezes and unfreezes. The reactor stall on the coordinator side is >5s
3. Too large repair mutations. If reconciliation works on large pages, repair may fail due to too large mutation size. 1M dead rows is already too much: Refs https://github.com/scylladb/scylladb/issues/9111.
This patch fixes all of the above by making mutation reads respect the memory accounter's limit for the page size, even for dead rows.
This patch also addresses the problem of client-side timeouts during paging. Reconciling queries processing long strings of tombstones will now properly page tombstones,like regular queries do.
My testing shows that this solution even increases efficiency. I tested with a cluster of 2 nodes, and a table of RF=2. The data layout was as follows (1 partition):
* Node1: 1 live row, 1M dead rows
* Node2: 1M dead rows, 1 live row
This was designed to trigger reconciliation right from the very start of the query.
Before:
```
Running query (node2, CL=ONE, cold cache)
Query done, duration: 140.0633503ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 66.7195275ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 873.5400742ms, pages: 2, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]
```
After:
```
Running query (node2, CL=ONE, cold cache)
Query done, duration: 136.9035122ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 69.5286021ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 162.6239498ms, pages: 100, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]
```
Non-reconciling queries have almost identical duration (1 few ms changes can be observed between runs). Note how in the after case, the reconciling read also produces 100 pages, vs. just 2 pages in the before case, leading to a much lower duration (less than 1/4 of the before).
Refs https://github.com/scylladb/scylladb/issues/7929
Refs https://github.com/scylladb/scylladb/issues/3672
Refs https://github.com/scylladb/scylladb/issues/7933
Fixes https://github.com/scylladb/scylladb/issues/9111Closesscylladb/scylladb#15414
* github.com:scylladb/scylladb:
test/topology_custom: add test_read_repair.py
replica/mutation_dump: detect end-of-page in range-scans
tools/scylla-sstable: write: abort parser thread if writing fails
test/pylib: add REST methods to get node exe and workdir paths
test/pylib/rest_client: add load_new_sstables, keyspace_{flush,compaction}
service/storage_proxy: add trace points for the actual read executor type
service/storage_proxy: add trace points for read-repair
storage_proxy: Add more trace-level logging to read-repair
database: Fix accounting of small partitions in mutation query
database, storage_proxy: Reconcile pages with no live rows incrementally
For JSON objects represented as map<ascii, int>, don't treat ASCII keys
as a nested JSON string. We were doing that prior to the patch, which
led to parsing errors.
Included the error offset where JSON parsing failed for
rjson::parse related functions to help identify parsing errors
better.
Fixes: #7949
Signed-off-by: Michael Huang <michaelhly@gmail.com>
Closesscylladb/scylladb#15499
Consider the following code snippet:
```c++
future<> foo() {
semaphore.consume(1024);
}
future<> bar() {
return _allocating_section([&] {
foo();
});
}
```
If the consumed memory triggers the OOM kill limit, the semaphore will throw `std::bad_alloc`. The allocating section will catch this, bump std reserves and retry the lambda. Bumping the reserves will not do anything to prevent the next call to `consume()` from triggering the kill limit. So this cycle will repeat until std reserves are so large that ensuring the reserve fails. At this point LSA gives up and re-throws the `std::bad_alloc`. Beyond the useless time spent on code that is doomed to fail, this also results in expensive LSA compaction and eviction of the cache (while trying to ensure reserves).
Prevent this situation by throwing a distinct exception type which is derived from `std::bad_alloc`. Allocating section will not retry on seeing this exception.
A test reproducing the bug is also added.
Fixes: #15278Closesscylladb/scylladb#15581
* github.com:scylladb/scylladb:
test/boost/row_cache_test: add test_cache_reader_semaphore_oom_kill
utils/logalloc: handle utils::memory_limit_reached in with_reclaiming_disabled()
reader_concurrency_semaphore: use utils::memory_limit_reached exception
utils: add memory_limit_reached exception
We want to make table::_truncated_at optional, so that in
get_truncation_time we can assert that it is initialized.
For existing tables this initialisation will happen in
load_truncation_times function, and for new tables we
want to initialize it in add_column_family like we do
with mark_ready_for_writes.
Now add_column_family function has parameter 'readonly', which is
set by the callers to false if we are creating a fresh new table
and not loading it from sstables. In this commit we rename this
parameter to is_new and invert the passed values.
This will allow us in the next commit to initialize _truncated_at field
for new tables.
All `sstable_set_impl` subclasses/implementations already keep a `schema_ptr` so we can make `sstable_set_impl::make_incremental_selector` function return both the selector and the schema that's being used by it.
That way, we can use the returned schema in `sstable_set::make_incremental_selector` function instead of `sstable_set::_schema` field which makes the field unused and allows us to remove it alltogether and reduce the memory footprint of `sstable_set` objects.
Closesscylladb/scylladb#15570
* github.com:scylladb/scylladb:
sstable_set: Remove unused _schema field
sstable_set_impl: Return also schema from make_incremental_selector
Also rephase the messages a bit so they are more uniform.
The goal of this change is to make semaphore mismatches easier to
diagnose, by including the table name and the permit name in the
printout.
While at it, add a test for semaphore mismatch, it didn't have one.
Refs: #15485Closesscylladb/scylladb#15508
Tracing is one of two global service left out there with its starting and stopping being pretty hairy. In order to de-globalize it and keep its start-stop under control the existing start-stop sequence is worth cleaning. This PR
* removes create_ , start_ and stop_ wrappers to un-hide the global tracing_instance thing
* renames tracing::stop() to shutdown() as it's in fact shutdown
* coroutinizes start/shutdown/stop while at it
Squeezed parts from #14156 that don't reorder start-stop calls
Closesscylladb/scylladb#15611
* github.com:scylladb/scylladb:
main: Capture local tracing reference to stop tracing
tracing: Pack testing code
tracing: Remove stop_tracing() wrapper
tracing: Remove start_tracing() wrapper
tracing: Remove create_tracing() wrapper
tracing: Make shutdown() re-entrable
tracing: Coroutinize start/shutdown/stop
tracing: Rename helper's stop() to shutdown()
The test does (among other things) the following:
1. Create a cache reader with buffer of size 1 and fill the buffer.
2. Update the cache.
3. Check that the reader produces the first mutation as seen before
the update (because the buffer fill should have snapshotted the first
mutation), and produces other mutation as seen after the update.
However, the test is not guaranteed to stop after the update succeeds.
Even during a successful update, an allocation might have failed
(and been retried by an allocation_section), which will cause the
body of with_allocation_failures to run again. On subsequent runs
the last check (the "3." above) fails, because the first mutation
is snapshotted already with the new version.
Fix that.
Closesscylladb/scylladb#15634
The following new commands are implemented:
* disablebackup
* disablebinary
* disablegossip
* enablebackup
* enablebinary
* enablegossip
* gettraceprobability
* help
* settraceprobability
* statusbackup
* statusbinary
* statusgossip
* version
All are associated with tests. All tests (both old and new) pass with both the scylla-native and the cassandra nodetool implementation.
Refs: https://github.com/scylladb/scylladb/issues/15588Closesscylladb/scylladb#15593
* github.com:scylladb/scylladb:
tools/scylla-nodetool: implement help operation
tools/scylla-nodetool: implement the traceprobability commands
tools/scylla-nodetool: implement the gossip commands
tools/scylla-nodetool: implement the binary commands
tools/scylla-nodetool: implement backup related commands
tools/scylla-nodetool: implement version command
test/nodetool: introduce utils.check_nodetool_fails_with()
test/nodetool: return stdout of nodetool invokation
test/nodetool/rest_api_mock.py: fix request param matching
tools/scylla-nodetool: compact: remove --partition argument
tools/scylla-nodetool: scylla_rest_client: add support delete method
tools/scylla-nodetool: get rid of check_json_type()
tools/scylla-nodetool: log more details for failed requests
tools/scylla-*: use operation_option for positional options
tools/utils: add support for operation aliases
Checking that nodetool fails with a given message turned out to be a
common pattern, so extract the logic for checking this into a method of
its own. Refactor the existing tests to use it, instead of the
hand-coded equivalent.
There's a finally-chain stopping tracing out there, now it can just use
the deferred stop call and that's it
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The test needs to call flush-keyspace API endpoint and currently it does it by hand. Not very convenient.
Also in the future there will be the need for _background_ API kicking, the currently used requests package cannot do it, while pylib REST API can
Closesscylladb/scylladb#15565
* github.com:scylladb/scylladb:
test/object_store: Use REST client from pylib
test/pylib: Add flush_keyspace() method to rest client
test/object_store: Wrap yielded managed cluster
Use NullCompactionStrategy for the test_table fixture
rather than using the `no_autocompaction_context`.
Besides being simpler, as regular compaction just comes in
the way for all tests that use `SELECT MUTATION_FRAGMENTS`
The latter would be problematic when we start run cql-pytest
test cases in parallel rather than in serial since it
will inadvertantly affect other test cases.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#15574
Add a REST API to reload Raft topology state without having to restart a node and use it in `test_fence_hints`. Restarting the node has undesired side effects which cause test flakiness; more details provided in commit messages.
Refactor the test a bit while at it.
Fixes: #15285Closesscylladb/scylladb#15523
* github.com:scylladb/scylladb:
test: test_fencing.py: enable hints_manager=trace logs in `test_fence_hints`
test: test_fencing.py: reload topology through REST API in `test_fence_hints`
test: refactor test_fencing.py
api: storage_service: add REST API to reload topology state
Allow disabling auto-compaction for given table(s)
using either the ks.table syntax or ks:table (as the
api suggests).
The first syntax would likely be more common since
the test tables we automatically create are named
as test_keyspace.test_table so we can pass that name
to `no_autocompaction_context` as is.
test_tools.system_scylla_local_sstable_prepared was
modified to disable auto-compaction only only
the `system.scylla_local` table rather than
the whole `system` keyspace, since it only relies
on this table. Plus, it helps test this change :)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#15575
Enable TRACE level logging on the server that's supposed to send the
hints. Should make it easier to debug failures in the future, if any
happen again.
Restarting a node in order to reload topology may have side effects that
lead to test flakiness. While the node is shutting down, it gives up
leadership. Before it finishes shutting down, another node may become
Raft group 0 leader, then topology coordinator, then send a topology
command, triggering topology state reload on the shutting down node,
causing its topology version to get updated, allowing it to send a
successful hint before it shuts down and restarts. After it restarts, no
more hints will be sent, so the metrics condition we're waiting for (for
a hint to be sent) will never become true (metrics are not persisted
between restarts).
Instead of restarting, reload topology state through the new REST API.
This also makes the test a bit faster.
Fixes#15285
- use `manager.get_cql()` to silence mypy (`manager.cql` is `Optional`)
- extract `metrics.lines_by_prefix('scylla_hints_manager_')` to a helper
function
- when waiting for conditions on metrics, split the condition into
safety and liveness part, and fail early if the safety part does not
hold
- in `exactly_one_hint` send, don't check that `send_errors_metric` is
`0` (it won't be after the next commit)
This PR implements a new procedure for joining nodes to group 0, based on the description in the "Cluster features on Raft (v2)" document. This is a continuation of the previous PRs related to cluster features on raft (https://github.com/scylladb/scylladb/pull/14722, https://github.com/scylladb/scylladb/pull/14232), and the last piece necessary to replace cluster feature checks in gossip.
Current implementation relies on gossip shadow round to fetch the set of enabled features, determine whether the node supports all of the enabled features, and joins only if it is safe. As we are moving management of cluster features to group 0, we encounter a problem: the contents of group 0 itself may depend on features, hence it is not safe to join it unless we perform the feature check which depends on information in group 0. Hence, we have a dependency cycle.
In order to solve this problem, the algorithm for joining group 0 is modified, and verification of features and other parameters is offloaded to an existing node in group 0. Instead of directly asking the discovery leader to unconditionally add the node to the configuration with `GROUP0_MODIFY_CONFIG`, two different RPCs are added: `JOIN_NODE_REQUEST` and `JOIN_NODE_RESPONSE`. The main idea is as follows:
- The new node sends `JOIN_NODE_REQUEST` to the discovery leader. It sends a bunch of information describing the node, including supported cluster features. The discovery leader verifies some of the parameters and adds the node in the `none` state to `system.topology`.
- The topology coordinator picks up the request for the node to be joined (i.e. the node in `none` state), verifies its properties - including cluster features - and then:
- If the node is accepted, the coordinator transitions it to `boostrap`/`replace` state and transitions the topology to `join_group0` state. The node is added to group 0 and then `JOIN_NODE_RESPONSE` is sent to it with information that the node was accepted.
- Otherwise, the node is moved to `left` state, told by the coordinator via `JOIN_NODE_RESPONSE` that it was rejected and it shuts down.
The procedure is not retryable - if a node fails to do it from start to end and crashes in between, it will not be allowed to retry it with the same host_id - `JOIN_NODE_REQUEST` will fail. The data directory must be cleared before attempting to add it again (so that a new host_id is generated).
More details about the procedure and the RPC are described in `topology-over-raft.md`.
Fixes: #15152Closesscylladb/scylladb#15196
* github.com:scylladb/scylladb:
tests: mark test_blocked_bootstrap as skipped
storage_service: do not check features in shadow round
storage_service: remove raft_{boostrap,replace}
topology_coordinator: relax the check in enable_features
raft_group0: insert replaced node info before server setup
storage_service: use join node rpc to join the cluster
topology_coordinator: handle joining nodes
topology_state_machine: add join_group0 state
storage_service: add join node RPC handlers
raft: expose current_leader in raft::server
storage_service: extract wait_for_live_nodes_timeout constant
raft_group0: abstract out node joining handshake
storage_service: pass raft_topology_change_enabled on rpc init
rpc: add new join handshake verbs
docs: document the new join procedure
topology_state_machine: add supported_features to replica_state
storage_service: check destination host ID in raft verbs
group_state_machine: take reference to raft address map
raft_group0: expose joined_group0
Test cases kick scylla to force keyspaces flush (to have the objects on
object store) by hand. Equip the wrapped cluster object with the REST
API class instance for convenience
The assertion for 200 return status code is dropped, REST client does it
behind the scenes
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Test cases use temporary cluster object which is, in fact, cql cluster.
In the future there will be the need to perform more actions on it
rather than just querying it with cql client, so wrap the cluster with
an extendable object
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Check that the cache reader reacts correctly to semaphore's OOM kill
attempt, letting the read to fail, instead of going berserk, trying to
reserve more-and-more memory, until the reserve cannot be satisfied.
With the new procedure to join nodes, testing the scenario in
`test_blocked_bootstrap` becomes very tricky. To recap, the test does
the following:
- Starts a 3-node cluster,
- Shuts down node 1,
- Tries to replace node 1 with node 4, but an error injection is
triggered which causes node 4 to fail after it joins group 0. Note
that pre-JOIN_NODE handshake, this would only result in node 4 being
added to group 0 config, but no modification to the group 0 state
itself is being done - the joining node is supposed to write a request
to join.
- Tries to replace node 1 again with node 5, which should succeed.
The bug that this regression test was supposed to check for was that
node 5 would try to resolve all IPs of nodes added to group 0 config.
Because node 4 shuts down before advertising itself in gossip, the node
5 would get stuck.
The new procedure to join group 0 complicates the situation because a
request to join is written first to group 0 and only then the topology
coordinator modifies the group 0 config. It is possible to add an error
injection to the topology coordinator code so that it doesn't change the
group 0 state and proceeds with bootstrapping the node, but it will only
get stuck trying to add the node. If node 5 tries to join in the
meantime, the topology coordinator may switch to it and try to bootstrap
it instead, but this is basically a 50% chance because it depends on the
order of node 4 and node 5's host IDs in the topology_state_machine
struct.
It should be possible to fix the test with error recovery, but until
then it is marked as skipped.
When presented with queries that use the same named bind variables twice, like this one:
```cql
SELECT p FROM table WHERE p = :x AND c = :x
```
Scylla generated empty `partition_key_bind_indexes` (`pk_indexes`).
`pk_indexes` tell the driver which bind variables it should use to calculate the partition token for a query. Without it, the driver is unable to determine the token and it will send the query to a random node.
Scylla should generate pk_indexes which tell the driver that it can use bind variable with `bind_index = 0` to calculate the partition token for this query.
The problem was that `_target_columns` keep only a single target_column for each bind variable.
In the example above `:x` is compared with both `p` and `c`, but `_target_columns` would contain only one of them, and Scylla wasn't able to tell that this bind variable is compared with a partition key column.
To fix it, let's replace `_target_columns` with `_targets`. `_targets` keeps all comparisons
between bind variables and other expressions, so none of them will be forgotten/overwritten.
A `cql-pytest` reproducer is added.
I also added some comments in `prepare_context.hh/cc` to make it easier to read.
Fixes: https://github.com/scylladb/scylladb/issues/15374Closesscylladb/scylladb#15526
* github.com:scylladb/scylladb:
cql-pytest/test-prepare: remove xfail marker from *pk_indexes_duplicate_named_variables
cql3/prepare_context: fix generating pk_indexes for duplicate named bind variables
cql3: improve readability of prepare_context
cql-pytest: test generation of pk indexes during PREPARE
There's a dedicated forward_service::shutdown() method that's defer-scheduled in main for very early invocation. That's not nice, the fwd service start-shutdown-stop sequence can be made "canonical" by moving the shutting down code into abort source subscription. Similar thing was done for view updates generator in 3b95f4f107
refs: #2737
refs: #4384Closesscylladb/scylladb#15545
* github.com:scylladb/scylladb:
forward_service: Remove .shutdown() method
forward_service: Set _shutdown in abort-source subscription
forward_service: Add abort_source to constructor
When doing a SELECT CAST(b AS int), Cassandra returns a column named
cast(b as int). Currently, Scylla uses a different name -
system.castasint(b). For Cassandra compatibility, we should switch to
the same name.
fixes#14508Closesscylladb/scylladb#14800
Now everything is prepared for the switch, let's do it.
Now let's wait for ICS to enjoy the set of changes.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That's needed for upcoming changes that will allow ICS to efficiently
retrieve sstable runs.
Next patch will remove candidates from compaction_strategy's interface
to retrieve candidates using this one instead.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
sstable_run may reject insertion of a sstable if it's going
to break the disjoint invariant of the run, but it's important
that the caller is aware of it, so it can act on it like
generating a new run id for the sstable so it can be inserted
in another run. the tag is important to avoid unknown
problems in this area.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Users of all_sstable_runs() don't want to mutate the runs, but rather
work with their content. So let's avoid copy and make the intention
explicit with the new frozen_sstable_run used as return type
for the interface.
This will guarantee that ICS will be able to fetch uncompacting
runs efficiently.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This interface selects all runs that store at least one of the
sstables in the vector.
But that's very fragile, to the point that even ICS had to
stop using it. A better interface is to return all runs
managed by the set and allow compaction manager to do its
filtering.
We want to use it in ICS to avoid the overhead of rebuilding
sstable runs which may be expensive as sorting is performed
to guarantee the disjoint invariant.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Issue #15374 has been fixed, so these tests can be enabled.
Duplicate bind variable names are now handled correctly.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Add some tests that test whether `pk indexes` are generated correctly.
When a driver asks to prepare a statement, Scylla's response includes
the metadata for this prepared statement.
In this metadata there's `pk indexes`, which tells the driver which
bind variable values it should use to calculate the partition token.
For a query like:
SELECT * FROM t WHERE p2 = ? AND p1 = ? AND p3 = ?
The correct pk_indexes would be [1, 0, 2], which means
"To calculate the token calculate Hash(bind_vars[1] | bind_vars[0] | bind_vars[2])".
More information is available in the specification:
1959502d8b/doc/native_protocol_v4.spec (L699-L707)
Two tests are marked as xfail because of #15374 - Scylla doesn't correctly handle using the same
named variable in multiple places. This will be fixed soon.
I couldn't find a good place for these tests, so I created a new file - test_prepare.py.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>