Currently, a mutation query on the replica side will not respond with a result that doesn't contain at least one live row. This causes problems when there are many dead rows or partitions before the first live row, because the resulting reconcilable_result will be large:
1. Large allocations. Serializing the reconcilable_result causes large allocations for storing result rows in a std::deque.
2. Reactor stalls. Serializing the reconcilable_result on both the replica side and the coordinator side causes reactor stalls, which impact not only the query at hand. For 1M dead rows, freezing takes 130ms and unfreezing takes 500ms; the coordinator does multiple freezes and unfreezes, so the reactor stall on the coordinator side is >5s.
3. Too-large repair mutations. If reconciliation works on large pages, repair may fail because the mutation size is too large; 1M dead rows is already too much. Refs https://github.com/scylladb/scylladb/issues/9111.
This patch fixes all of the above by making mutation reads respect the memory accounter's limit for the page size, even for dead rows.
This patch also addresses the problem of client-side timeouts during paging. Reconciling queries processing long strings of tombstones will now properly page tombstones, like regular queries do.
My testing shows that this solution even increases efficiency. I tested with a cluster of 2 nodes, and a table of RF=2. The data layout was as follows (1 partition):
* Node1: 1 live row, 1M dead rows
* Node2: 1M dead rows, 1 live row
This was designed to trigger reconciliation right from the very start of the query.
Before:
```
Running query (node2, CL=ONE, cold cache)
Query done, duration: 140.0633503ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 66.7195275ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 873.5400742ms, pages: 2, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]
```
After:
```
Running query (node2, CL=ONE, cold cache)
Query done, duration: 136.9035122ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 69.5286021ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 162.6239498ms, pages: 100, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]
```
Non-reconciling queries have almost identical durations (a few-ms variation can be observed between runs). Note how in the after case the reconciling read also produces 100 pages, vs. just 2 pages in the before case, leading to a much lower duration (less than a quarter of the before).
Refs https://github.com/scylladb/scylladb/issues/7929
Refs https://github.com/scylladb/scylladb/issues/3672
Refs https://github.com/scylladb/scylladb/issues/7933
Fixes https://github.com/scylladb/scylladb/issues/9111
Closes scylladb/scylladb#15414
* github.com:scylladb/scylladb:
test/topology_custom: add test_read_repair.py
replica/mutation_dump: detect end-of-page in range-scans
tools/scylla-sstable: write: abort parser thread if writing fails
test/pylib: add REST methods to get node exe and workdir paths
test/pylib/rest_client: add load_new_sstables, keyspace_{flush,compaction}
service/storage_proxy: add trace points for the actual read executor type
service/storage_proxy: add trace points for read-repair
storage_proxy: Add more trace-level logging to read-repair
database: Fix accounting of small partitions in mutation query
database, storage_proxy: Reconcile pages with no live rows incrementally
* tools/jmx d107758...8d15342 (2):
> Revert "install-dependencies.sh: do not install weak dependencies"
> install-dependencies.sh: do not install weak dependencies Especially for Java, we really do not need the tens of packages and MBs it adds, just because Java apps can be built and use sound and graphics and whatnot.
For JSON objects represented as map<ascii, int>, don't treat ASCII keys
as a nested JSON string. We were doing that prior to the patch, which
led to parsing errors.
Include the error offset where JSON parsing failed in the rjson::parse-related functions, to help identify parsing errors better.
Fixes: #7949
Signed-off-by: Michael Huang <michaelhly@gmail.com>
Closes scylladb/scylladb#15499
Consider the following code snippet:
```c++
future<> foo() {
    // may throw std::bad_alloc if the OOM kill limit is hit
    semaphore.consume(1024);
    return make_ready_future<>();
}

future<> bar() {
    return _allocating_section([&] {
        return foo();
    });
}
```
If the consumed memory triggers the OOM kill limit, the semaphore will throw `std::bad_alloc`. The allocating section will catch this, bump the reserves and retry the lambda. Bumping the reserves does nothing to prevent the next call to `consume()` from triggering the kill limit, so this cycle repeats until the reserves are so large that ensuring them fails. At this point LSA gives up and re-throws the `std::bad_alloc`. Beyond the useless time spent on code that is doomed to fail, this also results in expensive LSA compaction and eviction of the cache (while trying to ensure the reserves).
Prevent this situation by throwing a distinct exception type derived from `std::bad_alloc`. The allocating section will not retry on seeing this exception.
A test reproducing the bug is also added.
Fixes: #15278
Closes scylladb/scylladb#15581
* github.com:scylladb/scylladb:
test/boost/row_cache_test: add test_cache_reader_semaphore_oom_kill
utils/logalloc: handle utils::memory_limit_reached in with_reclaiming_disabled()
reader_concurrency_semaphore: use utils::memory_limit_reached exception
utils: add memory_limit_reached exception
The descriptor in question is used to parse an sstable's file path and return back the result. Besides the "relevant" info, the parser also extracts the sstable directory and the keyspace and table names. However, (almost) no code needs those strings, and the need to construct a descriptor with them makes some places obscurely use empty strings.
The PR removes the sstable directory, keyspace and table names from the descriptor and, while at it, relaxes the sstable directory code that makes a descriptor out of a real sstable object by (!) parsing its Data file path back.
Closes scylladb/scylladb#15617
* github.com:scylladb/scylladb:
sstables: Make descriptor from sstable without parsing
sstables: Do not keep directory, keyspace and table names on descriptor
sstables: Make tuple inside helper parser method
sstables: Do not use ks.cf pair from descriptor
sstables: Return tuple from parse_path() without ks.cf hints
sstables: Rename make_descriptor() to parse_path()
When loading unshared remote sstable, sstable_directory needs to make a
descriptor out of a real sstable. For that it parses the sstable's Data
component path, which is pretty weird. It's simpler to make the descriptor
out of the sstable itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now no code uses those strings. Even worse -- there are some places that
need to provide them but don't have real values at hand, so they just
hard-code empty strings there (because they are really not used).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This just moves the std::make_tuple() call into internal static path
parsing helper to make the next patch smaller and nicer.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's only one place that needs the ks.cf pair from the parsed descriptor
-- the sstables loader from tools/. This code already has ks.cf from the
tuple returned after parsing and can use it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two path parsers. One of them accepts keyspace and table names
and the other one doesn't. The latter is then supposed to parse the
ks.cf pair from path and put it on the descriptor. This patch makes this
method return ks.cf so that later it will be possible to remove these
strings from the descriptor itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All `sstable_set_impl` subclasses/implementations already keep a `schema_ptr` so we can make `sstable_set_impl::make_incremental_selector` function return both the selector and the schema that's being used by it.
That way, we can use the returned schema in `sstable_set::make_incremental_selector` instead of the `sstable_set::_schema` field, which makes the field unused and allows us to remove it altogether, reducing the memory footprint of `sstable_set` objects.
Closes scylladb/scylladb#15570
* github.com:scylladb/scylladb:
sstable_set: Remove unused _schema field
sstable_set_impl: Return also schema from make_incremental_selector
When joining the cluster in raft topology mode, the new node asks some
existing node in the cluster to put its information to the
`system.topology` table. Later, the topology coordinator is supposed to
contact the joining node back, telling it that it was added to group 0
and accepted, or rejected. Due to the fact that the topology coordinator
might not manage to successfully contact the joining node, in order not
to get stuck it might decide to give up and move the node to left state
and forget about it (this does not always happen as of now, but will in the
future). Because of that, the joining node must use a timeout when
waiting for a response because it's not guaranteed that it will ever
receive it.
There is an additional complication: the topology coordinator might be
busy and not notice the request to join for a long time. For example, it
might be migrating tablets or joining other nodes which are in the queue
before it. Therefore, it's difficult to choose a timeout which is long
enough for every case and still not too long.
Such a failure was observed to happen in ARM tests in debug mode. In
order to unblock the CI the timeout is increased from 30 seconds to 3
minutes. As a proper solution, the procedure will most likely have to be
adjusted in a more significant way.
Fixes: #15600
Closes scylladb/scylladb#15618
The method really parses provided path, so the existing name is pretty
confusing. It's extra confusing in the table::get_snapshot_details()
where it's just called and the return value is simply ignored.
Naming it "parse_..." makes it clear what the method is for.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Also rephrase the messages a bit so they are more uniform.
The goal of this change is to make semaphore mismatches easier to
diagnose, by including the table name and the permit name in the
printout.
While at it, add a test for semaphore mismatch, it didn't have one.
Refs: #15485
Closes scylladb/scylladb#15508
Tracing is one of the two global services left out there, with its starting and stopping being pretty hairy. In order to de-globalize it and keep its start-stop under control, the existing start-stop sequence is worth cleaning up. This PR
* removes the create_, start_ and stop_ wrappers to un-hide the global tracing_instance thing
* renames tracing::stop() to shutdown(), as it's in fact a shutdown
* coroutinizes start/shutdown/stop while at it
Squeezed parts from #14156 that don't reorder start-stop calls.
Closes scylladb/scylladb#15611
* github.com:scylladb/scylladb:
main: Capture local tracing reference to stop tracing
tracing: Pack testing code
tracing: Remove stop_tracing() wrapper
tracing: Remove start_tracing() wrapper
tracing: Remove create_tracing() wrapper
tracing: Make shutdown() re-entrable
tracing: Coroutinize start/shutdown/stop
tracing: Rename helper's stop() to shutdown()
If the constructor of row_cache throws, `_partitions` is cleared in the
wrong allocator, possibly causing allocator corruption.
Fix that.
Fixes #15632
Closes scylladb/scylladb#15633
The test does (among other things) the following:
1. Create a cache reader with buffer of size 1 and fill the buffer.
2. Update the cache.
3. Check that the reader produces the first mutation as seen before
the update (because the buffer fill should have snapshotted the first
mutation), and produces other mutation as seen after the update.
However, the test is not guaranteed to stop after the update succeeds.
Even during a successful update, an allocation might have failed
(and been retried by an allocating_section), which will cause the
body of with_allocation_failures to run again. On subsequent runs
the last check (the "3." above) fails, because the first mutation
is snapshotted already with the new version.
Fix that.
Closes scylladb/scylladb#15634
When a tablet is migrated into a new home, we need to clean its storage (i.e. the compaction group) in the old home. This includes its presence in row cache, which can be shared by multiple tablets living in the same shard.
For exception safety, the following is done first in a "prepare phase" during cache invalidation.
1) take a compaction guard, to stop and disable compaction
2) flush memtable(s).
3) build a list of all sstables, which represents all the storage of the tablet.
Then, once the cache is invalidated successfully, we clear the sstable sets of the group in the "execution phase", to prevent any background op from incorrectly picking them and also to allow for their deletion.
All the sstables of a tablet are deleted atomically, in order to guarantee that a failure midway won't cause data resurrection if the tablet happens to be migrated back into the old home.
Closes scylladb/scylladb#15524
* github.com:scylladb/scylladb:
replica: Clean up storage of tablet on migration
replica: Add async gate to compaction_group
replica: Coroutinize compaction_group::stop()
replica: Make compaction group flush noexcept
Define sstable_set_impl::selector_and_schema_t type as a tuple that
contains both a newly created selector and a schema that the selector
is using.
This will allow removal of _schema field from sstable_set class as
the only place it was used was make_incremental_selector.
Signed-off-by: Piotr Jastrzebski <haaawk@gmail.com>
When a tablet is migrated into a new home, we need to clean its
storage (i.e. the compaction group) in the old home.
This includes its presence in row cache, which can be shared by
multiple tablets living in the same shard.
For exception safety, the following is done first in a "prepare
phase" during cache invalidation.
1) take a compaction guard, to stop and disable compaction
2) flush memtable(s).
3) build a list of all sstables, which represents all the
storage of the tablet.
Then, once the cache is invalidated successfully, we clear
the sstable sets of the group in the "execution phase",
to prevent any background op from incorrectly picking them
and also to allow for their deletion.
All the sstables of a tablet are deleted atomically, in order
to guarantee that a failure midway won't cause data resurrection
if the tablet happens to be migrated back into the old home.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This is redundant code that should have been gone a long time ago.
The following snippet (which lies above the code being deleted):
```c++
db.invoke_on_all([] (replica::database& db) {
    db.get_tables_metadata().for_each_table([] (table_id, lw_shared_ptr<replica::table> table) {
        replica::table& t = *table;
        t.enable_auto_compaction();
    });
}).get();
```
provides the same thing as the code being deleted.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#15597
The following new commands are implemented:
* disablebackup
* disablebinary
* disablegossip
* enablebackup
* enablebinary
* enablegossip
* gettraceprobability
* help
* settraceprobability
* statusbackup
* statusbinary
* statusgossip
* version
All are associated with tests. All tests (both old and new) pass with both the scylla-native and the cassandra nodetool implementation.
Refs: https://github.com/scylladb/scylladb/issues/15588
Closes scylladb/scylladb#15593
* github.com:scylladb/scylladb:
tools/scylla-nodetool: implement help operation
tools/scylla-nodetool: implement the traceprobability commands
tools/scylla-nodetool: implement the gossip commands
tools/scylla-nodetool: implement the binary commands
tools/scylla-nodetool: implement backup related commands
tools/scylla-nodetool: implement version command
test/nodetool: introduce utils.check_nodetool_fails_with()
test/nodetool: return stdout of nodetool invokation
test/nodetool/rest_api_mock.py: fix request param matching
tools/scylla-nodetool: compact: remove --partition argument
tools/scylla-nodetool: scylla_rest_client: add support delete method
tools/scylla-nodetool: get rid of check_json_type()
tools/scylla-nodetool: log more details for failed requests
tools/scylla-*: use operation_option for positional options
tools/utils: add support for operation aliases
Checking that nodetool fails with a given message turned out to be a
common pattern, so extract the logic for checking this into a method of
its own. Refactor the existing tests to use it, instead of the
hand-coded equivalent.
This PR updates the information on ScyllaDB vs. Cassandra compatibility. It covers the information from https://github.com/scylladb/scylladb/issues/15563, but there could be more to fix.
@tzach @scylladb/scylla-maint Please review this PR and the page covering our compatibility with Cassandra and let me know if you see anything else that needs to be fixed.
I've added the updates with separate commits in case you want to backport some info (e.g. about AzureSnitch).
Fixes https://github.com/scylladb/scylladb/issues/15563
Closes scylladb/scylladb#15582
* github.com:scylladb/scylladb:
doc: deprecate Thrift in Cassandra compatibility
doc: remove row/key cache from Cassandra compatibility
doc: add AzureSnitch to Cassandra compatibility
The sentence says that if table args are provided, compaction will run on all tables. This is ambiguous, so the sentence is rephrased to specify that compaction will only run on the provided tables.
Closes scylladb/scylladb#15394
There's a finally-chain stopping tracing out there; now it can just use
the deferred stop call and that's it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now it's confusing, as it doesn't stop tracing, but rather shuts it down
on all shards. The only caller of it can be more descriptive without the
wrapper.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Today's shutdown() and its stop() peer are very restrictive in the way
callers should use them. There's not much point in that; making shutdown()
re-entrant, as for other services, will allow relaxing the callers' code
here and in the next patches.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>