Commit Graph

43077 Commits

Author SHA1 Message Date
Kefu Chai
617e532859 db: config: drop operator<<() for error_injection_at_startup
it is not used anymore, so let's drop it.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18701
2024-05-16 15:10:57 +03:00
Pavel Emelyanov
dffd985401 data_dictionary: Resurrect formatter for keyspace_metadata
It was commented out by a439ebcfce (treewide: include fmt/ranges.h
and/or fmt/std.h), probably by mistake.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#18665
2024-05-16 15:09:45 +03:00
Pavel Emelyanov
31d05925cc api,database: Move auto-compaction toggle guard
Toggling per-table auto-compaction enabling bit is guarded with
on-database boolean and raii guard. It's only used by a single
api/column_family.cc file, so it can live there.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-05-16 14:42:51 +03:00
Pavel Emelyanov
a43b178f72 api: Move some table manipulation helpers from storage_service
Continuation of the previous patch -- helpers toggling tombstone_gc and
auto_compaction on tables should live in the same file that uses them.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-05-16 14:42:50 +03:00
Pavel Emelyanov
862fcd7bc7 api: Move table-related calls from storage_service domain
The storage_service/(enable|disable)_(tombstone_gc|auto_compaction)
endpoints are not handled by storage_service _service_ and should rather
live in the column_family/ domain which is handled by replica::database.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-05-16 14:42:50 +03:00
Pavel Emelyanov
ba53283d21 api: Reimplement some endpoints using existing helpers
The (enable|disable)_(tombstone_gc|auto_compaction) endpoints living in
column_family domain can benefit from the helpers that do the same in
the storage_service domain. The "difference" is that c.f. endpoints do
it per-table, while s.s. ones operate on a vector of tables, so the
former is a corner case of the latter.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-05-16 14:42:50 +03:00
Pavel Emelyanov
231ffa623c api: Lost unset of tombstone-gc endpoints
On stop, all endpoints must be unregistered; these three were missed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-05-16 14:42:50 +03:00
Michał Jadwiszczak
b3e6a39604 test/cql-pytest/test_describe: add test for UDTs ordering 2024-05-16 13:30:03 +02:00
Michał Jadwiszczak
f29820fb27 cql3/statements/describe_statement: UDTs topological sorting
User-defined types can depend on each other, forming a directed acyclic
graph.
In order to support restoring schema from `DESC SCHEMA`, UDTs should be
ordered topologically, not alphabetically as it was till now.
2024-05-16 13:30:03 +02:00
Michał Jadwiszczak
7be938192b cql3/statements/describe_statement: allow to skip alphabetical sorting
In a next commit, we are going to introduce topological sorting of
user-defined types, so alphabetical sorting must be skipped not to
interfere.
2024-05-16 13:30:03 +02:00
Michał Jadwiszczak
8157d260f2 types: add a method to get all referenced user types
The method allows to collect all UDTs used to create a type.
This is required to sort UDTs in a topological order.
2024-05-16 13:30:03 +02:00
Michał Jadwiszczak
573e13e3f1 db/cql_type_parser: use generic topological sorting 2024-05-16 13:30:03 +02:00
Michał Jadwiszczak
3830f3bd23 db/cql_type_parser: futurize raw_builder::build()
In order to use generic topological sort,
build() method needs to return future.
2024-05-16 13:30:03 +02:00
Michał Jadwiszczak
7f04c88395 test/boost: add test for topological sorting 2024-05-16 13:30:03 +02:00
Michał Jadwiszczak
aa08e586fd utils: introduce generic topological sorting algorithm
Until now, we have implemented topological sorting in
db/cql_type_parser.cc but it is specific to its usage.

Now we want to use topological sorting in another place,
so a generic sorting algorithm provides one implementation
to be reused in several places.
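
As a rough illustration of what a reusable topological sort can look like, here is a minimal sketch. It is not the code added under utils/: the name `topological_sort`, its signature, and the choice of Kahn's algorithm are assumptions made for this example, and it is synchronous, whereas the related build() change suggests the real helper returns a future.

```c++
#include <cassert>
#include <cstddef>
#include <map>
#include <optional>
#include <queue>
#include <string>
#include <vector>

// Hypothetical helper, not the Scylla implementation.
// `deps` maps each node to the nodes it depends on; every dependency is
// expected to be a key of `deps` as well. Returns the nodes ordered so that
// dependencies come first, or std::nullopt if no such order exists
// (a cycle, or a dependency that is not a key).
template <typename T>
std::optional<std::vector<T>> topological_sort(const std::map<T, std::vector<T>>& deps) {
    std::map<T, std::size_t> unmet;          // dependencies not yet emitted
    std::map<T, std::vector<T>> dependents;  // reverse edges
    for (const auto& [node, ds] : deps) {
        unmet[node] = ds.size();
        for (const auto& d : ds) {
            dependents[d].push_back(node);
        }
    }
    std::queue<T> ready;
    for (const auto& [node, n] : unmet) {
        if (n == 0) {
            ready.push(node);
        }
    }
    std::vector<T> sorted;
    while (!ready.empty()) {
        T node = ready.front();
        ready.pop();
        sorted.push_back(node);
        // Emitting `node` satisfies one dependency of each of its dependents.
        for (const auto& dependent : dependents[node]) {
            if (--unmet[dependent] == 0) {
                ready.push(dependent);
            }
        }
    }
    if (sorted.size() != deps.size()) {
        return std::nullopt;
    }
    return sorted;
}
```

With three hypothetical UDTs where `m_person` uses `a_address` and `a_address` uses `z_street`, the sketch orders them as `z_street`, `a_address`, `m_person`, which is the order a schema restore needs and differs from the alphabetical one.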
2024-05-16 13:30:03 +02:00
Nadav Har'El
27ab560abd cql: fix hang during certain SELECT statements
The function intersection(r1,r2) in statement_restrictions.cc is used
when several WHERE restrictions were applied to the same column.
For example, for "WHERE b<1 AND b<2" the intersection of the two ranges
is calculated to be b<1.

As noted in issue #18690, Scylla is inconsistent in where it allows or
doesn't allow these intersecting restrictions. But where they are
allowed they must be implemented correctly. And it turns out the
function intersection() had a bug that caused it to sometimes enter
an infinite loop - when the intent was only to call itself once with
swapped parameters.
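
The "call itself once with swapped parameters" pattern can be sketched as below. This is a simplified illustration, not the code in statement_restrictions.cc: the types `upper_bound`, `value_list` and `value_set` and the visitor are hypothetical stand-ins, and the message does not state the exact shape of the original bug. The point of the sketch is that the mirrored overload delegates to a different, concrete overload, so the swap cannot repeat.

```c++
#include <algorithm>
#include <cassert>
#include <variant>
#include <vector>

// Hypothetical, simplified value sets (assumptions for this example only).
struct upper_bound { int limit; };               // all values v with v < limit
struct value_list { std::vector<int> values; };  // an explicit list of values
using value_set = std::variant<upper_bound, value_list>;

struct intersection_visitor {
    // "b < 1 AND b < 2" gives "b < 1".
    value_set operator()(const upper_bound& a, const upper_bound& b) const {
        return upper_bound{std::min(a.limit, b.limit)};
    }
    value_set operator()(const value_list& a, const value_list& b) const {
        value_list out;
        for (int v : a.values) {
            if (std::find(b.values.begin(), b.values.end(), v) != b.values.end()) {
                out.values.push_back(v);
            }
        }
        return out;
    }
    value_set operator()(const value_list& a, const upper_bound& b) const {
        value_list out;
        for (int v : a.values) {
            if (v < b.limit) {
                out.values.push_back(v);
            }
        }
        return out;
    }
    // Mirrored case: swap exactly once. The call resolves to the
    // (value_list, upper_bound) overload above, which never swaps back.
    value_set operator()(const upper_bound& a, const value_list& b) const {
        return (*this)(b, a);
    }
};

value_set intersection(const value_set& a, const value_set& b) {
    return std::visit(intersection_visitor{}, a, b);
}
```

A swap that is written generically, so that the swapped call can match the same swapping overload again, recurses without end; keeping one non-swapping overload per unordered pair of types avoids that.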

This patch includes a test reproducing this bug, and a fix for the
bug. The test hangs before the fix, and passes after the fix.

While at it, I carefully reviewed the entire code used to implement
the intersection() function to try to make sure that the bug we found
was the only one. I also added a few more comments where I thought they
were needed to understand complicated logic of the code.

The bug, the fix and the test were originally discovered by
Michał Chojnowski.

Fixes #18688
Refs #18690

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#18694
2024-05-16 11:25:44 +03:00
Piotr Dulikowski
68eca3778c Merge 'mv: throttle view update generation for large queries' from Wojciech Mitros
This series is a reupload of #13792 with a few modifications, namely a test is added and the conflicts with recent tablet related changes are fixed.

See https://github.com/scylladb/scylladb/issues/12379 and https://github.com/scylladb/scylladb/pull/13583 for a detailed description of the problem and discussions.

This PR aims to extend the existing throttling mechanism to work with requests that internally generate a large amount of view updates, as suggested by @nyh.

The existing mechanism works in the following way:

* Client sends a request, we generate the view updates corresponding to the request and spawn background tasks which will send these updates to remote nodes
* Each background task consumes some units from the `view_update_concurrency_semaphore`, but doesn't wait for these units, it's just for tracking
* We keep track of the percent of consumed units on each node, this is called `view update backlog`.
* Before sending a response to the client we sleep for a short amount of time. The amount of time to sleep for is based on the fullness of this `view update backlog`. For a well behaved client with limited concurrency this will limit the amount of incoming requests to a manageable level.

This mechanism doesn't handle large DELETE queries. Deleting a partition is fast for the base table, but it requires us to generate a view update for every single deleted row. The number of deleted rows per single client request can be in the millions. Delaying response to the request doesn't help when a single request can generate millions of updates.

To deal with this we could treat the view update generator just like any other client and force it to wait a bit of time before sending the next batch of updates. The amount of time to wait for is calculated just like in the existing throttling code, it's based on the fullness of `view update backlogs`.

The new algorithm of view update generation looks something like this:
```c++
for(;;) {
    auto updates = generate_updates_batch_with_max_100_rows();
    co_await seastar::sleep(calculate_sleep_time_from_backlogs());
    spawn_background_tasks_for_updates(updates);
}
```
Fixes: https://github.com/scylladb/scylladb/issues/12379

Closes scylladb/scylladb#16819

* github.com:scylladb/scylladb:
  test: add test for bad_allocs during large mv queries
  mv: throttle view update generation for large queries
  exceptions: add read_write_timeout_exception, a subclass of request_timeout_exception
  db/view: extract view throttling delay calculation to a global function
  view_update_generator: add get_storage_proxy()
  storage_proxy: make view backlog getters public
2024-05-16 08:22:54 +02:00
Botond Dénes
af9e173c99 Merge 'repair: Don't get topology via database' from Pavel Emelyanov
Database has token-metadata onboard and other services use it to get topology from. Repair code has simpler and cleaner ways to get access to topology.

Closes scylladb/scylladb#18677

* github.com:scylladb/scylladb:
  repair: Get topology via replication map
  repair: Use repair_service::my_address() in handlers
  repair: Remove repair_meta::_myip
  repair: Use repair_meta::myip() everywhere
  repair: Add repair_service::my_address() method
2024-05-16 08:28:14 +03:00
Raphael S. Carvalho
715ae689c0 Implement fast streaming for intra-node migration
With intra-node migration, all the movement is local, so we can make
streaming faster by just cloning the sstable set of leaving replica
and loading it into the pending one.

This cloning is underlying storage specific, but s3 doesn't support
snapshot() yet (the sstables::storage procedure which clone is built
upon). It's only supported by file system, with help of hard links.
A new generation is picked for new cloned sstable, and it will
live in the same directory as the original.

A challenge I bumped into was to understand why table refused to
load the sstable at pending replica, as it considered them foreign.
Later I realized that sharder (for reads) at this stage of migration
will point only to leaving replica. It didn't fail with mutation
based streaming, because the sstable writer considers the shard --
that the sstable was written into -- as its owner, regardless of what
sharder says. That was fixed by mimicking this behavior during
loading at pending.

test:
./test.py --mode=dev intranode --repeat=100 passes.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
a179f37780 test: tablets_test: Test sharding during intra-node migration 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
5f32d2ddb6 test: tablets_test: Check sharding also on the pending host 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
6d809c75fb test: py: tablets: Test writes concurrent with migration 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
ad02d85c16 test: py: tablets: Test crash during intra-node migration 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
7956a2991e api, storage_service: Introduce API to wait for topology to quiesce 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
679baff25a dht, replica: Remove deprecated sharder APIs 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
32a191384a test: Avoid using deprecated sharder API
There is no tablet migration in unit tests, so shard_of() can be
safely replaced with shard_for_reads(). Even if it's used for writes.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
539460dd71 db: do_apply_many() avoid deprecated sharder API 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
0f50504c39 replica: mutation_dump: Avoid deprecated sharder API 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
7bf5733fa5 repair: Avoid deprecated sharder API 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
7c03646f99 table: Remove optimization which returns empty reader when key is not owned by the shard
This check would lead to correctness issues with intra-node migration
because the shard may switch during read, from "read old" to "read
new". If the coordinator used "read old" for shard routing, but table
on the old shard is already using "read new" erm, such a read would
observe empty result, which is wrong.

Drop the optimization. In the scenario above, read will observe all
past writes because:

  1) writes are still using "write both"

  2) writes are switched to "write new" only after all requests which
  might be using "read old" are done

Replica-side coordinators should already route single-key requests to
the correct shard, so it's not important as an optimization.

This issue shows how assumptions about static sharding are embedded in
the current code base and how intra-node migration, by violating those
assumptions, can lead to correctness issues.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
26f2e6aa8e dht: is_single_shard: Avoid deprecated sharder API
All current uses are in the read path.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
c9e6b4dca7 dht: split_range_to_single_shard: Work with static_sharder only
In preparation for intra-node tablet migration, to avoid
using deprecated sharder APIs.

This function is used for generating sstable sharding metadata.
For tablets, it is not invoked, so we can safely work with the
static sharder. The call site already passes static_sharder only.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
c380aecf64 dht: ring_position_range_sharder: Avoid deprecated sharder APIs
In preparation for tablet intra-node migration.

Existing uses are for reads, so it's safe to use shard_for_reads():
  - in multishard reader
  - in forward_service

The ring_position_range_vector_sharder is used when computing sstable
shards, which for intra-node migration should use the view for
reads. If we haven't completed streaming, sstables should be attached
to the old shard (used by reads). When in write-both-read-new stage,
streaming is complete, reads are using the new shard, and we should
attach sstables to the new shard.

When not in intra-node migration, the view for reads on the pending
node will return the pending shard even if read selector is "read old".
So if the pending node restarts during streaming, we will attach sstables
to the shard which is used by writes even though we're using the selector
for reads.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
a1aac409bf dht: token: Avoid use of deprecated sharder API by switching to static_sharder
The touched APIs are used only with static_sharder.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
dd4a086b87 selective_token_sharder: Avoid use of deprecated sharder API
I analyzed all the uses and all except the alternator/ttl.cc seem to
be interested in the result for the purpose of reading.

Alternator is not supported with tablets yet, so the use was annotated
with a relevant issue.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
eb3a22d5a8 docs: Document tablet sharding vs tablet replica placement 2024-05-16 00:28:47 +02:00
Botond Dénes
635aba435b readers/multishard.cc: use shard_for_reads() instead of shard_of()
The latter is deprecated.
2024-05-16 00:28:47 +02:00
Botond Dénes
bc779ed00c multishard_mutation_query.cc: use shard_for_reads() instead of shard_of()
The latter is deprecated.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
3b7d7088d1 storage_proxy: Extract common code to apply mutations on many shards according to sharder 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
660b3d1765 storage_proxy: Prepare per-partition rate-limiting for intra-node migration
Note: there is a potential problem with rate-limit count going out of sync
during intra-node migration between old and the new shard.

Before this patch, when the coordinator had already accounted and admitted
the request, so that the rate_limit_info passed to apply_locally() was
account_only, it was converted to std::monostate for requests to the
local replica. This makes sense because the request was already
accounted by the coordinator.

However, during intra-node migration when we do double writes to two
shards locally, that means that the new shard will not account the
write, it will have lower count than the limiter on the old
shard. This means that the new shard may accept writes which will end
up being rejected. This is not desirable, but not the end of the world
since it's temporary, and the new shard will still protect itself from
overload based on its own rate limiter.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
7c3291b5ea storage_proxy: Avoid shard_of() use in mutate_counter_on_leader_and_replicate()
Counters are not supported with tablets, so we should not reach this path.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
db2809317d storage_proxy: Prepare mutate_hint() for intra-node tablet migration 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
feafe0f6a7 commitlog_replayer: Avoid deprecated sharder::shard_of()
shard_for_writes() is appropriate, because we're writing.  It can
happen that the tablet was migrated away and no shard is the owner. In
that case the mutation is dropped, as it should be, because "shards"
is empty.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
c9294b1642 lwt: Avoid deprecated sharder::shard_of()
Instead, use shard_for_reads(). The justification is that:

 1) In cas_shard(), we need to pick a single request coordinator.
    shard_for_reads() gives that, which is equivalent to shard_of()
    if there is no intra-node migration.

 2) In paxos handler for prepare(), the shard we execute it on is
    the shard from which we read, so shard_for_reads() is the one.

 3) Updates of paxos state are separate CQL requests, and use their
    own sharding.

 4) Handler for learn is executing updates using calls to
    storage_proxy::mutate_locally() which will use the right sharder for writes

However, the code is still not prepared for intra-node migration, and
possibly regular migration too in case of abandoned requests, because
the locking of paxos state assumes that the shard is static. That
would have to be fixed separately, e.g. by locking both shards
(shard_for_writes()) during migration, so that the set of locked
shards always intersects during migration and local serialization of
paxos state updates is achieved. I left FIXMEs for that.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
1631bab658 compaction: Avoid deprecated sharder::shard_of() 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
9da3bd84c7 dht: Extract dht::static_sharder
Before the patch, dht::sharder could be instantiated and it would
behave like a static sharder. This is not safe with regards to
extensions of the API because if a derived implementation forgets to
override some method, it would incorrectly default to the
implementation from static sharder. Better to fail the compilation in
this case, so extract static sharder logic to dht::static_sharder
class and make all methods in dht::sharder pure virtual.

This also allows us to have algorithms indicate that they only work
with static sharder by accepting the type, and have compile-time
safety for this requirement.

schema::get_sharder() is changed to return the static_sharder&.
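
The design described above can be sketched in a few lines. This is an illustration under assumptions, not the dht code: the `token` alias, the single-method interface, the modulo mapping and the `same_shard()` helper are all hypothetical simplifications.

```c++
#include <cassert>
#include <cstdint>
#include <type_traits>

using token = std::int64_t;  // simplified stand-in for a real token type
using shard_id = unsigned;

// Interface only: every method is pure virtual, so a derived class that
// forgets an override fails to compile instead of silently inheriting
// static-sharding behaviour.
class sharder {
public:
    virtual ~sharder() = default;
    virtual shard_id shard_for_reads(token t) const = 0;
};

static_assert(std::is_abstract_v<sharder>, "sharder cannot be instantiated directly");

// Static sharding logic lives in its own concrete class.
class static_sharder final : public sharder {
    unsigned _shard_count;
public:
    explicit static_sharder(unsigned shard_count) : _shard_count(shard_count) {
        assert(shard_count > 0);
    }
    // Assumption: a plain modulo mapping, chosen only to keep the sketch short.
    shard_id shard_for_reads(token t) const override {
        return static_cast<shard_id>(static_cast<std::uint64_t>(t) % _shard_count);
    }
};

// An algorithm that is only valid when the token-to-shard mapping never
// changes states that requirement in its signature.
bool same_shard(const static_sharder& s, token a, token b) {
    return s.shard_for_reads(a) == s.shard_for_reads(b);
}
```

Passing any other `sharder` implementation to `same_shard()` is rejected at compile time, which is the compile-time safety the message refers to.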
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
dbca598e99 replica: Deprecate table::shard_of() 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
a1bee16ee9 locator: Deprecate effective_replication_map::shard_of() 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
10a4903d0c dht: Deprecate old sharder API: shard_of/next_shard/token_for_next_shard
Require users to specify whether we want shard for reads or for writes
by switching to appropriate non-deprecated variant.

For example, shard_of() can be replaced with shard_for_reads() or
shard_for_writes().

The next_shard/token_for_next_shard APIs have only for-reads variant,
and the act of switching will be a testimony to the fact that the code
is valid for intra-node migration.
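
Why a single shard_of() is ambiguous during intra-node migration can be shown with a small model. The sketch below is hypothetical: the `intra_node_migration` struct, the `migration_stage` enum and the free functions are assumptions for illustration, with stage names borrowed from the "write both" / "read old" / "read new" wording used in this series.

```c++
#include <cassert>
#include <vector>

using shard_id = unsigned;

// Hypothetical stages of moving a tablet between two shards of one node.
enum class migration_stage { none, write_both_read_old, write_both_read_new };

struct intra_node_migration {
    shard_id old_shard;
    shard_id new_shard;
    migration_stage stage;
};

// Reads always go to exactly one shard.
shard_id shard_for_reads(const intra_node_migration& m) {
    switch (m.stage) {
    case migration_stage::none:
    case migration_stage::write_both_read_old:
        return m.old_shard;
    case migration_stage::write_both_read_new:
        return m.new_shard;
    }
    return m.old_shard;  // unreachable for valid enum values
}

// Writes may have to reach both shards while the migration is in progress.
std::vector<shard_id> shard_for_writes(const intra_node_migration& m) {
    if (m.stage == migration_stage::none) {
        return {m.old_shard};
    }
    return {m.old_shard, m.new_shard};
}
```

Outside of migration the two answers agree, which is why shard_of() used to be sufficient; once the stage is write_both_read_old or write_both_read_new they differ, so each call site has to state which one it means.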
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
b3cdf9a379 tests: tablets: py: Add intra-node migration test 2024-05-16 00:28:47 +02:00