scylladb/repair/writer.hh
Asias He 0aabf51380 repair: Fix sstable_list_to_mark_as_repaired with multishard writer
It was observed:

```
test_repair_disjoint_row_2nodes_diff_shard_count was spuriously failing due to
segfault.

backtrace pointed to a failure when allocating an object from the chain of
freed objects, which indicates memory corruption.

(gdb) bt
    at ./seastar/include/seastar/core/shared_ptr.hh:275
    at ./seastar/include/seastar/core/shared_ptr.hh:430
Usual suspect is use-after-free, so ran the reproducer in the sanitize mode,
which indicated shared ptr was being copied into another cpu through the
multi shard writer:

seastar - shared_ptr accessed on non-owner cpu, at: ...
--------
seastar::smp_message_queue::async_work_item<mutation_writer::multishard_writer::make_shard_writer...

```

The multishard writer itself was fine. The problem was that the streaming
consumer for repair copied a shared ptr. This works with identical smp settings,
since only one shard is involved in the consumer path, from the rpc handler all
the way to the consumer. With mixed smp settings, however, the ptr is copied to
the cpus involved, and since the shared ptr is not safe to use across cpus, its
refcount can be corrupted, causing double free or use-after-free.

To fix this, we pass a generic incremental repair handler to the streaming
consumer. The handler is safe to copy to different shards. It is a no-op
if incremental repair is not enabled or if it runs on a different shard.
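
As a rough illustration of that idea (not the actual ScyllaDB code), the sketch
below models a handler that holds only plain values, so copying it never touches
a shard-local refcount, and that does nothing unless it is enabled and invoked on
its owner shard. All names here (`incremental_repair_handler`, `current_shard`,
`mark`, `run_example`) are hypothetical, and the thread-local `current_shard`
is an assumed stand-in for Seastar's shard id.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the id of the shard running the current code.
thread_local unsigned current_shard = 0;

// Hypothetical handler: it stores an owner shard id, a flag and a raw
// pointer, so copying it is a plain value copy with no refcount involved.
class incremental_repair_handler {
    unsigned _owner_shard;
    bool _enabled;
    std::vector<std::string>* _repaired; // dereferenced only on the owner shard
public:
    incremental_repair_handler(unsigned owner_shard, bool enabled,
                               std::vector<std::string>* repaired)
        : _owner_shard(owner_shard), _enabled(enabled), _repaired(repaired) {}

    // Records the sstable name and returns true only when incremental
    // repair is enabled and the call happens on the owner shard.
    bool mark(const std::string& sstable_name) const {
        if (!_enabled || current_shard != _owner_shard) {
            return false; // no-op
        }
        _repaired->push_back(sstable_name);
        return true;
    }
};

// Walks through the three cases and returns how many names were recorded.
std::size_t run_example() {
    std::vector<std::string> repaired; // owned by shard 0
    incremental_repair_handler h(0, true, &repaired);

    incremental_repair_handler copy = h; // copy handed to another shard
    current_shard = 1;                   // pretend we now run on shard 1
    bool on_other_shard = copy.mark("sst-a");

    current_shard = 0;                   // back on the owner shard
    bool on_owner_shard = h.mark("sst-b");

    incremental_repair_handler disabled(0, false, &repaired);
    bool when_disabled = disabled.mark("sst-c");

    assert(!on_other_shard && on_owner_shard && !when_disabled);
    return repaired.size();
}
```

In this sketch only the call made on the owner shard with the handler enabled
records anything; the copy used on the other shard and the disabled handler
leave the list untouched.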

A reproducer test is added. The test reproduced the crash consistently
before the fix and passes after the fix.

Fixes #27666

Closes scylladb/scylladb#27870
2026-01-08 21:55:18 +02:00

#pragma once
#include <seastar/core/future.hh>
#include <seastar/core/shared_ptr.hh>
#include "schema/schema_fwd.hh"
#include "reader_permit.hh"
#include "service/topology_guard.hh"
#include "streaming/stream_reason.hh"
#include "repair/decorated_key_with_hash.hh"
#include "readers/upgrading_consumer.hh"
#include "./sstables/shared_sstable.hh"
using namespace seastar;
namespace db {
class system_distributed_keyspace;
namespace view {
class view_update_generator;
}
}
class mutation_fragment_queue {
public:
    class impl {
        std::vector<mutation_fragment_v2> _pending;
    public:
        virtual future<> push(mutation_fragment_v2 mf) = 0;
        virtual void abort(std::exception_ptr ep) = 0;
        virtual void push_end_of_stream() = 0;
        virtual ~impl() {}
        future<> flush() {
            for (auto&& mf : _pending) {
                co_await push(std::move(mf));
            }
            _pending.clear();
        }
        std::vector<mutation_fragment_v2>& pending() {
            return _pending;
        }
    };
private:
    class consumer {
        std::vector<mutation_fragment_v2>& _fragments;
    public:
        explicit consumer(std::vector<mutation_fragment_v2>& fragments)
            : _fragments(fragments)
        {}
        void operator()(mutation_fragment_v2 mf) {
            _fragments.push_back(std::move(mf));
        }
    };
    seastar::shared_ptr<impl> _impl;
    upgrading_consumer<consumer> _consumer;
public:
    mutation_fragment_queue(schema_ptr s, reader_permit permit, seastar::shared_ptr<impl> impl)
        : _impl(std::move(impl))
        , _consumer(*s, std::move(permit), consumer(_impl->pending()))
    {}
    future<> push(mutation_fragment mf) {
        _consumer.consume(std::move(mf));
        return _impl->flush();
    }
    void abort(std::exception_ptr ep) {
        _impl->abort(ep);
    }
    void push_end_of_stream() {
        _impl->push_end_of_stream();
    }
};
class repair_writer : public enable_lw_shared_from_this<repair_writer> {
    schema_ptr _schema;
    reader_permit _permit;
    // Current partition written to disk
    lw_shared_ptr<const decorated_key_with_hash> _current_dk_written_to_sstable;
    // Is current partition still open. A partition is opened when a
    // partition_start is written and is closed when a partition_end is
    // written.
    bool _partition_opened;
    named_semaphore _sem{1, named_semaphore_exception_factory{"repair_writer"}};
    bool _created_writer = false;
    uint64_t _estimated_partitions = 0;
    // Holds the sstables produced by repair
    sstables::sstable_list _sstables;
public:
    class impl {
    public:
        virtual mutation_fragment_queue& queue() = 0;
        virtual future<> wait_for_writer_done() = 0;
        virtual void create_writer(lw_shared_ptr<repair_writer> writer) = 0;
        virtual ~impl() = default;
    };
private:
    std::unique_ptr<impl> _impl;
    mutation_fragment_queue* _mq;
public:
    repair_writer(
            schema_ptr schema,
            reader_permit permit,
            std::unique_ptr<impl> impl)
        : _schema(std::move(schema))
        , _permit(std::move(permit))
        , _impl(std::move(impl))
        , _mq(&_impl->queue())
    {}
    void set_estimated_partitions(uint64_t estimated_partitions) {
        _estimated_partitions = estimated_partitions;
    }
    uint64_t get_estimated_partitions() {
        return _estimated_partitions;
    }
    void create_writer() {
        _impl->create_writer(shared_from_this());
        _created_writer = true;
    }
    future<> do_write(lw_shared_ptr<const decorated_key_with_hash> dk, mutation_fragment mf);
    future<> wait_for_writer_done();
    named_semaphore& sem() {
        return _sem;
    }
    schema_ptr schema() const noexcept {
        return _schema;
    }
    mutation_fragment_queue& queue() {
        return _impl->queue();
    }
    sstables::sstable_list& get_sstable_list_to_mark_as_repaired() {
        return _sstables;
    }
private:
    future<> write_start_and_mf(lw_shared_ptr<const decorated_key_with_hash> dk, mutation_fragment mf);
    future<> write_partition_end();
    future<> write_end_of_stream();
};
lw_shared_ptr<repair_writer> make_repair_writer(
        schema_ptr schema,
        reader_permit permit,
        streaming::stream_reason reason,
        sharded<replica::database>& db,
        sharded<db::system_distributed_keyspace>& sys_dist_ks,
        sharded<db::view::view_update_generator>& view_update_generator,
        service::frozen_topology_guard topo_guard);