Remove the existing copy constructor to enable the use of the implicit
copy constructor. This fixes the issue of `_sstable_set_ids` not being
copied in the current copy constructor.
Fixes#19519
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 44583eed9e)
Add more logging that provide more visibility into what happens during
topology loading.
Message-ID: <ZnE5OAmUbExVZMWA@scylladb.com>
(cherry picked from commit fb764720d3)
incremental_reader_selector is the mechanism for incremental comsumption
of disjoint sstables on range reads.
tablet_sstable_set was implemented, such that selector is efficient with
tablets.
The problem is selector is vnode addicted and will only consider a given
set exhausted when maximum token is reached.
With tablets, that means a range read on first tablet of a given shard
will also consume other tablets living in the same shard. That results
in combined reader having to work with empty sstable readers of tablets
that don't intersect with the range of the read. It won't cause extra
I/O because the underlying sstables don't intersect with the range of
the read. It's only unnecessary CPU work, as it involves creating
readers (= allocation), feeding them into combined reader, which will
in turn invoke the sstable readers only to realize they don't have any
data for that range.
With 100k tablets (ranges), and 100 tablets per shard, and ~5 sstables
per tablet, there will be this amount of readers (empty or not):
(100k * ((100^2 + 100) / 2) * avg_sstable_per_tablet=5) = ~2.5 billions.
~5000 times more readers, it can be quite significant additional cpu
work, even though I/O dominates the most in scans. It's an inefficiency
that we rather get rid of.
The behavior can be observed from logs (there's 1 sstable for each of
4 tablets, but note how readers are created for every single one of
them when reading only 1 tablet range):
```
table - make_reader_v2 - range=(-inf, {-4611686018427387905, end}]
incremental_reader_selector - create_new_readers(null): selecting on pos {minimum token, w=-1}
sstable - make_reader - reader on (-inf, {-4611686018427387905, end}] for sst 3gfx_..._34qn42... that has range [{-9151620220812943033, start},{-4813568684827439727, end}]
incremental_reader_selector - create_new_readers(null): selecting on pos {-4611686018427387904, w=-1}
sstable - make_reader - reader on (-inf, {-4611686018427387905, end}] for sst 3gfx_..._368nk2... that has range [{-4599560452460784857, start},{-78043747517466964, end}]
incremental_reader_selector - create_new_readers(null): selecting on pos {0, w=-1}
sstable - make_reader - reader on (-inf, {-4611686018427387905, end}] for sst 3gfx_..._38lj42... that has range [{851021166589397842, start},{3516631334339266977, end}]
incremental_reader_selector - create_new_readers(null): selecting on pos {4611686018427387904, w=-1}
sstable - make_reader - reader on (-inf, {-4611686018427387905, end}] for sst 3gfx_..._3dba82... that has range [{5065088566032249228, start},{9215673076482556375, end}]
```
Fix is about making sure the tablet set won't select past the
supplied range of the read.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#18556
Closesscylladb/scylladb#18616
* github.com:scylladb/scylladb:
replica: Make it explicit table's sstable set is immutable
replica: avoid reallocations in tablet_sstable_set
replica: Avoid compound set if only one sstable set is filled
Add rwlock which prevents storage groups from being added/deleted
while some other layers itereates over them (or their compaction
groups).
Add methods to iterate over storage groups with the lock held.
In the following patches, storage groups (and so also sstables sets)
will be allocated only for tablets that are located on this shard.
Some layers may try to read non-existing sstable sets.
Handle this case as if the sstables set was empty instead of calling
on_internal_error.
Unfreeze_gently doesn't have to be a method of
frozen_mutation. It might as well be implemented as
a free function reading from a frozen_mutation
and preparing a mutation gently.
The logic will be used in a later patch
to make a canonical mutation directly from
a frozen_mutation instead of unfreezing it
and then converting it to a canonical_mutation.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we include `fmt/ranges.h` and/or `fmt/std.h`
for formatting the container types, like vector, map
optional and variant using {fmt} instead of the homebrew
formatter based on operator<<.
with this change, the changes adding fmt::formatter and
the changes using ostream formatter explicitly, we are
allowed to drop `FMT_DEPRECATED_OSTREAM` macro.
Refs scylladb#13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Just like leaving replica could be optional when adding replica to
tablet, the pending replica can be optional too if we're removing a
replica from tablet
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Will be used by topology loading code to determine which hosts are
needed in topology, even if they're in the left state. We want to load
only left nodes if they are referenced by any tablet, which may happen
temporarily until the replacement replica is rebuilt.
Use co_await unfreeze_gently in the loop body
unfreezing each partition mutation to prevent
reactor stalls when building group0 snapshot
with lots of tablets.
Fixes#15303
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#17688
Make a specialized sstable_set for tablets
via tablet_storage_group_manager::make_sstable_set.
This sstable set takes a snapshot of the storage_groups
(compound) sstable_sets and maps the selected tokens
directly into the tablet compound_sstable_set.
Refs #16876
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The new metadata describes the ongoing resize operation (can be either
of merge, split or none) that spans tablets of a given table.
That's managed by group0, so down nodes will be able to see the
decision when they come back up and see the changes to the
metadata.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The name of the keyspace being part of the partition key is not useful,
the table_id already uniquely identifies the table. The keyspace name
being part of the key, means that code wanting to interact with this
table, often has to resolve the table id, just to be able to provide the
keyspace name. This is counter productive, so make the keyspace_name
just a static column instead, just like table_name already is.
Fixes: #16377Closesscylladb/scylladb#16881
Previously, the tablet information was sent to the drivers
in two pieces within the custom_payload. We had information
about the replicas under the `tablet_replicas` key and token range
information under `token_range`. These names were quite generic
and might have caused problems for other custom_payload users.
Additionally, dividing the information into two pieces raised
the question of what to do if one key is present while the other
is missing.
This commit changes the serialization mechanism to pack all information
under one specific name, `tablets-routing-v1`.
From: Sylwia Szunejko <sylwia.szunejko@scylladb.com>
Closesscylladb/scylladb#16148
ClangBuildAnalyzer reports cql3/cql_statement.hh as being one of the
most expensive header files in the project - being included (mostly
indirectly) in 129 source files, and costing a total of 844 CPU seconds
of compilation.
This patch is an attempt, only *partially* successful, to reduce the
number of times that cql_statement.hh is included. It succeeds in
lowering the number 129 to 99, but not less :-( One of the biggest
difficulties in reducing it further is that query_processor.hh includes
a lot of templated code, which needs stuff from cql_statement.hh.
The solution should be to un-template the functions in
query_processor.hh and move them from the header to a source file, but
this is beyond the scope of this patch and query_processor.hh appears
problematic in other respects as well.
Unfortunately the compilation speedup by this patch is negligible
(the `du -bc build/dev/**/*.o` metric shows less than 0.01% reduction).
Beyond the fact that this patch only removes 30% of the inclusions of
this header, it appears that most of the source files that no longer
include cql_statement.hh after this patch, included anyway many of the
other headers that cql_statement.hh included, so the saving is minimal.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#15212
The system.tablets table stores replica sets as a CQL set type,
which is sorted. This means that if, in a tablet replica set
[n1, n2, n3] n2 is replaced with n4, then on reload we'll see
[n1, n3, n4], changing the relative position of n3 from the third
replica to the second.
The relative position of replicas in a replica set is important
for materialized views, as they use it to pair base replicas with
view replicas. To prepare for materialized views using tablets,
change the persistent data type to list, which preserves order.
The code that generates new replica sets already preserves order:
see locator::replace_replica().
While this changes the system schema, tablets are an experimental
feature so we don't need to worry about upgrades.
Closes#15111
It's needed to implement tablet migration. It stores the current step
of tablet migration state machine. The state machine will be advanced
by the topology change coordinator.
See the "Tablet migration" section of topology-over-raft.md
The methods that take storage_proxy as argument can now accept a
replica::database instead. So update their signatures and update all
callers. With that, system_keyspace.* no longer depends on storage_proxy
directly.
The code was incorrectly passing a data_value of type bytes due to
implicit conversion of the result of serialize() (bytes_opt) to a
data_value object of type bytes_type via:
data_value(std::optional<NativeType>);
mutation::set_static_cell() accepts a data_value object, which is then
serialized using column's type in abstract_type::decompose(data_value&):
bytes b(bytes::initialized_later(), serialized_size(*this, value._value));
auto i = b.begin();
value.serialize(i);
Notice that serialized_size() is taken from the column type, but
serialization is done using data_value's type. The two types may have
a compatible CQL binary representation, but may differ in native
types. serialized_size() may incorrectly interpret the native type and
come up with the wrong size. If the size is too smaller, we end up with
stack or heap corruption later after serialize().
For example, if the column type is utf8 but value holds bytes, the
size will be wrong because even though both use the basic_sstring
type, they have a different layout due to max_size (15 vs 31).
Fixes#13717Closes#13787