Doing it with format("{}", foo) is correct, but to_string is
a bit more lightweight.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#28630
Add explicit erm-holding variables in all replica-side RPC handlers.
This is required to ensure that tablet migration waits for in-flight
replica requests even if a non-replica coordinator has been fenced out.
Holding erms on the replica side may increase the global-barrier wait
time, since the barrier must drain these requests. We believe this
is acceptable because:
* We already hold erms during replica-side request execution, but in
an ad-hoc, non-systemic way in lower layers of storage_proxy
(e.g. in sp::mutate_locally and do_query_tablets).
* Replica requests are bounded by replica-side timeouts, so the
global-barrier wait time cannot exceed the maximum of these timeouts.
For Paxos verbs, we use token_metadata_guard, which wraps the ERM and
automatically refreshes it when tablet migration does not affect the
current token; see the token_metadata_guard comments for details.
We use this guard only for Paxos verbs because regular reads and writes
already hold raw erms in storage_proxy and on the coordinators.
The erms must be held in all RPC handlers that support fencing — that
is, those with a fencing_token parameter in storage_proxy.idl.
Counter updates already hold erms in
mutate_counter_on_leader_and_replicate.
Fix test_tablets2::test_timed_out_reader_after_cleanup: the tablets
barrier now waits for all nodes. As a result, the replica read
is expected to finish, rather than fail due to the tablet having
moved as it did previously. The test is renamed to
test_tablets_barrier_waits_for_replica_erms to better reflect its
purpose.
Refs scylladb/scylladb#26864
It allows dropping the local lambdas passed into with_scheduling_group()
calls. Overall the code flow becomes more readable.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In this PR we add a basic implementation of the strongly-consistent tables:
* generate raft group id when a strongly-consistent table is created
* persist it into system.tables table
* start raft groups on replicas when a strongly-consistent tablet_map reaches them
* add strongly-consistent version of the storage_proxy, with the `query` and `mutate` methods
* the `mutate` method submits a command to the tablets raft group, the query method reads the data with `raft.read_barrier()`
* strongly-consistent versions of the `select_statement` and `modification_statement` are added
* a basic `test_strong_consistency.py/test_basic_write_read` is added which to check that we can write and read data in a strongly consistent fashion.
Limitations:
* for now the strongly consistent tables can have tablets only on shard zero. This is because we (ab/re) use the existing raft system tables which live only on shard0. In the next PRs we'll create separate tables for the new tablets raft groups.
* No Scylla-side proxying - the test has to figure out who is the leader and submit the command to the right node. This will be fixed separately.
* No tablet balancing -- migration/split/merges require separate complicated code.
The new behavior is hidden behind `STRONGLY_CONSISTENT_TABLES` feature, which is enabled when the `STRONGLY_CONSISTENT_TABLES` experimental feature flag is set.
Requirements, specs and general overview of the feature can be found [here](https://scylladb.atlassian.net/wiki/spaces/RND/pages/91422722/Strong+Consistency). Short term implementation plan is [here](https://docs.google.com/document/d/1afKeeHaCkKxER7IThHkaAQlh2JWpbqhFLIQ3CzmiXhI/edit?tab=t.0#heading=h.thkorgfek290)
One can check the strongly consistent writes and reads locally via cqlsh:
scylla.yaml:
```
experimental_features:
- strongly-consistent-tables
```
cqlsh:
```
CREATE KEYSPACE IF NOT EXISTS my_ks WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 1} AND tablets = {'initial': 1} AND consistency = 'local';
CREATE TABLE my_ks.test (pk int PRIMARY KEY, c int);
INSERT INTO my_ks.test (pk, c) VALUES (10, 20);
SELECT * FROM my_ks.test WHERE pk = 10;
```
Fixes SCYLLADB-34
Fixes SCYLLADB-32
Fixes SCYLLADB-31
Fixes SCYLLADB-33
Fixes SCYLLADB-56
backport: no need
Closesscylladb/scylladb#27614
* https://github.com/scylladb/scylladb:
test_encryption: capture stderr
test/cluster: add test_strong_consistency.py
raft_group_registry: disable metrics for non-0 groups
strong consistency: implement select_statement::do_execute()
cql: add select_statement.cc
strong consistency: implement coordinator::query()
cql: add modification_statement
cql: add statement_helpers
strong consistency: implement coordinator::mutate()
raft.hh: make server::wait_for_leader() public
strong_consistency: add coordinator
modification_statement: make get_timeout public
strong_consistency: add groups_manager
strong_consistency: add state_machine and raft_command
table: add get_max_timestamp_for_tablet
tablets: generate raft group_id-s for new table
tablet_replication_strategy: add consistency field
tablets: add raft_group_id
modification_statement: remove virtual where it's not needed
modification_statement: inline prepare_statement()
system_keyspace: disable tablet_balancing for strongly_consistent_tables
cql: rename strongly_consistent statements to broadcast statements
Adds a "sstables" array member to manifest.json.
For each sstables, keep the following metadata:
id - a uuid for the sstable (the sstable identifier
if the use-sstable-identifier option was used, otherwise
the sstable uuid generation)
toc_name - the name of the TOC.txt file
data_size and index_size - in bytes
first_token and last_token - of the sstable first and last keys.
Fixes: SCYLLADB-196
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add a table member to manifest.json with the keyspace_name,
table_name, table_id, tablets_type, and, for tablets-enabled tables, get
tablet_count on each shard and write the minimum to manifest.json.
For vnodes-based tables, tablet_count=0.
For now, `tablets_type` may be either `none` for vnodes tables, or
`powof2` for tablets tables. In the future, when we support arbitrary
tablt boundaries, this will be reflected here, and it is likely we
would backup the whole tablets map sperately to get all tablet boundaries.
Fixes SCYLLADB-195
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And keep the options for now in the local_snapshot_writer.
The options will be used by following patches to pass
extra metadata like the snapshot creation time, expiration time, etc.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add metadata about the node: host_id, datacenter, and rack.
This enables dc- or rack- aware restore.
Today this information is "encoded" into the snapshot hierarchy
prefixes, but if all manifest files would be stored in a flat
directory, we'd need to encode that metadata in the object name,
but it'd be better for the manifest contents to be self descriptive.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add metadata about the manifest itself:
A version and the manifest scope (currently "node",
but in the future, may also be "shard", or "tablet")
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Strongly consistent writes require knowing the maximum timestamp of
locally applied mutations to guarantee monotonically increasing
timestamps for subsequent writes.
This commit adds a function that returns the maximum timestamp for a
given tablet.
Why it is safe to use this function with deleted cells:
* Tombstones are included in memtable.get_max_timestamp() calculations.
* The maximum timestamp of a memtable is used to initialize the maximum
timestamp of the resulting sstable.
* During compaction, a new sstable’s maximum timestamp is initialized as
the maximum of the contributing sstables.
Use coroutine::as_future() to avoid exceptions taking flight and
triggering expensive stack-unwinding.
Especially bad for common exceptions like timeouts.
Not using coroutine::try_future(), because on the error path, the
querier has to be closed.
Table needs flush if not all its memtable lists are empty.
To be used in the next patch for a unit test.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
With the additional file_stat overload introduced in
3e9b071838, use the opened
directory for more efficient, relative-path based stat.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The functor is called both on the data directory as well
as on the staging directory, so the warning printed if the
found file is not the same inode should print the given path,
not datadir / name (as was copy and pasted).
Refs #27635
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
staging is based off of datadir, not snapshot_dir.
the issue was introduced in f5ca3657e2.
Refs #27635
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
* seastar f0298e40...4dcd4df5 (29):
> file: provide a default implementation for file_impl::statat
> util: Genralize memory_data_sink
> defer: Replace static_assert() with concept
> treewide: drop the support of fmtlib < 9.0.0
> test: Improve resilience of netsed scheduling fairness test
> Merge 'file: Use query_device_alignment_info in blkdev_alignments ' from Kefu Chai
file: Put alignment helpers in anonymous namespace
file: Use query_device_alignment_info in blkdev_alignments
> Merge 'file: Query physical block size and minimum I/O size' from Kefu Chai
file: Apply physical_block_size override to filesystem files
file: Use designated initializers in xfs_alignments
iotune: Add physical block size detection
disk_params: Add support for physical_block_size overrides from io_properties.yaml
block_device: Query alignment requirements separately for memory and I/O
> Merge 'json: formatter: fix formatting of std:string_view' from Benny Halevy
json: formatter: fix formatting of std:string_view
json: formatter: make sure std::string_view conforms to is_string_like
Fixes#27887
> demos:improve the output of demo_with_io_intent() in file_demo
> test: Add accept() vs accept_abort() socket test
> file: Refine posix_file_impl alignments initialization
> Add file::statat and a corresponding file_stat overload
> cmake: don't compile memcached app for API < 9
> Merge 'Revert to ~old lifetime semantics for lvalues passed to then()-alikes' from Travis Downs
future: adjust lifetime for lvalue continuations
future: fix value class operator()
> pollable_fd: Unfriend everything
> Merge 'file: experimental_list_directory: use buffered generator' from Benny Halevy
file: experimental_list_directory: use buffered generator
file: define list_directory_generator_type
> Merge 'Make datagram API use temporary_buffer<>-s' from Pavel Emelyanov
net: Deprecate datagram::get_data() returning packet
memcache: Fix indentation after previous patch
memcache: Use new datagram::get_buffers() API
dns: Use new datagram::get_buffers() API
tests: Use new datagram::get_buffers() API
demo: Use new datagram::get_buffers() API
udp: Make datagram implementations return span of temporary_buffer-s
> Merge 'Remove callback from timer_set::complete()' from Pavel Emelyanov
reactor: Fix indentation after previous patch
timers: Remove enabling callback from timer_set::complete()
> treewide: avoid 'static sstring' in favor of 'constexpr string_view'
> resource: Hide hwloc from public interface
> Merge 'Fix handle_exception_type for lvalues' from Travis Downs
futures_test: compile-time tests
function_traits: handle reference_wrapper
> posix_data_sink_impl: Assert to guard put UB
> treewide: fix build with `SEASTAR_SSTRING` undefined
> avoid deprecation warnings for json_exception
> `util/variant_utils`: correct type deduction for `seastar::visit`
> net/dns: fixed socket concurrent access
> treewide: add missing headers
> Merge 'Remove posix file helper file_read_state class' from Pavel Emelyanov
file: Remove file_read_state
test: Add a test for posix_file_impl::do_dma_read_bulk()
> membarrier: simplify locking
Adjust scylla to the following changes in scylla:
- file_stat became polymorphic
- needs explicit inference in table::snapshot_exists, table::get_snapshot_details
- file::experimental_list_directory now returns list_directory_generator_type
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#27916
Irritated by prevailing spellchecker comments attached to every PR, I aim to fix them all.
No need to backport, just cosmetic changes.
Closesscylladb/scylladb#27897
* github.com:scylladb/scylladb:
treewide: fix some spelling errors
codespell: ignore `iif` and `tread`
- table, storage_group: add compaction_group_count
- And use to reserve vector capacity before adding an item per compaction_group
- table: reduce allocations by using for_each_compaction_group rather than compaction_groups()
- compaction_groups() may allocate memory, but when called from a synchronous call site, the caller can use for_each_compaction_group instead.
* Improvement, no backport needed
Closesscylladb/scylladb#27479
* github.com:scylladb/scylladb:
table: reduce allocations by using for_each_compaction_group rather than compaction_groups()
replica: storage_group: rename compaction_groups to compaction_groups_immediate
compaction_groups_immediate() may allocate memory, but when called from a
synchronous call site, the caller can use for_each_compaction_group
instead to iterate over the compaction groups with no extra allocations.
Calling compaction_groups_immediate() is still required from an async
context when we want to "sample" the compaction groups
so we can safely iterate over them and yield in the inner loop.
Also, some performance insensitive call sites using
compaction_groups_immediate had been left as they are
to keep them simple.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The method in question knows that it writes snapshot to local filesystem and uses this actively. This PR relaxes this knowledge and splits the logic into two parts -- one that orchestrates sstables snapshot and collects the necessary metadata, and the code that writes the metadata itself.
Closesscylladb/scylladb#27762
* github.com:scylladb/scylladb:
table: Move snapshot_file_set to table.cc
table: Rename and move snapshot_on_all_shards() method
table: Ditch jsondir variable
table, sstables: Pass snapshot name to sstable::snapshot()
table: Use snapshot_writer in write_manifest()
table: Use snapshot_writer in write_schema_as_cql()
table: Add snapshot_writer::sync()
table: Add snapshot_writer::init()
table: Introduce snapshot_writer
table: Move final sync and rename seal_snapshot()
table: Hide write_schema_as_cql()
table: Hide table::seal_snapshot()
table: Open-code finalize_snapshot()
table: Fix indentation after previuous patch
table: Use smp::invoke_on_all() to populate the vector with filenames
table: Don't touch dir once more on seal_snapshot()
table: Open-code table::take_snapshot() into caller lambda
table: Move parts of table::take_snapshot to sstables_manager
table: Introduce table::take_snapshot()
table: Store the result of smp::submit_to in local variable
The table::seal_snapshot() accepts a vector of sstables filenames and
writes them into manifest file. For that, it iterates over the vector
and moves all filenames from it into the streamer object.
The problem is that the vector contains foreign pointers on sets with
sstrings. Not only sets are foreign, sstrings in it are foreign too.
It's not correct to std::move() them to local CPU.
The fix is to make streamer object work on string_view-s and populate it
with non-owning references to the sstrings from aforementioned sets.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#27755
Now it's database::snapshot_table_on_all_shards(). This is symmetric to
database::truncate_table_on_all_shards().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now the table::snapshot_on_all_shards() is storage-independent and can
stop maintaining the local path variable.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently sstable::snapshot() is called with directory name where to put
snapshots into. This patch changes it to accept snapshot name instead.
This makes the table-sstable API be unware of snapshot destination
storage type.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The manifest writing code is self-contained in a sense that it needs
list of sstable files and output_stream to write it too. The
snapshot_writer can provide output_stream for specific component, it can
be re-used for manifest writing code, thus making it independent from
local filesystem.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The schema writing code is self-contained in a sense that it needs
schema description and output_stream to write it too. Teach the
snapshot_writer to provide output_stream and make write_schema_as_cql()
be independent from local filesystem.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's an abstract class that defines how to write data and metadata with
table snapshot. Currently it just replaces the storage_options checks
done by table::snapshot_on_all_shards(), but it will soon evolve.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The seal_snapshot() syncs directory at the end. Now when the method is
table.cc-local, it doesn't need to be that careful. It looks nicer if
being renamed to write_manifest() and it's caller that syncs directory
after calling it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method only needs schema description from table. The caller can
pre-get it and pass it as argument. This makes it symmetric with
seal_snapshot() (that will be renamed soon) and reduces the class table
API size.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method is static and has nothing to do with table. The
snapshot_file_set needs to become public, but it will be moved to
table.cc soon.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a vector of foreign pointers to sets with sstable filenames
that's populated on all shards. The code does the invoke-on-all by hand
to grow the vector with push-back-s. However, if resizing the vector in
advance, shards will be able to just populate their slots.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now when the logic of take_snapshot() is split between two components
(table and sstables_manager) it's no longer useful
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Move the loop over vector of sstables that calls sstable->snapshot()
into sstables manager.
This makes it symmetric with sstables_manager::delete_atomically() and
allows for future changes.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method returns all sstables vector with a guard that prevents this
list from being modified. Currently this is the part of another existing
table::take_snapshot() method, but the newer, smaller one, is more
atomic and self-contained, next patches will benefit from it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>