In c5668d99, a new source file row_cache.cc was added to the `db` target,
but with an extraneous trailing comma. In CMake's target_sources(),
source files should be space-separated - any comma is interpreted as
part of the filename, causing build failures like:
```
CMake Error at db/CMakeLists.txt:2 (target_sources):
Cannot find source file:
row_cache.cc,
```
Fix the issue by removing the trailing comma.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22754
This series extends the table schema with per-table tablet options.
The options are used as hints for initial tablet allocation on table creation and later for resize (split or merge) decisions,
when the table size changes.
* New feature, no backport required
Closesscylladb/scylladb#22090
* github.com:scylladb/scylladb:
tablets: resize_decision: get rid of initial_decision
tablet_allocator: consider tablet options for resize decision
tablet_allocator: load_balancer: table_size_desc: keep target_tablet_size as member
network_topology_strategy: allocate_tablets_for_new_table: consider tablet options
network_topology_strategy: calculate_initial_tablets_from_topology: precalculate shards per dc using for_each_token_owner
network_topology_strategy: calculate_initial_tablets_from_topology: set default rf to 0
cql3: data_dictionary: format keyspace_metadata: print "enabled":true when initial_tablets=0
cql3/create_keyspace_statement: add deprecation warning for initial tablets
test: cqlpy: test_tablets: add tests for per-table tablet options
schema: add per-table tablet options
feature_service: add TABLET_OPTIONS cluster schema feature
This pull request is an implementation of vector data type similar to one used by Apache Cassandra.
The patch contains:
- implementation of vector_type_impl class
- necessary functionalities similar to other data types
- support for serialization and deserialization of vectors
- support for Lua and JSON format
- valid CQL syntax for `vector<>` type
- `type_parser` support for vectors
- expression adjustments such as:
- add `collection_constructor::style_type::vector`
- rename `collection_constructor::style_type::list` to `collection_constructor::style_type::list_or_vector`
- vector type encoding (for drivers)
- unit tests
- cassandra compatibility tests
- necessary documentation
Co-authored-by: @janpiotrlakomy
Fixes https://github.com/scylladb/scylladb/issues/19455Closesscylladb/scylladb#22488
* github.com:scylladb/scylladb:
docs: add vector type documentation
cassandra_tests: translate tests covering the vector type
type_codec: add vector type encoding
boost/expr_test: add vector expression tests
expression: adjust collection constructor list style
expression: add vector style type
test/boost: add vector type cql_env boost tests
test/boost: add vector type_parser tests
type_parser: support vector type
cql3: add vector type syntax
types: implement vector_type_impl
these unused includes were identified by clang-include-cleaner. after
auditing these source files, all of the reports have been confirmed.
also, took this opportunity to remove an unused namespace alias. and
add an include which is used actually. please note,
`std::ranges::pop_heap()` and friends are actually provided by
`<algorithm>` not `<ranges>`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22716
Unlike with vnodes, each tablet is served only by a single
shard, and it is associated with a memtable that, when
flushed, it creates sstables which token-range is confined
to the tablet owning them.
On one hand, this allows for far better agility and elasticity
since migration of tablets between nodes or shards does not
require rewriting most if not all of the sstables, as required
with vnodes (at the cleanup phase).
Having too few tablets might limit performance due not
being served by all shards or by imbalance between shards
caused by quantization. The number of tabelts per table has to be
a power of 2 with the current design, and when divided by the
number of shards, some shards will serve N tablets, while others
may serve N+1, and when N is small N+1/N may be significantly
larger than 1. For example, with N=1, some shards will serve
2 tablet replicas and some will serve only 1, causing an imbalance
of 100%.
Now, simply allocating a lot more tablets for each table may
theoretically address this problem, but practically:
a. Each tablet has memory overhead and having too many tablets
in the system with many tables and many tablets for each of them
may overwhelm the system's and cause out-of-memory errors.
b. Too-small tablets cause a proliferation of small sstables
that are less efficient to acces, have higher metadata overhead
(due to per-sstable overhead), and might exhaust the system's
open file-descriptors limitations.
The options introduced in this change can help the user tune
the system in two ways:
1. Sizing the table to prevent unnecessary tablet splits
and migrations. This can be done when the table is created,
or later on, using ALTER TABLE.
2. Controlling min_per_shard_tablet_count to improve
tablet balancing, for hot tables.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This config item controls how many CPU-bound reads are allowed to run in
parallel. The effective concurrency of a single CPU core is 1, so
allowing more than one CPU-bound reads to run concurrently will just
result in time-sharing and both reads having higher latency.
However, restricting concurrency to 1 means that a CPU bound read that
takes a lot of time to complete can block other quick reads while it is
running. Increase this default setting to 2 as a compromise between not
over-using time-sharing, while not allowing such slow reads to block the
queue behind them.
Fixes: #22450Closesscylladb/scylladb#22679
This PR enhances the internode_compression configuration option in two ways:
1. Add validation for option values
Previously, we silently defaulted to 'none' when given invalid values. Now we
explicitly validate against the three supported values (all, dc, none) and
reject invalid inputs. This provides better error messages when users
misconfigure the option.
2. Fix documentation rendering
The help text for this option previously used C++ escape sequences which
rendered incorrectly in Sphinx-generated HTML. We now use bullet points with
'*' prefix to list the available values, matching our documentation style
for other config options. This ensures consistent rendering in both CLI
and HTML outputs.
Note: The current documentation format puts type/default/liveness information
in the same bullet list as option values. This affects other config options
as well and will need to be addressed in a separate change.
---
this improves the handling of invalid option values, and improves the doc rendering, neither of which is critical. hence no need to backport.
Closesscylladb/scylladb#22548
* github.com:scylladb/scylladb:
config: validate internode_compression option values
config: start available options with '*'
The view builder builds a view by going over the entire token ring,
consuming the base table partitions, and generating view updates for
each partition.
A view is considered as built when we complete a full cycle of the
token ring. Suppose we start to build a view at a token F. We will
consume all partitions with tokens starting at F until the maximum
token, then go back to the minimum token and consume all partitions
until F, and then we detect that we pass F and complete building the
view. This happens in the view builder consumer in
`check_for_built_views`.
The problem is that we check if we pass the first token F with the
condition `_step.current_token() >= it->first_token` whenever we consume
a new partition or the current_token goes back to the minimum token.
But suppose that we don't have any partitions with a token greater than
or equal to the first token (this could happen if the partition with
token F was moved to another node for example), then this condition will never be
satisfied, and we don't detect correctly when we pass F. Instead, we
go back to the minimum token, building the same token ranges again,
in a possibly infinite loop.
To fix this we add another step when reaching the end of the reader's
stream. When this happens it means we don't have any more fragments to
consume until the end of the range, so we advance the current_token to
the end of the range, simulating a partition, and check for built views
in that range.
Fixesscylladb/scylladb#21829Closesscylladb/scylladb#22493
Enabled with the tablets_rack_aware_view_pairing cluster feature
rack-aware pairing pairs base to view replicas that are in the
same dc and rack, using their ordinality in the replica map
We distinguish between 2 cases:
- Simple rack-aware pairing: when the replication factor in the dc
is a multiple of the number of racks and the minimum number of nodes
per rack in the dc is greater than or equal to rf / nr_racks.
In this case (that includes the single rack case), all racks would
have the same number of replicas, so we first filter all replicas
by dc and rack, retaining their ordinality in the process, and
finally, we pair between the base replicas and view replicas,
that are in the same rack, using their original order in the
tablet-map replica set.
For example, nr_racks=2, rf=4:
base_replicas = { N00, N01, N10, N11 }
view_replicas = { N11, N12, N01, N02 }
pairing would be: { N00, N01 }, { N01, N02 }, { N10, N11 }, { N11, N12 }
Note that we don't optimize for self-pairing if it breaks pairing ordinality.
- Complex rack-aware pairing: when the replication factor is not
a multiple of nr_racks. In this case, we attempt best-match
pairing in all racks, using the minimum number of base or view replicas
in each rack (given their global ordinality), while pairing all the other
replicas, across racks, sorted by their ordinality.
For example, nr_racks=4, rf=3:
base_replicas = { N00, N10, N20 }
view_replicas = { N11, N21, N31 }
pairing would be: { N00, N31 }\*, { N10, N11 }, { N20, N21 }
\* cross-rack pair
If we'd simply stable-sort both base and view replicas by rack,
we might end up with much worse pairing across racks:
{ N00, N11 }\*, { N10, N21 }\*, { N20, N31 }\*
\* cross-rack pair
Fixesscylladb/scylladb#17147
* This is an improvement so no backport is required
Closesscylladb/scylladb#21453
* github.com:scylladb/scylladb:
network_topology_strategy_test: add tablets rack_aware_view_pairing tests
view: get_view_natural_endpoint: implement rack-aware pairing for tablets
view: get_view_natural_endpoint: handle case when there are too few view replicas
view: get_view_natural_endpoint: track replica locator::nodes
locator: topology: consult local_dc_rack if node not found by host_id
locator: node: add dc and rack getters
feature_service: add tablet_rack_aware_view_pairing feature
view: get_view_natural_endpoint: refactor predicate function
view: get_view_natural_endpoint: clarify documentation
view: mutate_MV: optimize remote_endpoints filtering check
view: mutate_MV: lookup base and view erms synchronously
view: mutate_MV: calculate keyspace-dependent flags once
This change:
- Remove code that prevented audit from starting if audit_categories,
audit_tables, and audit_keyspaces are not configured
- Set liveness::LiveUpdate for audit_categories, audit_tables,
and audit_keyspaces
- Keep const reference to db::config in audit, so current config values
can be obtained by audit implementation
- Implement function audit::update_config to parse given string, update
audit datastructures when needed, and log the changes.
- Add observers to call audit::update_config when categories,
tables, or keyspaces configuration changes
New functionality, so no backport needed.
Fixes https://github.com/scylladb/scylla-enterprise/issues/1789Closesscylladb/scylladb#22449
* github.com:scylladb/scylladb:
audit: make categories, tables, and keyspaces liveupdatable
audit: move static parsing functions above audit constructors
audit: move statement_category to string conversion to static function
audit: start audit even with empty categories/tables/keyspaces
Previously, the internode_compression option silently defaulted to 'none'
for any unrecognized value instead of validating input. It only compared
against 'all' and 'dc', making it error-prone.
Add explicit validation for the three supported values:
- all
- dc
- none
This ensures invalid values are rejected both in command line and YAML
configuration, providing better error messages to users.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
use '*' prefix for config option values instead of escape sequences
The custom Sphinx extension that generates documentation from config.cc
help messages has issues with C++ escape sequences. For example,
"\tall: All traffic" renders incorrectly as "tall: All traffic" in HTML
output.
Instead of using escape sequences, switch to bullet-point style with '*'
prefix which works better in both CLI and HTML rendering. This matches
our existing documentation style for available option values in other
configs.
Note: This change puts type/default/liveness info in the same bullet
list as option values. This limitation affects other similar config
options and will need to be addressed comprehensively in a future
change.
Refs scylladb/scylladb#22423
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
This change is introduced due to lack of support for vector class name,
used by type_parser to create data_type based on given class name
(especially compound class name with inner types or other parameters).
Add function that parses vector type parameters from a class name.
Previously, during backup, SSTable components are preserved in the
snapshot directory even after being uploaded. This leads to redundant
uploads in case of failed backups or restarts, wasting time and
resources (S3 API calls).
This change removes SSTable components from the snapshot directory once
they are successfully uploaded to the target location. This prevents
re-uploading the same files and reduces disk usage.
This change only "Refs" https://github.com/scylladb/scylladb/issues/20655, because, we can further optimize
the backup process, consider:
- Sending HEAD requests to S3 to check for existing files before uploading.
- Implementing support for resuming partially uploaded files.
Fixes https://github.com/scylladb/scylladb/issues/21799
Refs https://github.com/scylladb/scylladb/issues/20655
---
the backup API is not used in production yet, so no need to backport.
Closesscylladb/scylladb#22285
* github.com:scylladb/scylladb:
backup_task: remove a component once it is uploaded
backup_task: extract component upload logic into dedicated function
snapshot-ctl: change snapshot_ctl::run_snapshot_modify_operation() to regular func
these unused includes were identified by clang-include-cleaner. after
auditing these source files, all of the reports have been confirmed.
in which, instead of using `seastarx.hh`, `readers/mutation_reader.hh`,
use `using seastar::future` to include `future` in the global namespace,
this makes `readers/mutation_reader.hh` a header exposing `future<>`,
but this is not a good practice, because, unlike `seastarx.hh` or
`seastar/core/future.hh`, `reader/mutation_reader.hh` is not
responsible for exposing seastar declarations. so, we trade the
using statement for `#include "seastarx.hh"` in that file to decouple
the source files including it from this header because of this statement.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22439
This change:
- Set liveness::LiveUpdate for audit_categories, audit_tables,
and audit_keyspaces
- Keep const reference to db::config in audit, so current config values
can be obtained by audit implementation
- Implement function audit::update_config to parse given string, update
audit datastructures when needed, and log the changes.
- Add observers to call audit::update_config when categories,
tables, or keyspaces configuration changes
Fixesscylladb/scylla-enterprise#1789
these misspellings were identified by codespell. let's fix them.
one of them is a part of a user visble string.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22443
Specialize config_from_string() for sstring to resolve lexical_cast
stream state parsing limitation. This enables correct handling of empty
string configurations, such as setting an empty value in CQL:
```cql
UPDATE system.config SET value='' WHERE
name='allowed_repair_based_node_ops';
```
Previous implementation using boost::lexical_cast would fail due to
EOF stream state, incorrectly rejecting valid empty string conversions.
Fixesscylladb/scylladb#22491
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22492
File based stream is a new feature that optimizes tablet movement
significantly. It streams the entire SSTable files without deserializing
SSTable files into mutation fragments and re-serializing them back into
SSTables on receiving nodes. As a result, less data is streamed over the
network, and less CPU is consumed, especially for data models that
contain small cells.
The following patches are imported from the scylla enterprise:
*) Merge 'Introduce file stream for tablet' from Asias He
This patch uses Seastar RPC stream interface to stream sstable files on
network for tablet migration.
It streams sstables instead of mutation fragments. The file based
stream has multiple advantages over the mutation streaming.
- No serialization or deserialization for mutation fragments
- No need to read and process each mutation fragments
- On wire data is more compact and smaller
In the test below, a significant speed up is observed.
Two nodes, 1 shard per node, 1 initial_tablets:
- Start node 1
- Insert 10M rows of data with c-s
- Bootstrap node 2
Node 1 will migration data to node2 with the file stream.
Test results:
1) File stream: bytes on wire = 1132006250 bytes, bw = 836MB/s
[shard 0:stre] stream_blob - stream_sstables[eadaa8e0-a4f2-4cc6-bf10-39ad1ce106b0]
Finished sending sstable_nr=2 files_nr=18 files={} range=(-1,9223372036854775807] bytes_sent=1132006250 stream_bw=836MB/s
[shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 1.08004s seconds
2) Mutation stream: bytes on wire = 3030004736 bytes, bw = 125410.87 KiB/s = 128MB/s
[shard 0:stre] stream_session - [Stream #406dc8b0-56b5-11ee-bc2d-000bf4871058]
Streaming plan for Tablet migration-ks1-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=2958989 KiB, 125410.87 KiB/s
[shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 23.5992s seconds
Test Summary:
File stream v.s. Mutation stream improvements
- Stream bandwidth = 836 / 128 (MB/s) = 6.53X
- Stream time = 23.60 / 1.08 (Seconds) = 21.85X
- Stream bytes on wire = 3030004736 / 1132006250 (Bytes)= 2.67X
Closes scylladb/scylla-enterprise#3438
* github.com:scylladb/scylla-enterprise:
tests: Add file_stream_test
streaming: Implement file stream for tablet
*) streaming: Use new take_storage_snapshot interface
The new take_storage_snapshot returns a file object instead of a file
name. This allows the file stream sender to read from the file even if
the file is deleted by compaction.
Closes scylladb/scylla-enterprise#3728
*) streaming: Protect unsupported file types for file stream
Currently, we assume the file streamed over the stream_blob rpc verb is
a sstable file. This patch rejects the unsupported file types on the
receiver side. This allows us to stream more file types later using the
current file stream infrastructure without worrying about old nodes
processing the new file types in the wrong way.
- The file_ops::noop is renamed to file_ops::stream_sstables to be
explicit about the file types
- A missing test_file_stream_error_injection is added to the idl
Fixes: #3846
Tests: test_unsupported_file_ops
Closesscylladb/scylla-enterprise#3847
*) idl: Add service::session_id id to idl
It will be used in the next patch.
Refs #3907
*) streaming: Protect file stream with topology_guard
Similar to "storage_service, tablets: Use session to guard tablet
streaming", this patch protects file stream with topology_guard.
Fixes#3907
*) streaming: Take service topology_guard under the try block
Taking the service::topology_guard could throw. Currently, it throws
outside the try block, so the rpc sink will not be closed, causing the
following assertion:
```
scylla: seastar/include/seastar/rpc/rpc_impl.hh:815: virtual
seastar::rpc::sink_impl<netw::serializer,
streaming::stream_blob_cmd_data>::~sink_impl() [Serializer =
netw::serializer, Out = <streaming::stream_blob_cmd_data>]: Assertion
`this->_con->get()->sink_closed()' failed.
```
To fix, move more code including the topology_guard taking code to the
try block.
Fixes https://github.com/scylladb/scylla-enterprise/issues/4106Closesscylladb/scylla-enterprise#4110
*) Merge 'Preserve original SSTable state with file based tablet migration' from Raphael "Raph" Carvalho
We're not preserving the SSTable state across file based migration, so
staging SSTables for example are being placed into main directory, and
consequently, we're mixing staging and non-staging data, losing the
ability to continue from where the old replica left off.
It's expected that the view update backlog is transferred from old
into new replica, as migration doesn't wait for leaving replica to
complete view update work (which can take long). Elasticity is preferred.
So this fix guarantees that the state of the SSTable will be preserved
by propagating it in form of subdirectory (each subdirectory is
statically mapped with a particular state).
The staging sstables aren't being registered into view update generator
yet, as that's supposed to be fixed in OSS (more details can be found
at https://github.com/scylladb/scylladb/issues/19149).
Fixes#4265.
Closesscylladb/scylla-enterprise#4267
* github.com:scylladb/scylla-enterprise:
tablet: Preserve original SSTable state with file based tablet migration
sstables: Add get method for sstable state
*) sstable: (Re-)add shareabled_components getter
*) Merge 'File streaming sstables: Use sstable source/sink to transfer snapshots' from Calle Wilund
Fixes#4246
Alternative approach/better separation of concern, transport vs. sstable layer. Builds on #4472, but fancier.
Ensures we transfer and pre-process scylla metadata for streamed
file blobs first, then properly apply receiving nodes local config
by using a source and sink layer exported from sstables, which
handles things like ordering, metadata filtering (on source) as well
as handling metadata and proper IO paths when writing data on
receiver node (sink).
This implementation maintains the statelessness of the current
design, and the delegated sink side will re-read and re-write the
metadata for each component processed. This is a little wasteful,
but the meta is small, and it is less error prone than trying to do
caching cross-shards etc. The transport is isolated from the
knowledge.
This is an alternative/complement to #4436 and #4472, fixing the
underlying issue. Note that while the layers/API:s here allows easy
fixing of other fundamental problems in the feature (such as
destination location etc), these are not included in the PR, to keep
it as close to the current behaviour as possible.
Closesscylladb/scylla-enterprise#4646
* github.com:scylladb/scylla-enterprise:
raft_tests: Copy/add a topology test with encryption
file streaming: Use sstable source/sink to transfer snapshots
sstables: Add source and sink objects + producers for transfering a snapshot
sstable::types: Add remove accessor for extension info in metadata
*) The change for error injection in merge commit 966ea5955dd8760:
File streaming now has "stream_mutation_fragments" error injection points
so test_table_dropped_during_streaming works with file streaming.
*) doc: document file-based streaming
This commit adds a description of the file-based streaming feature to the documentation.
It will be displayed in the docs using the scylladb_include_flag directive after
https://github.com/scylladb/scylladb/pull/20182 is merged, backported to branch-6.0,
and, in turn, branch-2024.2.
Refs https://github.com/scylladb/scylla-enterprise/issues/4585
Refs https://github.com/scylladb/scylla-enterprise/issues/4254Closesscylladb/scylla-enterprise#4587
*) doc: move File-based streaming to the Tablets source file-based-streaming
This commit moves the description of file-based streaming from a common include file
to the regular doc source file where tablets are described.
Closesscylladb/scylla-enterprise#4652
*) streaming: sstable_stream_sink_impl: abort: prevent null pointer dereference
Closesscylladb/scylladb#22467
Test the simple case of base/view pairing with replication_factor
that is a multiple of the number of racks.
As well as the complex case when simple_tablets_rack_aware_view_pairing
is not possible.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Enabled with the tablets_rack_aware_view_pairing cluster feature
rack-aware pairing pairs base to view replicas that are in the
same dc and rack, using their ordinality in the replica map
We distinguish between 2 cases:
- Simple rack-aware pairing: when the replication factor in the dc
is a multiple of the number of racks and the minimum number of nodes
per rack in the dc is greater than or equal to rf / nr_racks.
In this case (that includes the single rack case), all racks would
have the same number of replicas, so we first filter all replicas
by dc and rack, retaining their ordinality in the process, and
finally, we pair between the base replicas and view replicas,
that are in the same rack, using their original order in the
tablet-map replica set.
For example, nr_racks=2, rf=4:
base_replicas = { N00, N01, N10, N11 }
view_replicas = { N11, N12, N01, N02 }
pairing would be: { N00, N01 }, { N01, N02 }, { N10, N11 }, { N11, N12 }
Note that we don't optimize for self-pairing if it breaks pairing ordinality.
- Complex rack-aware pairing: when the replication factor is not
a multiple of nr_racks. In this case, we attempt best-match
pairing in all racks, using the minimum number of base or view replicas
in each rack (given their global ordinality), while pairing all the other
replicas, across racks, sorted by their ordinality.
For example, nr_racks=4, rf=3:
base_replicas = { N00, N10, N20 }
view_replicas = { N11, N21, N31 }
pairing would be: { N00, N31 }*, { N10, N11 }, { N20, N21 }
* cross-rack pair
If we'd simply stable-sort both base and view replicas by rack,
we might end up with much worse pairing across racks:
{ N00, N11 }*, { N10, N21 }*, { N20, N31 }*
* cross-rack pair
Fixesscylladb/scylladb#17147
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, when reducing RF, we may drop replicas from
the view before dropping replicas from the base table.
Since get_view_natural_endpoint is allowed to return
a disengaged optional if it can't find a pair for the
base replica, replcace the exiting assertion with code
handling this case, and count those events in a new
table metric: total_view_updates_failed_pairing.
Note that this does not fix the root cause for the issue
which is the unsynchronized dropping of replicas, that
should be atomic, using a single group0 transaction.
Refs scylladb/scylladb#21492
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than tracking only the replica host_id, keep
track of the locator:::node& to prepare for
rack-aware pairing.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Simplify the function logic by calculating the predicate
function once, before scanning all base and view replicas,
rather than testing the different options in the inner loop.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
"self-pairing" is enabled only when use_legacy_self_pairing
is enabled. That is currently unclear in the documentation
comment for this function.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently we always lookup both `my_address` and *target_endpoint
in remote_endpoints. But if my_address is in remote_endpoints
in some cases the second lookup is not needed, so do it only
to decide whether to swap target_endpoint with my_address, if
found in remote_endpoints, or to remove that match, if
*target_endpoint is already pending as well.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Although at the moment storage_service::replicate_to_all_cores
may yield between updating the base and view tables with
a new effective_replication_map, scylladb/scylladb#21781
was submitted to change that so that they are updated
atomically together.
This change prepares for the above change, and is harmless
at the moment.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
All view live in the same keyspace as their base
table, so calculate the keyspace-dependent flags
once, outside the per-view update loop.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Previously, during backup, SSTable components are preserved in the
snapshot directory even after being uploaded. This leads to redundant
uploads in case of failed backups or restarts, wasting time and
resources (S3 API calls).
This change
- adds an optional query parameter named "move_files" to
"/storage_service/backup" API. if it is set to "true", SSTable
components are removed once they are backed up to object storage.
- conditionally removes SSTable components from the snapshot directory once
they are successfully uploaded to the target location. This prevents
re-uploading the same files and reduces disk usage.
This change only "Refs" #20655, because, we can move further optimize
the backup process, consider:
- Sending HEAD requests to S3 to check for existing files before uploading.
- Implementing support for resuming partially uploaded files.
Fixes#21799
Refs #20655
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Extract upload_component() from backup_task_impl::do_backup() to improve
readability and prepare for optional post-upload cleanup. This refactoring
simplifies the main backup flow by isolating the upload logic into its own
function.
The change is motivated by an upcoming feature that will allow optional
deletion of components after successful upload, which would otherwise add
complexity to do_backup().
Refs scylladb/scylladb#21799
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
instead of implementing `snapshot_ctl::run_snapshot_modify_operation()` as
a template function, let it accept as plain noncopyable_function
instance, and return `future<>`.
Previously, `snapshot_ctl::run_snapshot_modify_operation` was a template
function that accepted a templated functor parameter. This approach
limited its usability because callers needed to be defined in the same
translation unit as the template implementation.
however, `backup_task_impl` is defined in another translation unit,
and we intend to call `snapshot_ctl::run_snapshot_modify_operation()`
in its implementation. so in order to cater this need, there are two
options:
1. to move the definition of the template function into the header file.
but the downside is that this slows down the compilation by increaing
the size of header.
2. to change the template function to a regular function. This change
restricts the function's parameter to a specific signature. However, all
current callers already return a `future<>` object, so there's minimal
impact.
in this change, we implement the second option. this allows us to call
this function from another translation unit.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
As discussed in
https://github.com/scylladb/scylladb/issues/12263#issuecomment-1853576813,
compact storage tables are deprecated.
Yet, there's is nothing in the code that prevents users
from creating such tables.
This patch adds a live-updateable config option:
`enable_create_table_with_compact_storage`, set to
`false` by default, that require users to opt-in
in order to create new tables WITH COMPACT STORAGE.
Refs scylladb/scylladb#12263, scylladb/scylladb#16375
* Since this guardrail is an enhancement, no backport is needed
Closesscylladb/scylladb#16403
* github.com:scylladb/scylladb:
docs: ddl: document the deprecation of compact tables
test: enable_create_table_with_compact_storage for tests that need it
config: add enable_create_table_with_compact_storage
File based stream is a new feature that optimizes tablet movement
significantly. It streams the entire SSTable files without deserializing
SSTable files into mutation fragments and re-serializing them back into
SSTables on receiving nodes. As a result, less data is streamed over the
network, and less CPU is consumed, especially for data models that
contain small cells.
The following patches are imported from the scylla enterprise:
*) Merge 'Introduce file stream for tablet' from Asias He
This patch uses Seastar RPC stream interface to stream sstable files on
network for tablet migration.
It streams sstables instead of mutation fragments. The file based
stream has multiple advantages over the mutation streaming.
- No serialization or deserialization for mutation fragments
- No need to read and process each mutation fragments
- On wire data is more compact and smaller
In the test below, a significant speed up is observed.
Two nodes, 1 shard per node, 1 initial_tablets:
- Start node 1
- Insert 10M rows of data with c-s
- Bootstrap node 2
Node 1 will migration data to node2 with the file stream.
Test results:
1) File stream: bytes on wire = 1132006250 bytes, bw = 836MB/s
[shard 0:stre] stream_blob - stream_sstables[eadaa8e0-a4f2-4cc6-bf10-39ad1ce106b0]
Finished sending sstable_nr=2 files_nr=18 files={} range=(-1,9223372036854775807] bytes_sent=1132006250 stream_bw=836MB/s
[shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 1.08004s seconds
2) Mutation stream: bytes on wire = 3030004736 bytes, bw = 125410.87 KiB/s = 128MB/s
[shard 0:stre] stream_session - [Stream #406dc8b0-56b5-11ee-bc2d-000bf4871058]
Streaming plan for Tablet migration-ks1-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=2958989 KiB, 125410.87 KiB/s
[shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 23.5992s seconds
Test Summary:
File stream v.s. Mutation stream improvements
- Stream bandwidth = 836 / 128 (MB/s) = 6.53X
- Stream time = 23.60 / 1.08 (Seconds) = 21.85X
- Stream bytes on wire = 3030004736 / 1132006250 (Bytes)= 2.67X
Closes scylladb/scylla-enterprise#3438
* github.com:scylladb/scylla-enterprise:
tests: Add file_stream_test
streaming: Implement file stream for tablet
*) streaming: Use new take_storage_snapshot interface
The new take_storage_snapshot returns a file object instead of a file
name. This allows the file stream sender to read from the file even if
the file is deleted by compaction.
Closes scylladb/scylla-enterprise#3728
*) streaming: Protect unsupported file types for file stream
Currently, we assume the file streamed over the stream_blob rpc verb is
a sstable file. This patch rejects the unsupported file types on the
receiver side. This allows us to stream more file types later using the
current file stream infrastructure without worrying about old nodes
processing the new file types in the wrong way.
- The file_ops::noop is renamed to file_ops::stream_sstables to be
explicit about the file types
- A missing test_file_stream_error_injection is added to the idl
Fixes: #3846
Tests: test_unsupported_file_ops
Closesscylladb/scylla-enterprise#3847
*) idl: Add service::session_id id to idl
It will be used in the next patch.
Refs #3907
*) streaming: Protect file stream with topology_guard
Similar to "storage_service, tablets: Use session to guard tablet
streaming", this patch protects file stream with topology_guard.
Fixes#3907
*) streaming: Take service topology_guard under the try block
Taking the service::topology_guard could throw. Currently, it throws
outside the try block, so the rpc sink will not be closed, causing the
following assertion:
```
scylla: seastar/include/seastar/rpc/rpc_impl.hh:815: virtual
seastar::rpc::sink_impl<netw::serializer,
streaming::stream_blob_cmd_data>::~sink_impl() [Serializer =
netw::serializer, Out = <streaming::stream_blob_cmd_data>]: Assertion
`this->_con->get()->sink_closed()' failed.
```
To fix, move more code including the topology_guard taking code to the
try block.
Fixes https://github.com/scylladb/scylla-enterprise/issues/4106Closesscylladb/scylla-enterprise#4110
*) Merge 'Preserve original SSTable state with file based tablet migration' from Raphael "Raph" Carvalho
We're not preserving the SSTable state across file based migration, so
staging SSTables for example are being placed into main directory, and
consequently, we're mixing staging and non-staging data, losing the
ability to continue from where the old replica left off.
It's expected that the view update backlog is transferred from old
into new replica, as migration doesn't wait for leaving replica to
complete view update work (which can take long). Elasticity is preferred.
So this fix guarantees that the state of the SSTable will be preserved
by propagating it in form of subdirectory (each subdirectory is
statically mapped with a particular state).
The staging sstables aren't being registered into view update generator
yet, as that's supposed to be fixed in OSS (more details can be found
at https://github.com/scylladb/scylladb/issues/19149).
Fixes#4265.
Closesscylladb/scylla-enterprise#4267
* github.com:scylladb/scylla-enterprise:
tablet: Preserve original SSTable state with file based tablet migration
sstables: Add get method for sstable state
*) sstable: (Re-)add shareabled_components getter
*) Merge 'File streaming sstables: Use sstable source/sink to transfer snapshots' from Calle Wilund
Fixes#4246
Alternative approach/better separation of concern, transport vs. sstable layer. Builds on #4472, but fancier.
Ensures we transfer and pre-process scylla metadata for streamed
file blobs first, then properly apply receiving nodes local config
by using a source and sink layer exported from sstables, which
handles things like ordering, metadata filtering (on source) as well
as handling metadata and proper IO paths when writing data on
receiver node (sink).
This implementation maintains the statelessness of the current
design, and the delegated sink side will re-read and re-write the
metadata for each component processed. This is a little wasteful,
but the meta is small, and it is less error prone than trying to do
caching cross-shards etc. The transport is isolated from the
knowledge.
This is an alternative/complement to #4436 and #4472, fixing the
underlying issue. Note that while the layers/API:s here allows easy
fixing of other fundamental problems in the feature (such as
destination location etc), these are not included in the PR, to keep
it as close to the current behaviour as possible.
Closesscylladb/scylla-enterprise#4646
* github.com:scylladb/scylla-enterprise:
raft_tests: Copy/add a topology test with encryption
file streaming: Use sstable source/sink to transfer snapshots
sstables: Add source and sink objects + producers for transfering a snapshot
sstable::types: Add remove accessor for extension info in metadata
*) The change for error injection in merge commit 966ea5955dd8760:
File streaming now has "stream_mutation_fragments" error injection points
so test_table_dropped_during_streaming works with file streaming.
*) doc: document file-based streaming
This commit adds a description of the file-based streaming feature to the documentation.
It will be displayed in the docs using the scylladb_include_flag directive after
https://github.com/scylladb/scylladb/pull/20182 is merged, backported to branch-6.0,
and, in turn, branch-2024.2.
Refs https://github.com/scylladb/scylla-enterprise/issues/4585
Refs https://github.com/scylladb/scylla-enterprise/issues/4254Closesscylladb/scylla-enterprise#4587
*) doc: move File-based streaming to the Tablets source file-based-streaming
This commit moves the description of file-based streaming from a common include file
to the regular doc source file where tablets are described.
Closesscylladb/scylla-enterprise#4652
*) streaming: sstable_stream_sink_impl: abort: prevent null pointer dereference
Closesscylladb/scylladb#22034
As discussed in
https://github.com/scylladb/scylladb/issues/12263#issuecomment-1853576813,
compact storage tables are deprecated.
Yet, there's is nothing in the code that prevents users
from creating such tables.
This patch adds a live-updateable config option:
`enable_create_table_with_compact_storage` that require users
to opt-in in order to create new tables WITH COMPACT STORAGE.
The option is currently set to `true` by default in db/config
to reduce the churn to tests and to `false` in scylla.yaml,
for new clusters.
TODO: once regressions tests that use compact storage
are converted to enable the option, change the default in
db/config to false.
A unit test was added to test/cql-pytest that
checks that the respective cql query fails as expected
with the default option or when it is explicitly set to `false`,
and that the query succeeds when the option is set to `true`.
Note that `check_restricted_table_properties` already
returns an optional warning, but it is only logged
but not returned in the `prepared_statement`.
Fixing that is out of the scope of this patch.
See https://github.com/scylladb/scylladb/issues/20945
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Introduces a comprehensive audit system to track database operations for security
and compliance purposes. This change includes:
Core Components:
- New audit subsystem for logging database operations
- Service level integration for proper resource management
- CQL statement tracking with operation categories
- Login process integration for tenant management
Key Features:
- Configurable audit logging (syslog/table)
- Operation categorization (QUERY/DML/DDL/DCL/AUTH/ADMIN)
- Selective auditing by keyspace/table
- Password sanitization in audit logs
- Service level shares support (1-1000) for workload prioritization
- Proper lifecycle management and cleanup
I ran the dtests for audit (manually enabled) and they pass.
The in-repo tests pass.
Notably, there should be no non-whitespace changes between this and scylla-enterprise
Fixesscylladb/scylla-enterprise#4999Closesscylladb/scylladb#22147
* github.com:scylladb/scylladb:
audit: Add shares support to service level management
audit: Add service level support to CQL login process
audit: Add support to CQL statements
audit: Integrate audit subsystem into Scylla main process
audit: Add documentation for the audit subsystem
audit: Add the audit subsystem
Now that all topology related code uses host ids there is not point to
maintain ip to id (and back) mappings in the token metadata. After the
patch the mapping will be maintained in the gossiper only. The rest of
the system will use host ids and in rare cases where translation is
needed (mostly for UX compatibility reasons) the translation will be
done using gossiper.
Fixes: scylladb/scylla#21777
* 'gleb/drop-ip-from-tm-v3' of github.com:scylladb/scylla-dev: (57 commits)
hint manager: do not translate ip to id in case hint manager is stopped already
locator: token_metadata: drop update_host_id() function that does nothing now
locator: topology: drop indexing by ips
repair: drop unneeded code
storage_service: use host_id to look for a node in on_alive handler
storage_proxy: translate ips to ids in forward array using gossiper
locator: topology: remove unused functions
storage_service: check for outdated ip in on_change notification in the peers table
storage_proxy: translate id to ip using address map in tablets's describe_ring code instead of taking one from the topology
topology coordinator: change connection dropping code to work on host ids
cql3: report host id instead of ip in error during SELECT FROM MUTATION_FRAGMENTS query
locator: drop unused function from tablet_effective_replication_map
api: view_build_statuses: do not use IP from the topology, but translate id to ip using address map instead
locator: token_metadata: remove unused ip based functions
locator: network_topology_strategy: use host_id based function to check number of endpoints in dcs
gossiper: drop get_unreachable_token_owners functions
storage_service: use gossiper to map ip to id in node_ops operations
storage_service: fix indentation after the last patch
storage_service: drop loops from node ops replace_prepare handling since there can be only one replacing node
token_metadata: drop no longer used functions
...
Declaring-but-not-defining a fully specialized template is a great way to
cut dependencies between users and providers, but unfortunately not
supported for variable templates. Clang 18 does support it, but
apparently it is a misinterpretation of the standard, and was removed
in clang 19.
We started using this non-feature in 7ed89266b3.
The fix is to use function templates. This is more verbose as
each specialization needs to define a static variable to return,
but is fully supported.
Closesscylladb/scylladb#22299
host_id_or_endpoint is a helper class that hold either id or ip and
translate one into another on demand. Use gossiper to do a translation
there instead of token_metadata since we want to drop ip based APIs from
the later.