Commit Graph

1260 Commits

Author SHA1 Message Date
Kefu Chai
2a9966a20e gms: Fix fmt formatter for gossip_digest_sync
In commit 4812a57f, the fmt-based formatter for gossip_digest_syn had
formatting code for cluster_id, partitioner, and group0_id
accidentally commented out, preventing these fields from being included
in the output. This commit restores the formatting by uncommenting the
code, ensuring full visibility of all fields in the gossip_digest_syn
message when logging permits.

This fixes a regression introduced in 4812a57f, which obscured these
fields and reduced debugging insight. Backporting is recommended for
improved observability.

Fixes #23142
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23155
2025-03-07 15:36:03 +01:00
Amnon Heiman
1b64fa2283 gms/gossiper.cc: label metrics with basic_level
The following metrics will be marked with basic_level label:
scylla_gossip_heart_beat
scylla_gossip_live
scylla_gossip_unreachable

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2025-03-03 16:58:39 +02:00
Gleb Natapov
914c9f1711 treewide: include build_mode.hh for SCYLLA_BUILD_MODE_RELEASE where it is missing
Fixes: #22914

Closes scylladb/scylladb#22915
2025-02-20 10:50:04 +03:00
Benny Halevy
ad8b0649ff feature_service: add TABLET_OPTIONS cluster schema feature
To be used for enabling per-table tablet options.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-02-06 08:55:51 +02:00
Botond Dénes
af46894bb7 Merge 'Rack aware view pairing' from Benny Halevy
Enabled with the tablets_rack_aware_view_pairing cluster feature
rack-aware pairing pairs base to view replicas that are in the
same dc and rack, using their ordinality in the replica map

We distinguish between 2 cases:
- Simple rack-aware pairing: when the replication factor in the dc
  is a multiple of the number of racks and the minimum number of nodes
  per rack in the dc is greater than or equal to rf / nr_racks.

  In this case (that includes the single rack case), all racks would
  have the same number of replicas, so we first filter all replicas
  by dc and rack, retaining their ordinality in the process, and
  finally, we pair between the base replicas and view replicas,
  that are in the same rack, using their original order in the
  tablet-map replica set.

  For example, nr_racks=2, rf=4:
  base_replicas = { N00, N01, N10, N11 }
  view_replicas = { N11, N12, N01, N02 }
  pairing would be: { N00, N01 }, { N01, N02 }, { N10, N11 }, { N11, N12 }
  Note that we don't optimize for self-pairing if it breaks pairing ordinality.

- Complex rack-aware pairing: when the replication factor is not
  a multiple of nr_racks.  In this case, we attempt best-match
  pairing in all racks, using the minimum number of base or view replicas
  in each rack (given their global ordinality), while pairing all the other
  replicas, across racks, sorted by their ordinality.

  For example, nr_racks=4, rf=3:
  base_replicas = { N00, N10, N20 }
  view_replicas = { N11, N21, N31 }
  pairing would be: { N00, N31 }\*, { N10, N11 }, { N20, N21 }
  \* cross-rack pair

  If we'd simply stable-sort both base and view replicas by rack,
  we might end up with much worse pairing across racks:
  { N00, N11 }\*, { N10, N21 }\*, { N20, N31 }\*
  \* cross-rack pair

Fixes scylladb/scylladb#17147

* This is an improvement so no backport is required

Closes scylladb/scylladb#21453

* github.com:scylladb/scylladb:
  network_topology_strategy_test: add tablets rack_aware_view_pairing tests
  view: get_view_natural_endpoint: implement rack-aware pairing for tablets
  view: get_view_natural_endpoint: handle case when there are too few view replicas
  view: get_view_natural_endpoint: track replica locator::nodes
  locator: topology: consult local_dc_rack if node not found by host_id
  locator: node: add dc and rack getters
  feature_service: add tablet_rack_aware_view_pairing feature
  view: get_view_natural_endpoint: refactor predicate function
  view: get_view_natural_endpoint: clarify documentation
  view: mutate_MV: optimize remote_endpoints filtering check
  view: mutate_MV: lookup base and view erms synchronously
  view: mutate_MV: calculate keyspace-dependent flags once
2025-01-30 11:32:19 +02:00
Asias He
4018dc7f0d Introduce file stream for tablet
File based stream is a new feature that optimizes tablet movement
significantly. It streams the entire SSTable files without deserializing
SSTable files into mutation fragments and re-serializing them back into
SSTables on receiving nodes. As a result, less data is streamed over the
network, and less CPU is consumed, especially for data models that
contain small cells.

The following patches are imported from the scylla enterprise:

*) Merge 'Introduce file stream for tablet' from Asias He

    This patch uses Seastar RPC stream interface to stream sstable files on
    network for tablet migration.

    It streams sstables instead of mutation fragments. The file based
    stream has multiple advantages over the mutation streaming.

    - No serialization or deserialization for mutation fragments
    - No need to read and process each mutation fragments
    - On wire data is more compact and smaller

    In the test below, a significant speed up is observed.

    Two nodes, 1 shard per node, 1 initial_tablets:

    - Start node 1
    - Insert 10M rows of data with c-s
    - Bootstrap node 2

    Node 1 will migration data to node2 with the file stream.

    Test results:

    1) File stream: bytes on wire = 1132006250 bytes, bw = 836MB/s

    [shard 0:stre] stream_blob - stream_sstables[eadaa8e0-a4f2-4cc6-bf10-39ad1ce106b0]
	Finished sending sstable_nr=2 files_nr=18 files={} range=(-1,9223372036854775807] bytes_sent=1132006250 stream_bw=836MB/s
    [shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 1.08004s seconds

    2) Mutation stream: bytes on wire = 3030004736 bytes, bw = 125410.87 KiB/s = 128MB/s

    [shard 0:stre] stream_session - [Stream #406dc8b0-56b5-11ee-bc2d-000bf4871058]
	Streaming plan for Tablet migration-ks1-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=2958989 KiB, 125410.87 KiB/s
    [shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 23.5992s seconds

    Test Summary:

    File stream v.s. Mutation stream improvements

    - Stream bandwidth = 836 / 128  (MB/s)  = 6.53X

    - Stream time = 23.60 / 1.08  (Seconds) = 21.85X

    - Stream bytes on wire = 3030004736 / 1132006250 (Bytes)= 2.67X

    Closes scylladb/scylla-enterprise#3438

    * github.com:scylladb/scylla-enterprise:
      tests: Add file_stream_test
      streaming: Implement file stream for tablet

*) streaming: Use new take_storage_snapshot interface

    The new take_storage_snapshot returns a file object instead of a file
    name. This allows the file stream sender to read from the file even if
    the file is deleted by compaction.

    Closes scylladb/scylla-enterprise#3728

*) streaming: Protect unsupported file types for file stream

    Currently, we assume the file streamed over the stream_blob rpc verb is
    a sstable file. This patch rejects the unsupported file types on the
    receiver side. This allows us to stream more file types later using the
    current file stream infrastructure without worrying about old nodes
    processing the new file types in the wrong way.

    - The file_ops::noop is renamed to file_ops::stream_sstables to be
      explicit about the file types

    - A missing test_file_stream_error_injection is added to the idl

    Fixes: #3846
    Tests: test_unsupported_file_ops

    Closes scylladb/scylla-enterprise#3847

*) idl: Add service::session_id id to idl

    It will be used in the next patch.

    Refs #3907

*) streaming: Protect file stream with topology_guard

    Similar to "storage_service, tablets: Use session to guard tablet
    streaming", this patch protects file stream with topology_guard.

    Fixes #3907

*) streaming: Take service topology_guard under the try block

    Taking the service::topology_guard could throw. Currently, it throws
    outside the try block, so the rpc sink will not be closed, causing the
    following assertion:

    ```
    scylla: seastar/include/seastar/rpc/rpc_impl.hh:815: virtual
    seastar::rpc::sink_impl<netw::serializer,
    streaming::stream_blob_cmd_data>::~sink_impl() [Serializer =
    netw::serializer, Out = <streaming::stream_blob_cmd_data>]: Assertion
    `this->_con->get()->sink_closed()' failed.
    ```

    To fix, move more code including the topology_guard taking code to the
    try block.

    Fixes https://github.com/scylladb/scylla-enterprise/issues/4106

    Closes scylladb/scylla-enterprise#4110

*) Merge 'Preserve original SSTable state with file based tablet migration' from Raphael "Raph" Carvalho

    We're not preserving the SSTable state across file based migration, so
    staging SSTables for example are being placed into main directory, and
    consequently, we're mixing staging and non-staging data, losing the
    ability to continue from where the old replica left off.
    It's expected that the view update backlog is transferred from old
    into new replica, as migration doesn't wait for leaving replica to
    complete view update work (which can take long). Elasticity is preferred.

    So this fix guarantees that the state of the SSTable will be preserved
    by propagating it in form of subdirectory (each subdirectory is
    statically mapped with a particular state).

    The staging sstables aren't being registered into view update generator
    yet, as that's supposed to be fixed in OSS (more details can be found
    at https://github.com/scylladb/scylladb/issues/19149).

    Fixes #4265.

    Closes scylladb/scylla-enterprise#4267

    * github.com:scylladb/scylla-enterprise:
      tablet: Preserve original SSTable state with file based tablet migration
      sstables: Add get method for sstable state

*) sstable: (Re-)add shareabled_components getter

*) Merge 'File streaming sstables: Use sstable source/sink to transfer snapshots' from Calle Wilund

    Fixes #4246

    Alternative approach/better separation of concern, transport vs. sstable layer. Builds on #4472, but fancier.

    Ensures we transfer and pre-process scylla metadata for streamed
    file blobs first, then properly apply receiving nodes local config
    by using a source and sink layer exported from sstables, which
    handles things like ordering, metadata filtering (on source) as well
    as handling metadata and proper IO paths when writing data on
    receiver node (sink).

    This implementation maintains the statelessness of the current
    design, and the delegated sink side will re-read and re-write the
    metadata for each component processed. This is a little wasteful,
    but the meta is small, and it is less error prone than trying to do
    caching cross-shards etc. The transport is isolated from the
    knowledge.

    This is an alternative/complement to #4436 and #4472, fixing the
    underlying issue. Note that while the layers/API:s here allows easy
    fixing of other fundamental problems in the feature (such as
    destination location etc), these are not included in the PR, to keep
    it as close to the current behaviour as possible.

    Closes scylladb/scylla-enterprise#4646

    * github.com:scylladb/scylla-enterprise:
      raft_tests: Copy/add a topology test with encryption
      file streaming: Use sstable source/sink to transfer snapshots
      sstables: Add source and sink objects + producers for transfering a snapshot
      sstable::types: Add remove accessor for extension info in metadata

*) The change for error injection in merge commit 966ea5955dd8760:

    File streaming now has "stream_mutation_fragments" error injection points
    so test_table_dropped_during_streaming works with file streaming.

*) doc: document file-based streaming

    This commit adds a description of the file-based streaming feature to the documentation.

    It will be displayed in the docs using the scylladb_include_flag directive after
    https://github.com/scylladb/scylladb/pull/20182 is merged, backported to branch-6.0,
    and, in turn, branch-2024.2.

    Refs https://github.com/scylladb/scylla-enterprise/issues/4585
    Refs https://github.com/scylladb/scylla-enterprise/issues/4254

    Closes scylladb/scylla-enterprise#4587

*) doc: move File-based streaming to the Tablets source file-based-streaming

    This commit moves the description of file-based streaming from a common include file
    to the regular doc source file where tablets are described.

    Closes scylladb/scylla-enterprise#4652

*) streaming: sstable_stream_sink_impl: abort: prevent null pointer dereference

Closes scylladb/scylladb#22467
2025-01-26 12:51:59 +02:00
Avi Kivity
59d3a66d18 Revert "Introduce file stream for tablet"
This reverts commit 8208688178. It was
contributed from enterprise, but is too different from the original
for me to merge back.
2025-01-22 09:42:20 +02:00
Benny Halevy
6f8f03f593 feature_service: add tablet_rack_aware_view_pairing feature
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-01-22 09:04:24 +02:00
Benny Halevy
88ae067ddb everywhere: add skeletal support for the in_memory_tables feature
Forward-ported from scylla-enterprise.
Note that the feature has been deprecated and the implementation
is provided only for backward compatibility with pre-existing
features and schema.

Tested manually after adding the following to feature_service:
```
    gms::feature workload_prioritization { *this, "WORKLOAD_PRIORITIZATION"sv };
```

Launched a single-node cluster running 2023.1.10
```
cqlsh> create KEYSPACE ks WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
cqlsh> create TABLE ks.test ( pk int PRIMARY KEY, val int ) WITH compaction = {'class': 'InMemoryCompactionStrategy'};
```

log:
```
Scylla version 2023.1.10-0.20241227.21cffccc1ccd with build-id bd65b8399cb13b713a87e57fe333cfcabfd50be7 starting ...
...
INFO  2024-12-27 19:45:16,563 [shard 0] migration_manager - Create new ColumnFamily: org.apache.cassandra.config.CFMetaData@0x600000f1b400[cfId=5529c630-c47a-11ef-bd1d-4295734ce5a8,ksName=ks,cfName=test,cfType=Standard,comparator=org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.UTF8Type),comment=,readRepairChance=0,dcLocalReadRepairChance=0,tombstoneGcOptions={"mode":"timeout","propagation_delay_in_seconds":"3600"},gcGraceSeconds=864000,keyValidator=org.apache.cassandra.db.marshal.Int32Type,minCompactionThreshold=4,maxCompactionThreshold=32,columnMetadata=[ColumnDefinition{name=pk, type=org.apache.cassandra.db.marshal.Int32Type, kind=PARTITION_KEY, componentIndex=0, droppedAt=-9223372036854775808}, ColumnDefinition{name=val, type=org.apache.cassandra.db.marshal.Int32Type, kind=REGULAR, componentIndex=null, droppedAt=-9223372036854775808}],compactionStrategyClass=class org.apache.cassandra.db.compaction.InMemoryCompactionStrategy,compactionStrategyOptions={enabled=true},compressionParameters={sstable_compression=org.apache.cassandra.io.compress.LZ4Compressor},bloomFilterFpChance=0.01,memtableFlushPeriod=0,caching={"keys":"ALL","rows_per_partition":"ALL"},cdc={},defaultTimeToLive=0,minIndexInterval=128,maxIndexInterval=2048,speculativeRetry=99.0PERCENTILE,triggers=[],isDense=false,in_memory=false,version=5529c631-c47a-11ef-bd1d-4295734ce5a8,droppedColumns={},collections={},indices={}]
INFO  2024-12-27 19:45:16,564 [shard 0] schema_tables - Creating ks.test id=5529c630-c47a-11ef-bd1d-4295734ce5a8 version=ec88d510-6aff-344a-914d-541d37081440
```

Upgraded to this branch and started scylla.
Verified that ks.test was successfuly loaded:

log:
```
INFO  2024-12-27 19:48:58,115 [shard 0:main] init - Scylla version 6.3.0~dev-0.20241227.a64c6dfc153e with build-id f9496134a09cf2e55d3865b9e9ff499f672aa7da starting ...
...
WARN  2024-12-27 19:53:02,948 [shard 1:main] CompactionStrategy - InMemoryCompactionStrategy is no longer supported. Defaulting to NullCompactionStrategy.
...
INFO  2024-12-27 19:53:02,948 [shard 0:main] database - Keyspace ks: Reading CF test id=5529c630-c47a-11ef-bd1d-4295734ce5a8 version=ec88d510-6aff-344a-914d-541d37081440 storage=/home/bhalevy/scylladb/data/ks/test-5529c630c47a11efbd1d4295734ce5a8
```

Then, tested:
```
cqlsh> describe KEYSPACE ks;

CREATE KEYSPACE ks WITH replication = {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true AND tablets = {'enabled': false};

CREATE TABLE ks.test (
    pk int,
    val int,
    PRIMARY KEY (pk)
) WITH bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
    AND comment = ''
    AND compaction = {'class': 'InMemoryCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND speculative_retry = '99.0PERCENTILE';

cqlsh> alter TABLE ks.test with compaction = {'class': 'SizeTieredCompactionStrategy'};
cqlsh> describe KEYSPACE ks;

CREATE KEYSPACE ks WITH replication = {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true AND tablets = {'enabled': false};

CREATE TABLE ks.test (
    pk int,
    val int,
    PRIMARY KEY (pk)
) WITH bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
    AND comment = ''
    AND compaction = {'class': 'SizeTieredCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND speculative_retry = '99.0PERCENTILE'
    AND tombstone_gc = {'mode': 'timeout', 'propagation_delay_in_seconds': '3600'};
```

log:
```
INFO  2024-12-27 19:56:40,465 [shard 0:stmt] migration_manager - Update table 'ks.test' From org.apache.cassandra.config.CFMetaData@0x60000362d800[cfId=5529c630-c47a-11ef-bd1d-4295734ce5a8,ksName==ks,cfName=test,cfType=Standard,comparator=org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.UTF8Type),comment=,tombstoneGcOptions={"mode":"timeout","propagation_delay_in_seconds":"3600"},gcGraceSeconds=864000,minCompactionThreshold=4,maxCompactionThreshold=32,columnMetadata=[ColumnDefinition{name=pk, type=org.apache.cassandra.db.marshal.Int32Type, kind=PARTITION_KEY, componentIndex=0, droppedAt=-9223372036854775808}, ColumnDefinition{name=val, type=org.apache.cassandra.db.marshal.Int32Type, kind=REGULAR, componentIndex=null, droppedAt=-9223372036854775808}],compactionStrategyClass=class org.apache.cassandra.db.compaction.InMemoryCompactionStrategy,compactionStrategyOptions={enabled=true},compressionParameters={sstable_compression=org.apache.cassandra.io.compress.LZ4Compressor},bloomFilterFpChance=0.01,memtableFlushPeriod=0,caching={"keys":"ALL","rows_per_partition":"ALL"},cdc={},defaultTimeToLive=0,minIndexInterval=128,maxIndexInterval=2048,speculativeRetry=99.0PERCENTILE,triggers=[],isDense=false,version=ec88d510-6aff-344a-914d-541d37081440,droppedColumns={},collections={},indices={}] To org.apache.cassandra.config.CFMetaData@0x60000336e000[cfId=5529c630-c47a-11ef-bd1d-4295734ce5a8,ksName==ks,cfName=test,cfType=Standard,comparator=org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.UTF8Type),comment=,tombstoneGcOptions={"mode":"timeout","propagation_delay_in_seconds":"3600"},gcGraceSeconds=864000,minCompactionThreshold=4,maxCompactionThreshold=32,columnMetadata=[ColumnDefinition{name=pk, type=org.apache.cassandra.db.marshal.Int32Type, kind=PARTITION_KEY, componentIndex=0, droppedAt=-9223372036854775808}, ColumnDefinition{name=val, type=org.apache.cassandra.db.marshal.Int32Type, kind=REGULAR, componentIndex=null, droppedAt=-9223372036854775808}],compactionStrategyClass=class org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy,compactionStrategyOptions={enabled=true},compressionParameters={sstable_compression=org.apache.cassandra.io.compress.LZ4Compressor},bloomFilterFpChance=0.01,memtableFlushPeriod=0,caching={"keys":"ALL","rows_per_partition":"ALL"},cdc={},defaultTimeToLive=0,minIndexInterval=128,maxIndexInterval=2048,speculativeRetry=99.0PERCENTILE,triggers=[],isDense=false,version=ecccf010-c47b-11ef-b52c-622f2f0e87c4,droppedColumns={},collections={},indices={}]
INFO  2024-12-27 19:56:40,466 [shard 0: gms] schema_tables - Altering ks.test id=5529c630-c47a-11ef-bd1d-4295734ce5a8 version=ecccf010-c47b-11ef-b52c-622f2f0e87c4
```

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#22068
2025-01-20 16:55:17 +02:00
Asias He
8208688178 Introduce file stream for tablet
File based stream is a new feature that optimizes tablet movement
significantly. It streams the entire SSTable files without deserializing
SSTable files into mutation fragments and re-serializing them back into
SSTables on receiving nodes. As a result, less data is streamed over the
network, and less CPU is consumed, especially for data models that
contain small cells.

The following patches are imported from the scylla enterprise:

*) Merge 'Introduce file stream for tablet' from Asias He

    This patch uses Seastar RPC stream interface to stream sstable files on
    network for tablet migration.

    It streams sstables instead of mutation fragments. The file based
    stream has multiple advantages over the mutation streaming.

    - No serialization or deserialization for mutation fragments
    - No need to read and process each mutation fragments
    - On wire data is more compact and smaller

    In the test below, a significant speed up is observed.

    Two nodes, 1 shard per node, 1 initial_tablets:

    - Start node 1
    - Insert 10M rows of data with c-s
    - Bootstrap node 2

    Node 1 will migration data to node2 with the file stream.

    Test results:

    1) File stream: bytes on wire = 1132006250 bytes, bw = 836MB/s

    [shard 0:stre] stream_blob - stream_sstables[eadaa8e0-a4f2-4cc6-bf10-39ad1ce106b0]
	Finished sending sstable_nr=2 files_nr=18 files={} range=(-1,9223372036854775807] bytes_sent=1132006250 stream_bw=836MB/s
    [shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 1.08004s seconds

    2) Mutation stream: bytes on wire = 3030004736 bytes, bw = 125410.87 KiB/s = 128MB/s

    [shard 0:stre] stream_session - [Stream #406dc8b0-56b5-11ee-bc2d-000bf4871058]
	Streaming plan for Tablet migration-ks1-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=2958989 KiB, 125410.87 KiB/s
    [shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 23.5992s seconds

    Test Summary:

    File stream v.s. Mutation stream improvements

    - Stream bandwidth = 836 / 128  (MB/s)  = 6.53X

    - Stream time = 23.60 / 1.08  (Seconds) = 21.85X

    - Stream bytes on wire = 3030004736 / 1132006250 (Bytes)= 2.67X

    Closes scylladb/scylla-enterprise#3438

    * github.com:scylladb/scylla-enterprise:
      tests: Add file_stream_test
      streaming: Implement file stream for tablet

*) streaming: Use new take_storage_snapshot interface

    The new take_storage_snapshot returns a file object instead of a file
    name. This allows the file stream sender to read from the file even if
    the file is deleted by compaction.

    Closes scylladb/scylla-enterprise#3728

*) streaming: Protect unsupported file types for file stream

    Currently, we assume the file streamed over the stream_blob rpc verb is
    a sstable file. This patch rejects the unsupported file types on the
    receiver side. This allows us to stream more file types later using the
    current file stream infrastructure without worrying about old nodes
    processing the new file types in the wrong way.

    - The file_ops::noop is renamed to file_ops::stream_sstables to be
      explicit about the file types

    - A missing test_file_stream_error_injection is added to the idl

    Fixes: #3846
    Tests: test_unsupported_file_ops

    Closes scylladb/scylla-enterprise#3847

*) idl: Add service::session_id id to idl

    It will be used in the next patch.

    Refs #3907

*) streaming: Protect file stream with topology_guard

    Similar to "storage_service, tablets: Use session to guard tablet
    streaming", this patch protects file stream with topology_guard.

    Fixes #3907

*) streaming: Take service topology_guard under the try block

    Taking the service::topology_guard could throw. Currently, it throws
    outside the try block, so the rpc sink will not be closed, causing the
    following assertion:

    ```
    scylla: seastar/include/seastar/rpc/rpc_impl.hh:815: virtual
    seastar::rpc::sink_impl<netw::serializer,
    streaming::stream_blob_cmd_data>::~sink_impl() [Serializer =
    netw::serializer, Out = <streaming::stream_blob_cmd_data>]: Assertion
    `this->_con->get()->sink_closed()' failed.
    ```

    To fix, move more code including the topology_guard taking code to the
    try block.

    Fixes https://github.com/scylladb/scylla-enterprise/issues/4106

    Closes scylladb/scylla-enterprise#4110

*) Merge 'Preserve original SSTable state with file based tablet migration' from Raphael "Raph" Carvalho

    We're not preserving the SSTable state across file based migration, so
    staging SSTables for example are being placed into main directory, and
    consequently, we're mixing staging and non-staging data, losing the
    ability to continue from where the old replica left off.
    It's expected that the view update backlog is transferred from old
    into new replica, as migration doesn't wait for leaving replica to
    complete view update work (which can take long). Elasticity is preferred.

    So this fix guarantees that the state of the SSTable will be preserved
    by propagating it in form of subdirectory (each subdirectory is
    statically mapped with a particular state).

    The staging sstables aren't being registered into view update generator
    yet, as that's supposed to be fixed in OSS (more details can be found
    at https://github.com/scylladb/scylladb/issues/19149).

    Fixes #4265.

    Closes scylladb/scylla-enterprise#4267

    * github.com:scylladb/scylla-enterprise:
      tablet: Preserve original SSTable state with file based tablet migration
      sstables: Add get method for sstable state

*) sstable: (Re-)add shareabled_components getter

*) Merge 'File streaming sstables: Use sstable source/sink to transfer snapshots' from Calle Wilund

    Fixes #4246

    Alternative approach/better separation of concern, transport vs. sstable layer. Builds on #4472, but fancier.

    Ensures we transfer and pre-process scylla metadata for streamed
    file blobs first, then properly apply receiving nodes local config
    by using a source and sink layer exported from sstables, which
    handles things like ordering, metadata filtering (on source) as well
    as handling metadata and proper IO paths when writing data on
    receiver node (sink).

    This implementation maintains the statelessness of the current
    design, and the delegated sink side will re-read and re-write the
    metadata for each component processed. This is a little wasteful,
    but the meta is small, and it is less error prone than trying to do
    caching cross-shards etc. The transport is isolated from the
    knowledge.

    This is an alternative/complement to #4436 and #4472, fixing the
    underlying issue. Note that while the layers/API:s here allows easy
    fixing of other fundamental problems in the feature (such as
    destination location etc), these are not included in the PR, to keep
    it as close to the current behaviour as possible.

    Closes scylladb/scylla-enterprise#4646

    * github.com:scylladb/scylla-enterprise:
      raft_tests: Copy/add a topology test with encryption
      file streaming: Use sstable source/sink to transfer snapshots
      sstables: Add source and sink objects + producers for transfering a snapshot
      sstable::types: Add remove accessor for extension info in metadata

*) The change for error injection in merge commit 966ea5955dd8760:

    File streaming now has "stream_mutation_fragments" error injection points
    so test_table_dropped_during_streaming works with file streaming.

*) doc: document file-based streaming

    This commit adds a description of the file-based streaming feature to the documentation.

    It will be displayed in the docs using the scylladb_include_flag directive after
    https://github.com/scylladb/scylladb/pull/20182 is merged, backported to branch-6.0,
    and, in turn, branch-2024.2.

    Refs https://github.com/scylladb/scylla-enterprise/issues/4585
    Refs https://github.com/scylladb/scylla-enterprise/issues/4254

    Closes scylladb/scylla-enterprise#4587

*) doc: move File-based streaming to the Tablets source file-based-streaming

    This commit moves the description of file-based streaming from a common include file
    to the regular doc source file where tablets are described.

    Closes scylladb/scylla-enterprise#4652

*) streaming: sstable_stream_sink_impl: abort: prevent null pointer dereference

Closes scylladb/scylladb#22034
2025-01-20 16:43:21 +02:00
Botond Dénes
47989b1503 Merge 'tasks: add tablet resize virtual task' from Aleksandra Martyniuk
In this change, tablet_virtual_task starts supporting tablet
resize (i.e. split and merge).

Users can see running resize tasks - finished tasks are not
presented with the task manager API.

A new task state "suspended" is added. If a resize was revoked,
it will appear to users as suspended. We assume that the resize was revoked
when the tablet number didn't change.

Fixes: #21366.
Fixes: #21367.

No backport, new feature

Closes scylladb/scylladb#21891

* github.com:scylladb/scylladb:
  test: boost: check resize_task_info in tablet_test.cc
  test: add tests to check revoked resize virtual tasks
  test: add tests to check the list of resize virtual tasks
  test: add tests to check spilt and merge virtual tasks status
  test: test_tablet_tasks: generalize functions
  replica: service: add split virtual task's children
  replica: service: pass parent info down to storage_group::split
  tasks: children of virtual tasks aren't internal by default
  tasks: initialize shard in task_info ctor
  service: extend tablet_virtual_task::abort
  service: retrun status_helper struct from tablet_virtual_task::get_status_helper
  service: extend tablet_virtual_task::wait
  tasks: add suspended task state
  service: extend tablet_virtual_task::get_status
  service: extend tablet_virtual_task::contains
  service: extend tablet_virtual_task::get_stats
  service: add service::task_manager_module::get_nodes
  tasks: add task_manager::get_nodes
  tasks: drop noexcept from module::get_nodes
  replica: service: add resize_task_info static column to system.tablets
  locator: extend tablet_task_info to cover resize tasks
2025-01-17 14:24:07 +02:00
Gleb Natapov
0ec9f7de64 gossiper: drop get_unreachable_token_owners functions
It is used by truncate code only and even there it only check if the
returned set is not empty. Check for dead token owners in the truncation
code directly.
2025-01-16 16:37:07 +02:00
Gleb Natapov
36ccc897e8 gossiper: change get_live_members and all its users to work on host ids 2025-01-16 16:37:06 +02:00
Gleb Natapov
b3f8b579c0 gossiper: add get_endpoint_state_ptr() function that works on host id
Will be used later to simplify code.
2025-01-15 16:30:29 +02:00
Kefu Chai
7215d4bfe9 utils: do not include unused headers
these unused includes were identifier by clang-include-cleaner. after
auditing these source files, all of the reports have been confirmed.

please note, because quite a few source files relied on
`utils/to_string.hh` to pull in the specialization of
`fmt::formatter<std::optional<T>>`, after removing
`#include <fmt/std.h>` from `utils/to_string.hh`, we have to
include `fmt/std.h` directly.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2025-01-14 07:56:39 -05:00
Kamil Braun
88a48f2355 Merge 'Load peers table into the gossiper on boot' from Gleb
Since we manage ip to id mapping directly in gossiper now we need to
load the mapping on boot. We already do it anyway, but only due to a bug
which checks raft topology mode config before it is set, so the code
thinks that it is in the gossiper mode and loads peers table into the
gossiper and token metadata. Fix the bug and load peers into the gossiper
only since token metadata is managed by raft.

The series also removes address map related test that no longer checks
anything and replace it with unit test.

It also adds the dc/rack check to "join node" rpc. The check is done
during shadow round now, but for it to work it requires dc/rack to be
propagated through the gossiper and we want to eventually drop it.

Ref: scylladb/scylladb#21777

* 'load-peers' of https://github.com/gleb-cloudius/scylla:
  topology coordinator: reject replace request if topology does not match
  gossiper: fix the logic of shadow_round parameter
  storage_service: do not add endpoint to the gossiper during topology loading.
  storage_service: load peers into gossiper on boot in raft topology mode
  storage_service: set raft topology change mode before using it in join_cluster
  locator: drop inet_address usage to figure out per dc/rack replication
  test: drop test_old_ip_notification_repro.py
  test: address_map: check generation handling during entry addition
2025-01-13 09:40:36 +01:00
Aleksandra Martyniuk
18b829add8 replica: service: add resize_task_info static column to system.tablets
Add resize_task_info static column to system.tablets. Set or delete
resize_task_info value when the resize_decision is changed.
Reflect the column content in tablet_map.
2025-01-10 10:03:07 +01:00
Wojciech Mitros
d04f376227 mv: add an experimental feature for creating views using tablets
We still have a number of issues to be solved for views with tablets.
Until they are fixed, we should prevent users from creating them,
and use the vnode-based views instead.

This patch prepares the feature for enabling views with tablets. The
feature is disabled by default, but currently it has no effect.
After all tests are adjusted to use the feature, we should depend
on the feature for deciding whether we can create materialized views
in tablet-enabled keyspaces.

The unit tests are adjusted to enable this feature explicitly, and it's
also added to the scylla sstable tool config - this tool treats all
tables as if they were tablet-based (surprisingly, with SimpleStrategy),
so for it to work on views, the new feature must be enabled.

Refs scylladb/scylladb#21832

Closes scylladb/scylladb#21833
2025-01-07 15:52:36 +01:00
Gleb Natapov
7e3a196734 gossiper: fix the logic of shadow_round parameter
Currently the logic is mirrored shadow_round is true in on shadow round.
Fix it but flipping all the logic.
2025-01-02 18:44:19 +02:00
Piotr Dulikowski
ecbf8721de gms: introduce WORKLOAD_PRIORITIZATION cluster feature
Information about the number of shares per service level will be stored
in an additional column in the service levels table, which is managed
through group0. We will need the feature to make sure that all nodes in
the cluster know about the new column before any node starts applying
group0 commands the would touch the new column.

This feature also serves a role for the legacy service levels
implementation that uses system_distributed for storage: after all nodes
are upgraded to support workload prioritization, one of the nodes will
perform a schema change operation and will add the new column.
2025-01-02 07:13:34 +01:00
Gleb Natapov
c4b26ba8dc test: drop test_old_ip_notification_repro.py
The test no longer test anything since the address map is updated much
earlier now by the gossiper itself, not by the notifiers. The
functionality is tested by a unit test now.
2025-01-01 12:43:11 +02:00
Gleb Natapov
745b6d7d0d gossiper: ignore gossiper entries with local host id in gossiper mode as well
We already ignore a gossiper entries with host id equal to local host id
in raft mode since those entries are just outdated entries since before
ip change. The same logic applies to gossiper mode as well though, so do
the same in both modes.

Fixes: scylladb/scylladb#21930

Message-ID: <Z20kBZvpJ1fP9WyJ@scylladb.com>
2024-12-31 15:50:12 +02:00
Avi Kivity
76cf5148e1 Merge 'message: introduce advanced rpc compression' from Michał Chojnowski
This is a forward port (from scylla-enterprise) of additional compression options (zstd, dictionaries shared across messages) for inter-node network traffic. It works as follows:

After the patch, messaging_service (Scylla's interface for all inter-node communication)
compresses its network traffic with compressors managed by
the new advanced_rpc_compression::tracker. Those compressors compress with lz4,
but can also be configured to use zstd as long as a CPU usage limit isn't crossed.

A precomputed compression dictionary can be fed to the tracker. Each connection
handled by the tracker will then start a negotiation with the other end to switch
to this dictionary, and when it succeeds, the connection will start being compressed using that dictionary.

All traffic going through the tracker is passed as a single merged "stream" through dict_sampler.
dictionary_service has access to the dict_sampler.
On chosen nodes (in the "usual" configuration: the Raft leader), it uses the sampler to maintain
a random multi-megabyte sample of the sampler's stream. Every several minutes,
it copies the sample, trains a compression dictionary on it (by calling zstd's
training library via the alien_worker thread) and publishes the new dictionary
to system.dicts via Raft's write_mutation command.

This update triggers (eventually) a callback on all nodes, which feeds the new dictionary
to advanced_rpc_compression::tracker, and this switches (eventually) all inter-node connections
to this dictionary.

Closes scylladb/scylladb#22032

* github.com:scylladb/scylladb:
  messaging_service: use advanced_rpc_compression::tracker for compression
  message/dictionary_service: introduce dictionary_service
  service: make Raft group 0 aware of system.dicts
  db/system_keyspace: add system.dicts
  utils: add advanced_rpc_compressor
  utils: add dict_trainer
  utils: introduce reservoir_sampling
  utils: introduce alien_worker
  utils: add stream_compressor
2024-12-31 15:02:57 +02:00
Kefu Chai
6acc5294a4 treewide: migrate from boost::copy_range to std::ranges::to
now that we are allowed to use C++23. we now have the luxury of using
`std::ranges::to`.

in this change, we:

- replace `boost::copy_range` to `std::ranges::to`
- remove unused `#include` of boost headers

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21880
2024-12-26 11:46:26 +02:00
Michał Chojnowski
6a982ee0dc service: make Raft group 0 aware of system.dicts
Adds glue which causes the contents of system.dicts to be sent
in group 0 snapshots, and causes a callback to be called when
system.dicts is updated locally. The callback is currently empty
and will be hooked up to the RPC compressor tracker in one of the
next commits.
2024-12-23 23:37:02 +01:00
Avi Kivity
eb62593f2c treewide: use angle brackets when including seastar headers
We treat Seastar as a "system" library, and those are included
with angle brackets.

Closes scylladb/scylladb#21959
2024-12-20 16:16:28 +02:00
Kamil Braun
91cddcc17f Merge 'Do not reset quarantine list in non raft mode' from Gleb Natapov
The series contains small fixes to the gossiper one of which fixes #21930. Others I noticed while debugged the issue.

Fixes: scylladb/scylladb#21930

Closes scylladb/scylladb#21956

* github.com:scylladb/scylladb:
  gossiper: do not reset _just_removed_endpoints in non raft mode
  gossiper: do not send echo message to yourself
  gossiper: do not call apply for the node's old state
2024-12-19 11:03:35 +01:00
Avi Kivity
f3eade2f62 treewide: relicense to ScyllaDB-Source-Available-1.0
Drop the AGPL license in favor of a source-available license.
See the blog post [1] for details.

[1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/
2024-12-18 17:45:13 +02:00
Avi Kivity
5a849b0a6a Merge "Move more subsystems to use host ids instead of ips" from Gleb
"
This series converts repair, streaming and node_ops (and some parts of
alternator) to work on host ids instead of ips. This allows to remove
a lot of (but not all) functions that work on ips from effective
replication map.

CI: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/13830/

Refs: scylladb/scylladb#21777
"

* 'gleb/move-to-host-id-more' of github.com:scylladb/scylla-dev:
  locator: topology: remove no longer use get_all_ips()
  gossiper: change get_unreachable_nodes to host ids
  locator: drop no longer used ip based functions from effective replication map and friends
  test: move network_topology_strategy_test and token_metadata_test to use host id based APIs
  replica/database: drop usage of ip in favor of host id in get_keyspace_local_ranges
  replica/mutation_dump: use host ids instead of ips
  alternator: move ttl to work with host ids instead of ips
  storage_service: move node_ops code to use host ids instead of host ips
  streaming: move streaming code to use host ids instead of host ips
  repair: move repair code to use host ids instead of host ips
  gossiper: add get_unreachable_host_ids() function
  locator: topology: add more function that return host ids to effective replication map
  locator: add more function that return host ids to effective replication map
2024-12-18 13:48:22 +02:00
Gleb Natapov
e318dfb83a gossiper: do not reset _just_removed_endpoints in non raft mode
By the time the function is called during start it may already be
populated.

Fixes: scylladb/scylladb#21930
2024-12-17 16:57:13 +02:00
Gleb Natapov
3368019982 gossiper: do not send echo message to yourself
When sending by ID we should check that we do not translate our old
address to our ID and sending locally. mark_alive should not be called
with node's old ip anyway.
2024-12-17 16:57:13 +02:00
Gleb Natapov
e80355d3a1 gossiper: do not call apply for the node's old state
If a nodes changed its address an old state may be still in a gossiper,
so ignore it.
2024-12-17 16:57:13 +02:00
Gleb Natapov
c2e3d875ab gossiper: change get_unreachable_nodes to host ids 2024-12-15 11:31:11 +02:00
Gleb Natapov
03c8ffa45c storage_service: move node_ops code to use host ids instead of host ips 2024-12-15 11:31:11 +02:00
Gleb Natapov
92815684df gossiper: add get_unreachable_host_ids() function
Will be needed later.
2024-12-15 11:31:10 +02:00
Aleksandra Martyniuk
9fad3a621a replica: service: add migration_task_info column to system.tablets
Add migration_task_info column to system.tablets. Set migration_task_info
value on migration request if the feature is enabled in the cluster.
Reflect the column content in tablet_metadata.
2024-12-11 12:07:36 +01:00
Tomasz Grabiec
8e60a0b831 Merge 'truncate: make TRUNCATE TABLE safe with tablets' from Ferenc Szili
Currently truncating a table works by issuing an RPC to all the nodes which call `database::truncate_table_on_all_shards()`, which makes sure that older writes are dropped.

It works with tablets, but is not safe. A concurrent replication process may bring back old data.

This change makes makes TRUNCATE TABLE a topology operation, so that it excludes with other processes in the system which could interfere with it. More specifically, it makes TRUNCATE a global topology request.

Backporting is not needed.

Fixes #16411

Closes scylladb/scylladb#19789

* github.com:scylladb/scylladb:
  docs: docs: topology-over-raft: Document truncate_table request
  storage_proxy: fix indentation and remove empty catch/rethrow
  test: add tests for truncate with tablets
  storage_proxy: use new TRUNCATE for tablets
  truncate: make TRUNCATE a global topology operation
  storage_service: move logic of wait_for_topology_request_completion()
  RPC: add truncate_with_tablets RPC with frozen_topology_guard
  feature_service: added cluster feature for system.topology schema change
  system.topology_requests: change schema
  storage_proxy: propagate group0 client and TSM dependency
2024-12-10 17:50:50 +01:00
Emil Maskovsky
969b396699 gossiper: fix the backward incompatible change
In the cleanup commit a840949ea0
a regression was introduced that caused backward incompatible changes
in the gossiper application state name strings.

In the e486e0f759 the value
`application_state::CDC_STREAMS_TIMESTAMP` was changed to
`application_state::CDC_GENERATION_ID`, but the name string
"CDC_STREAMS_TIMESTAMP" was kept for backward compatibility.

The cleanup commit a840949ea0 however
changed the name string to "CDC_GENERATION_ID" by ommission (not noticing
the difference) which caused backward incompatible change.

There is also another case found of "IGNOR_MSB_BITS" (that has a typo -
missing the "E" in "IGNORE") to "IGNORE_MSB_BITS", which also needs to
be reverted back to keep the backward compatibility.

Fixes: scylladb/scylladb#21811

Closes scylladb/scylladb#21813
2024-12-09 16:46:25 +01:00
Gleb Natapov
ed7ea1dc71 feature_service: fix typo in address_nodes_by_host_ids feature name
Message-ID: <Z1WYaYuQuPP8lNAX@scylladb.com>
2024-12-09 17:27:27 +02:00
Tomasz Grabiec
7e2875d648 Merge 'Add tablet merge support' from Raphael Raph Carvalho
The goal of merge is to reduce the tablet count for a shrinking table. Similar to how split increases the count while the table is growing. The load balancer decision to merge is implemented today (came with infrastructure introduced for split), but it wasn't handled until now.

Initial tablet count is respected while the table is in "growing mode". For example, the table leaves it if there was a need to split above the initial tablet count. After the table leaves the mode, the average size can be trusted to determine that the table is shrinking. Merge decision is emitted if the average tablet size is 50% of the target. Hysteresis is applied to avoid oscillations between split and merges.

Similar to split, the decision to merge is recorded in tablet map's resize_type field with the string "merge". This is important in case of coordinator failover, so new coordinator continues from where the old left off.

Unlike split, the preparation phase during merge is not done by the replica (with split compactions), but rather by the coordinator by co-locating sibling tablets in the same node's shard. We can define sibling tablets as tablets that have contiguous range and will become one after merge. The concept is based on the power-of-two constraint and token contiguity. For example, in a table with 4 tablets, tablets of ids 0 and 1 are siblings, 2 and 3 are also siblings.

The algorithm for co-locating sibling tablets is very simple. The balancer is responsible for it, and it will emit migrations so that "odd" tablet will follow the "even" one. For example, tablet 1 will be migrated to where tablet 0 lives. Co-location is low in priority, it's not the end of the world to delay merge, but it's not ideal to delay e.g. decommission or even regular load balancing as that can translate into temporary unbalancing, impacting the user activities. So co-location migrations will happen when there is no more important work to do.
While regular balancing is higher in priority, it will not undo the co-location work done so far. It does that by treating co-located tablets as if they were already merged. The load inversion convergence check was adjusted so balancer understand when two tablets are being migrated instead of one, to avoid oscillations.

When balancer completes co-location work for a table undergoing merge, it will put the id of the table into the resize_plan, which is about communicating with the topology coordinator that a table is ready for it. With all sibling tablets co-located, the coordinator can resize the tablet map (reduce it by a factor of 2) and record the new map into group0. All the replicas will react to it (on token metadata update) by merging the storage (memtable(s) + sstables) of sibling tablets into one.

Fixes #18181.

system test details:

test: https://github.com/pehala/scylla-cluster-tests/blob/tablets_split_merge/tablets_split_merge_test.py
yaml file: https://github.com/pehala/scylla-cluster-tests/blob/tablets_split_merge/test-cases/features/tablets/tablets-split-merge-test.yaml

instance type: i3.8xlarge
nodes: 3
target tablet size: 0.5G (scaled down by 10, to make it easier to trigger splits and merges)
description: multiple cycles of growing and shrinking the data set in order to trigger splits and merges.
data_set_size: ~100G
initial_tablets: 64, so it grew to 128 tablets on split, and back to 64 on merge.

latency of reads and writes that happened in parallel to split and merge:
```
$ for i in scylla-bench*; do cat $i | grep "Mode\|99th:\|99\.9th:"; done
Mode:			 write
  99.9th:	 3.145727ms
  99th:		 1.998847ms
  99.9th:	 3.145727ms
  99th:		 2.031615ms
Mode:			 read
  99.9th:	 3.145727ms
  99th:		 2.031615ms
  99.9th:	 3.145727ms
  99th:		 2.031615ms
Mode:			 write
  99.9th:	 3.047423ms
  99th:		 1.933311ms
  99.9th:	 3.047423ms
  99th:		 1.933311ms
Mode:			 read
  99.9th:	 3.145727ms
  99th:		 1.900543ms
  99.9th:	 3.145727ms
  99th:		 1.900543ms
Mode:			 write
  99.9th:	 5.079039ms
  99th:		 3.604479ms
  99.9th:	 35.389439ms
  99th:		 25.624575ms
Mode:			 write
  99.9th:	 3.047423ms
  99th:		 1.998847ms
  99.9th:	 3.047423ms
  99th:		 1.998847ms
Mode:			 read
  99.9th:	 3.080191ms
  99th:		 2.031615ms
  99.9th:	 3.112959ms
  99th:		 2.031615ms
```

Closes scylladb/scylladb#20572

* github.com:scylladb/scylladb:
  docs: Document tablet merging
  tests/boost: Add test to verify correctness of balancer decisions during merge
  tests/topology_experimental_raft: Add tablet merge test
  service: Handle exception when retrying split
  service: Co-locate sibling tablets for a table undergoing merge
  gms: Add cluster feature for tablet merge
  service: Make merge of resize plan commutative
  replica: Implement merging of compaction groups on merge completion
  replica: Handle tablet merge completion
  service: Implement tablet map resize for merge
  locator: Introduce merge_tablet_info()
  service: Rename topology::transition_state::tablet_split_finalization
  service: Respect initial_tablet_count if table is in growing mode
  service: Wire migration_tablet_set into the load balancer
  locator: Add tablet_map::sibling_tablets()
  service: Introduce sorted_replicas_for_tablet_load()
  locator/tablets: Extend tablet_replica equality comparator to three-way
  service: Introduce alias to per-table candidate map type
  service: Add replication constraint check variant for migration_tablet_set
  service: Add convergence check variant for migration_tablet_set
  service: Add migration helpers for migration_tablet_set
  service/tablet_allocator: Introduce migration_tablet_set
  service: Introduce migration_plan::add(migrations_vector)
  locator/tablets: Introduce tablet_map::for_each_sibling_tablets()
  locator/tablets: Introduce tablet_map::needs_merge()
  locator/tablets: Introduce resize_decision::initial_decision()
  locator/tablets: Fix return type of three-way comparison operators
  service: Extract update of node load on migrations
  service: Extract converge check for intra-node migration
  service: Extract erase of tablet replicas from candidate list
  scripts/tablet-mon: Allow visualization of tablet id
2024-12-06 18:06:20 +01:00
Abhinav
6c90a25014 Fix gossiper orphan node floating problem by adding a remover fiber
In the current scenario, if during startup, a node crashes after initiating gossip and before joining group0,
then it keeps floating in the gossiper forever because the raft based gossiper purging logic is only effective
once node joins group0. This orphan node hinders the successor node from same ip to join cluster since it collides
with it during gossiper shadow round.

This commit intends to fix this issue by adding a background thread which periodically checks for such orphan entries in
gossiper and removes them.

A test is also added in to verify this logic. This test fails without this background thread enabled, hence
verifying the behavior.

Fixes: scylladb/scylladb#20082

Closes scylladb/scylladb#21600
2024-12-06 10:45:07 +01:00
Ferenc Szili
bfbfc0fea9 feature_service: added cluster feature for system.topology schema change
This patch adds a feature serive which protects the system.topology
schema change against situations where clusters are incompletely
upgraded to new a version and could be rolled back.
2024-12-04 11:30:07 +01:00
Raphael S. Carvalho
cd5d1d3c99 gms: Add cluster feature for tablet merge
The reason we need it is that tablet merge can only be finalized
when the cluster agrees on the feature, otherwise unpatched
nodes would fail to handle merge finalization, potentially
crashing.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-12-03 21:47:01 -03:00
Avi Kivity
841481c202 Merge "move storage proxy and adjacent services to identify hosts by ids" from Gleb
"
This rather large patch series moves storage proxy and some adjacent
services (like migration manager) to use host ids to identify nodes rather
than ips. Messaging service gains a capability to address nodes by host
ids (which allows dropping translations from topology coordinator code
that worked on host ids already) and also makes sure that a node with
incorrect host id will reject a message (can happen during address
changes).

The series gets rid of the raft address map completely and replaces it with
the gossiper address map which is managed by the gossiper since translation
is now done in the layer below raft.

Fixes: scylladb/scylladb#6403

perf-simple-query -- smp 1 -m 1G output

Before:

enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, frontend=cql, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...
64336.82 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   41291 insns/op,   24485 cycles/op,        0 errors)
62669.58 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   41277 insns/op,   24695 cycles/op,        0 errors)
69172.12 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   41326 insns/op,   24463 cycles/op,        0 errors)
56706.60 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   41143 insns/op,   24513 cycles/op,        0 errors)
56416.65 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   41186 insns/op,   24851 cycles/op,        0 errors)

         throughput: mean=61860.35 standard-deviation=5395.48 median=62669.58 median-absolute-deviation=5153.75 maximum=69172.12 minimum=56416.65
instructions_per_op: mean=41244.62 standard-deviation=76.90 median=41276.94 median-absolute-deviation=58.55 maximum=41326.19 minimum=41142.80
  cpu_cycles_per_op: mean=24601.35 standard-deviation=167.39 median=24512.64 median-absolute-deviation=116.65 maximum=24851.45 minimum=24462.70

After:

enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, frontend=cql, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...
65237.35 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   40733 insns/op,   23145 cycles/op,        0 errors)
59283.09 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   40624 insns/op,   23948 cycles/op,        0 errors)
70851.03 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   40625 insns/op,   23027 cycles/op,        0 errors)
70549.61 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   40650 insns/op,   23266 cycles/op,        0 errors)
68634.96 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   40622 insns/op,   22935 cycles/op,        0 errors)

         throughput: mean=66911.21 standard-deviation=4814.60 median=68634.96 median-absolute-deviation=3638.40 maximum=70851.03 minimum=59283.09
instructions_per_op: mean=40650.89 standard-deviation=47.55 median=40624.60 median-absolute-deviation=27.11 maximum=40733.37 minimum=40622.33
  cpu_cycles_per_op: mean=23264.16 standard-deviation=402.12 median=23145.29 median-absolute-deviation=237.63 maximum=23947.96 minimum=22934.59

CI: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/13531/
SCT (longevity-100gb-4h with nemesis_selector: ['topology_changes']): https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/gleb/job/move-to-host-id/3/

Tested mixed cluster manually.
"

* 'gleb/move-to-host-id-v2' of github.com:scylladb/scylla-dev: (55 commits)
  group0: drop unused field from replace_info struct
  test: rename raft_address_map_test to address_map_test and move if from raft tests
  raft_address_map: remove raft address map
  topology coordinator: do not modify expire state for left/new nodes any more in raft address map
  topology coordinator: drop expiring entries in gossiper address map on error injections since raft one is no longer used
  group0: drop raft address map dependency from raft_rpc
  group0: move raft_ticker_type definition from raft_address_map.hh
  storage_service: do not update raft address map on gossiper events
  group0: drop raft address map dependency from raft_server_with_timeouts
  group0: move group0 upgrade code to host ids
  repair: drop raft address map dependency
  group0: remove unused raft address map getter from raft_group0
  group0: drop raft address map from group0_state_machine dependency since it is not used there any more
  group0: remove dependency on raft address map from group0_state_id_handler
  gossiper: add get_application_state_ptr that searches by host_id
  gossiper: change get_live_token_owners to return host ids
  view: move view building to host id
  hints: use host id to send hints
  storage_proxy: remove id_vector_to_addr since it is no longer used
  db: consistency_level: change is_sufficient_live_nodes to work on host ids
  ...
2024-12-03 18:18:48 +02:00
Kefu Chai
bab12e3a98 treewide: migrate from boost::adaptors::transformed to std::views::transform
now that we are allowed to use C++23. we now have the luxury of using
`std::views::transform`.

in this change, we:

- replace `boost::adaptors::transformed` with `std::views::transform`
- use `fmt::join()` when appropriate where `boost::algorithm::join()`
  is not applicable to a range view returned by `std::view::transform`.
- use `std::ranges::fold_left()` to accumulate the range returned by
  `std::view::transform`
- use `std::ranges::fold_left()` to get the maximum element in the
  range returned by `std::view::transform`
- use `std::ranges::min()` to get the minimal element in the range
  returned by `std::view::transform`
- use `std::ranges::equal()` to compare the range views returned
  by `std::view::transform`
- remove unused `#include <boost/range/adaptor/transformed.hpp>`
- use `std::ranges::subrange()` instead of `boost::make_iterator_range()`,
  to feed `std::views::transform()` a view range.

to reduce the dependency to boost for better maintainability, and
leverage standard library features for better long-term support.

this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.

limitations:

there are still a couple places where we are still using
`boost::adaptors::transformed` due to the lack of a C++23 alternative
for `boost::join()` and `boost::adaptors::uniqued`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21700
2024-12-03 09:41:32 +02:00
Gleb Natapov
18a9de51e7 gossiper: add get_application_state_ptr that searches by host_id 2024-12-02 10:31:13 +02:00
Gleb Natapov
7d751709e3 gossiper: change get_live_token_owners to return host ids
Also amend the only user and drop the ip to id translation.
2024-12-02 10:31:13 +02:00
Gleb Natapov
a1fdc8c847 storage_proxy: change mutation rpcs to send forward and reply addresses as host ids
RPCs from old nodes will still use old format so translation will be
used in this case. The change is backwards compatible thanks to RPC
extensibility.
2024-12-02 10:31:12 +02:00
Gleb Natapov
b6425446c6 gossiper: fix indentation after previous patch 2024-12-02 10:31:11 +02:00
Gleb Natapov
a64b079b5c gossiper: drop advertise_myself parameter to gossiper
The parameter was needed when nodes were addressed by IP, so during
replace with the same IP a new node had to "hide" itself from the
cluster to not get accidentally confused with the old node. Now,
when nodes are addressed by host id the situation is impossible.
2024-12-02 10:31:11 +02:00