Commit Graph

36346 Commits

Author SHA1 Message Date
Tomasz Grabiec
41e69836fd db, migration_manager: Notify about tablet metadata changes via migration_listener::on_update_tablet_metadata() 2023-04-24 10:49:37 +02:00
Tomasz Grabiec
b754433ac1 migration_notifier: Introduce before_drop_keyspace()
Tablet allocator will need to inject mutations on keyspace drop.
2023-04-24 10:49:37 +02:00
Tomasz Grabiec
5b046043ea migration_manager: Make prepare_keyspace_drop_announcement() return a future<>
It will be extended with listener notification firing, which is an
async operation.
2023-04-24 10:49:37 +02:00
Tomasz Grabiec
4b4238b069 test: perf: Introduce perf-tablets
Example output:

  $ build/release/scylla perf-tablets --tables 10 --tablets-per-table $((8*1024)) --rf 3

  testlog - Total tablet count: 81920
  testlog - Size of tablet_metadata in memory: 7683 KiB
  testlog - Copied in 2.163421 [ms]
  testlog - Cleared in 0.767507 [ms]
  testlog - Saved in 774.813232 [ms]
  testlog - Read in 246.666885 [ms]
  testlog - Read mutations in 211.677292 [ms]
  testlog - Size of canonical mutations: 20.633621 [MiB]
  testlog - Disk space used by system.tablets: 0.902344 [MiB]
2023-04-24 10:49:37 +02:00
Tomasz Grabiec
70a35f70a6 test: Introduce tablets_test 2023-04-24 10:49:37 +02:00
Tomasz Grabiec
b4ac329367 test: lib: Do not override table id in create_table()
It is already set by schema_maker. In tablets_test we will depend on
the id being the same as that set in the schema_builder, so don't
change it to something else.
2023-04-24 10:49:37 +02:00
Tomasz Grabiec
5a24984147 utils, tablets: Introduce external_memory_usage() 2023-04-24 10:49:37 +02:00
Tomasz Grabiec
f3fbfdaa37 db: tablets: Add printers
Example:

TRACE 2023-03-30 12:06:33,918 [shard 0] tablets - Read tablet metadata: {
  8cd5b560-cee2-11ed-9cd5-7f37187f2167: {
    [0]: last_token=-6917529027641081857, replicas={4fe5c4d5-7030-4ddd-8117-ba22c29f4f57:0},
    [1]: last_token=-4611686018427387905, replicas={3160b965-1925-4677-884b-c761e2bf4272:0},
    [2]: last_token=-2305843009213693953, replicas={3160b965-1925-4677-884b-c761e2bf4272:0},
    [3]: last_token=-1, replicas={4fe5c4d5-7030-4ddd-8117-ba22c29f4f57:0},
    [4]: last_token=2305843009213693951, replicas={3160b965-1925-4677-884b-c761e2bf4272:0},
    [5]: last_token=4611686018427387903, replicas={4fe5c4d5-7030-4ddd-8117-ba22c29f4f57:0},
    [6]: last_token=6917529027641081855, replicas={4fe5c4d5-7030-4ddd-8117-ba22c29f4f57:0},
    [7]: last_token=9223372036854775807, replicas={3160b965-1925-4677-884b-c761e2bf4272:0}
  }
}
2023-04-24 10:49:37 +02:00
Tomasz Grabiec
9d786c1ebc db: tablets: Add persistence layer 2023-04-24 10:49:37 +02:00
Tomasz Grabiec
fa8ad9a585 dht: Use last_token_of_compaction_group() in split_token_range_msb() 2023-04-24 10:49:37 +02:00
Tomasz Grabiec
fceb5f8cf6 locator: Introduce tablet_metadata
token_metadata now stores tablet metadata with information about
tablets in the system.
2023-04-24 10:49:37 +02:00
Tomasz Grabiec
241f7febec dht: Introduce first_token() 2023-04-24 10:49:36 +02:00
Tomasz Grabiec
462e3ffd36 dht: Introduce next_token() 2023-04-24 10:49:36 +02:00
Tomasz Grabiec
27acf3b129 storage_proxy: Improve trace-level logging 2023-04-24 10:49:36 +02:00
Tomasz Grabiec
34a9c62ae5 locator: token_metadata: Fix confusing comment on ring_range()
It could be interpreted to mean that the search token is excluded.
2023-04-24 10:49:36 +02:00
Tomasz Grabiec
e4865bd4d1 dht, storage_proxy: Abstract token space splitting
Currently, scans are splitting partition ranges around tokens. This
will have to change with tablets, where we should split at tablet
boundaries.

This patch introduces token_range_splitter which abstracts this
task. It is provided by effective_replication_map implementation.
2023-04-24 10:49:36 +02:00
Tomasz Grabiec
b769c4ee55 Revert "query_ranges_to_vnodes_generator: fix for exclusive boundaries"
This reverts commit 95bf8eebe0.

Later patches will adapt this code to work with token_range_splitter,
and the unit test added by the reverted commit will start to fail.

The unit test asks the query_ranges_to_vnodes_generator to split the range:

   [t:end, t+1:start)

around token t, and expects the generator to produce an empty range

   [t:end, t:end]

After adapting this code to token_range_splitter, the input range will
not be split because it is recognized as adjacent to t:end, and the
optimization logic will not kick in. Rather than adding more logic to
handle this case, I think it's better to drop the optimization, as it
is not very useful (rarely happens) and not required for correctness.
2023-04-24 10:49:36 +02:00
Tomasz Grabiec
94e1c7b859 db: Exclude keyspace with per-table replication in get_non_local_strategy_keyspaces_erms()
This allows update_pending_ranges(), invoked on keyspace creation, to
succeed in the presence of keyspaces with per-table replication
strategy. It will update only vnode-based erms, which is intended
behavior, since only those need pending ranges updated.

This change will also make node operations like bootstrap, repair,
etc. to work (not fail) in the presence of keyspaces with per-table
erms, they will just not be replicated using those algorithms.

Before, these would fail inside get_effective_replication_map(), which
is forbidden for keyspaces with per-table replication.
2023-04-24 10:49:36 +02:00
Tomasz Grabiec
dc04da15ec db: Introduce get_non_local_vnode_based_strategy_keyspaces()
It's meant to be used in places where currently
get_non_local_strategy_keyspaces() is used, but work only with
keyspaces which use vnode-based replication strategy.
2023-04-24 10:49:36 +02:00
Tomasz Grabiec
8fcb320e71 service: storage_proxy: Avoid copying keyspace name in write handler 2023-04-24 10:49:36 +02:00
Tomasz Grabiec
9b17ad3771 locator: Introduce per-table replication strategy
Will be used by tablet-based replication strategies, for which
effective replication map is different per table.

Also, this patch adapts existing users of effective replication map to
use the per-table effective replication map.

For simplicity, every table has an effective replication map, even if
the erm is per keyspace. This way the client code can be uniform and
doesn't have to check whether replication strategy is per table.

Not all users of per-keyspace get_effective_replication_map() are
adapted yet to work per-table. Those algorithms will throw an
exception when invoked on a keyspace which uses per-table replication
strategy.
2023-04-24 10:49:36 +02:00
Tomasz Grabiec
5d9bcb45de treewide: Use replication_strategy_ptr as a shorter name for abstract_replication_strategy::ptr_type 2023-04-24 10:49:36 +02:00
Tomasz Grabiec
bb297d86a0 locator: Introduce effective_replication_map
With tablet-based replication strategies it will represent replication
of a single table.

Current vnode_effective_replication_map can be adapted to this interface.

This will allow algorithms like those in storage_proxy to work with
both kinds of replication strategies over a single abstraction.
2023-04-24 10:49:36 +02:00
Tomasz Grabiec
d3c9ad4ed6 locator: Rename effective_replication_map to vnode_effective_replication_map
In preparation for introducing a more abstract
effective_replication_map which can describe replication maps which
are not based on vnodes.
2023-04-24 10:49:36 +02:00
Tomasz Grabiec
1343bfa708 locator: effective_replication_map: Abstract get_pending_endpoints() 2023-04-24 10:49:36 +02:00
Tomasz Grabiec
7b01fe8742 db: Propagate feature_service to abstract_replication_strategy::validate_options()
Some replication strategy options may be feature-dependent.
2023-04-24 10:49:36 +02:00
Tomasz Grabiec
9781d3ffc5 db: config: Introduce experimental "TABLETS" feature 2023-04-24 10:49:36 +02:00
Tomasz Grabiec
a892e144cc db: Log replication strategy for debugging purposes 2023-04-24 10:49:36 +02:00
Tomasz Grabiec
7543c75b62 db: Log full exception on error in do_parse_schema_tables() 2023-04-24 10:49:36 +02:00
Tomasz Grabiec
c923bdd222 db: keyspace: Remove non-const replication strategy getter
Keyspace will store replication_ptr, which is a const pointer. No user
needs a mutable reference.
2023-04-24 10:49:36 +02:00
Tomasz Grabiec
bf2ce8ff75 config: Reformat 2023-04-24 10:49:36 +02:00
Chang Chen Chien
c25a718008 docs: fix typo in using-scylla/local-secondary-indexes.rst
Closes #13607
2023-04-24 06:56:19 +03:00
Pavel Emelyanov
5e201b9120 database: Remove compaction_manager.hh inclusion into database.hh
The only reason why it's there (right next to compaction_fwd.hh) is
because the database::table_truncate_state subclass needs the definition
of compaction_manager::compaction_reenabler subclass.

However, the former sub is not used outside of database.cc and can be
defined in .cc. Keeping it outside of the header allows dropping the
compaction_manager.hh from database.hh thus greatly reducing its fanout
over the code (from ~180 indirect inclusions down to ~20).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #13622
2023-04-23 16:27:11 +03:00
Tomasz Grabiec
bd0b299322 Merge 'Manage CDC generations when bootstrapping nodes using Raft Group 0 topology coordinator' from Kamil Braun
Introduce a new table `CDC_GENERATIONS_V3` (`system.cdc_generations_v3`).
The table schema is a copy-paste of the `CDC_GENERATIONS_V2` schema. The
difference is that V2 lives in `system_distributed_keyspace` and writes to it
are distributed using regular `storage_proxy` replication mechanisms based on
the token ring.  The V3 table lives in `system_keyspace` and any mutations
written to it will go through group 0.

Extend the `TOPOLOGY` schema with new columns:
- `new_cdc_generation_data_uuid` will be stored as part of a bootstrapping
  node's `ring_slice`, it stores UUID of a newly introduced CDC
  generation which is used as partition key for the `CDC_GENERATIONS_V3`
  table to access this new generation's data. It's a regular column,
  meaning that every row (corresponding to a node) will have its own.
- `current_cdc_generation_uuid` and `current_cdc_generation_timestamp`
  together form the ID of the newest CDC generation in the cluster.
  (the uuid is the data key for `CDC_GENERATIONS_V3`, the timestamp is
  when the CDC generation starts operating). Those are static columns
  since there's a single newest CDC generation.

When topology coordinator handles a request for node to join, calculate a new
CDC generation using the bootstrapping node's tokens, translate it to mutation
format, and insert this mutation to the CDC_GENERATIONS_V3 table through group 0
at the same time we assign tokens to the node in Raft topology. The partition
key for this data is stored in the bootstrapping node's `ring_slice`.

After inserting new CDC generation data , we need to pick a timestamp for this
generation and commit it, telling all nodes in the cluster to start using the
generation for CDC log writes once their clocks cross that timestamp.

We introduce a separate step to the bootstrap saga, before
`write_both_read_old`, called `commit_cdc_generation`. In this step, the
coordinator takes the `new_cdc_generation_data_uuid` stored in a bootstrapping
node's `ring_slice` - which serves as the key to the table where the CDC
generation data is stored - and combines it with a timestamp which it generates
a bit into the future (as in old gossiper-based code, we use 2 * ring_delay, by
default 1 minute). This gives us a CDC generation ID which we commit into the
topology state as the `current_cdc_generation_id` while switching the saga to
the next step, `write_both_read_old`.

Once a new CDC generation is committed to the cluster by the topology
coordinator, we also need to publish it to the user-facing description tables so
CDC applications know which streams to read from.

This uses regular distributed table writes underneath (tables living in the
`system_distributed` keyspace) so it requires `token_metadata` to be nonempty.
We need a hack for the case of bootstrapping the first node in the cluster -
turning the tokens into normal tokens earlier in the procedure in
`token_metadata`, but this is fine for the single-node case since no streaming
is happening.

When a node notices that a new CDC generation was introduced in
`storage_service::topology_state_load`, it updates its internal data structures
that are used when coordinating writes to CDC log tables.

We include the current CDC generation data in topology snapshot transfers.

Some fixes and refactors included.

Closes #13385

* github.com:scylladb/scylladb:
  docs: cdc: describe generation changes using group 0 topology coordinator
  cdc: generation_service: add a FIXME
  cdc: generation_service: add legacy_ prefix for gossiper-based functions
  storage_service: include current CDC generation data in topology snapshots
  db: system_keyspace: introduce `query_mutations` with range/slice
  storage_service: hold group 0 apply mutex when reading topology snapshot
  service: raft_group0_client: introduce `hold_read_apply_mutex`
  storage_service: use CDC generations introduced by Raft topology
  raft topology: publish new CDC generation to the user description tables
  raft topology: commit a new CDC generation on node bootstrap
  raft topology: create new CDC generation data during node bootstrap
  service: topology_state_machine: make topology::find const
  db: system_keyspace: small refactor of `load_topology_state`
  cdc: generation: extract pure parts of `make_new_generation` outside
  db: system_keyspace: add storage for CDC generations managed by group 0
  service: topology_state_machine: better error checking for state name (de)serialization
  service: raft: plumbing `cdc::generation_service&`
  cdc: generation: `get_cdc_generation_mutations`: take timestamp as parameter
  cdc: generation: make `topology_description_generator::get_sharding_info` a parameter
  sys_dist_ks: make `get_cdc_generation_mutations` public
  sys_dist_ks: move find_schema outside `get_cdc_generation_mutations`
  sys_dist_ks: move mutation size threshold calculation outside `get_cdc_generation_mutations`
  service/raft: group0_state_machine: signal topology state machine in `load_snapshot`
2023-04-21 18:11:27 +02:00
Anna Stuchlik
a68b976c91 doc: document tombstone_gc as not experimental
The tombstone_gc was documented as experimental in version 5.0.
It is no longer experimental in version 5.2.
This commit updates the information about the option.

Closes #13469
2023-04-21 14:43:25 +02:00
Botond Dénes
fcd7f6ac5f Update tools/java submodule
* tools/java c9be8583...eb3c43f8 (1):
  > Use EstimatedHistogram in metricPercentilesAsArray
2023-04-21 14:31:38 +03:00
Kefu Chai
a2aa133822 treewide: use std::lexicographical_compare_threeway
this the standard library offers
`std::lexicographical_compare_threeway()`, and we never uses the
last two addition parameters which are not provided by
`std::lexicographical_compare_threeway()`. there is no need to have
the homebrew version of trichotomic compare function.

in this change,

* all occurrences of `lexicographical_tri_compare()` are replaced
  with `std::lexicographical_compare_threeway()`.
* ``lexicographical_tri_compare()` is dropped.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13615
2023-04-21 14:28:18 +03:00
Kefu Chai
51fc0bc698 sstables: use default generated operator==
C++20 compiler is able to generate defaulted operator== and
operator!=. and the default generated operators behaves exactly
the same as the ones crafted by us. so let's it do its job.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13614
2023-04-21 14:25:39 +03:00
Botond Dénes
10c1f1dc80 Merge 'db: system_keyspace: use microsecond resolution for group0_history range tombstone' from Kamil Braun
in `make_group0_history_state_id_mutation`, when adding a new entry to
the group 0 history table, if the parameter `gc_older_than` is engaged,
we create a range tombstone in the mutation which deletes entries older
than the new one by `gc_older_than`. In particular if
`gc_older_than = 0`, we want to delete all older entries.

There was a subtle bug there: we were using millisecond resolution when
generating the tombstone, while the provided state IDs used microsecond
resolution. On a super fast machine it could happen that we managed to
perform two schema changes in a single millisecond; this happened
sometimes in `group0_test.test_group0_history_clearing_old_entries`
on our new CI/promotion machines, causing the test to fail because the
tombstone didn't clear the entry correspodning to the previous schema
change when performing the next schema change (since they happened in
the same millisecond).

Use microsecond resolution to fix that. The consecutive state IDs used
in group 0 mutations are guaranteed to be strictly monotonic at
microsecond resolution (see `generate_group0_state_id` in
service/raft/raft_group0_client.cc).

Fixes #13594

Closes #13604

* github.com:scylladb/scylladb:
  db: system_keyspace: use microsecond resolution for group0_history range tombstone
  utils: UUID_gen: accept decimicroseconds in min_time_UUID
2023-04-21 14:08:56 +03:00
Kamil Braun
55f43e532c Merge 'get rid of gms/failure_detector' from Benny Halevy
Move gms::arrival_window to api/failure_detector which is its only user.
and get rid of the rest, which is not used, now that we use direct_failure_detector instead.

TODO: integare direct_failure_detector with failure_detector api.

Closes #13576

* github.com:scylladb/scylladb:
  gms: get rid of unused failure_detector
  api: failure_detector: remove false dependency on failure_detector::arrival_window
  test: rest_api: add test_failure_detector
2023-04-21 11:47:44 +02:00
Kamil Braun
f7408130c9 Merge 'Fix topology management when raft-based topology is enabled' from Tomasz Grabiec
Fixes a problem when raft-based topology is enabled, which loads
topology from storage. It starts by clearing topology and then adding
nodes one by one. Before this patch, this violates internal invariant
of topology object which puts the local node as the first node. This
would manifest by triggering an assert in topology::pop_node() which
throws if popping the node at index 0 in order to keep the information
about local node around. This is normally prevented by a check in
topology::remove_node() which avoid calling pop_node() if removing the
local node. But since there is no node which is marked as local, this
check allows the first node to be popped.

To fix the problem I lift the invariant that local node is always in
_nodes. We still have information about local node in config. Instead
of keeping it in _nodes, we recognize it as part of indexing. We also
allow removing the local node like a regular node.

The path which reloads topology works correctly after this, the local
node will be recognized when (if) it is added to the topology.

Fixes #13495

Closes #13498

* github.com:scylladb/scylladb:
  locator: topology: Fix move assignment
  locator: topology: Add printer
  tests: topology: Test that topology clearing preserves information about local node
  locator: topology: Recognize local node as part of indexing it
  locator: topology: Fix get_location(ep) for local node
  locator: topology: Fix typo
  locator: topology: Preserve config when cloning
2023-04-21 11:45:08 +02:00
Alejo Sanchez
ce87aedd30 test: topology smp test with custom cluster
Instead of decommission of initial cluster, use custom cluster.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #13589
2023-04-21 10:43:54 +02:00
Kamil Braun
f9d8118c8d db: system_keyspace: use microsecond resolution for group0_history range tombstone
in `make_group0_history_state_id_mutation`, when adding a new entry to
the group 0 history table, if the parameter `gc_older_than` is engaged,
we create a range tombstone in the mutation which deletes entries older
than the new one by `gc_older_than`. In particular if
`gc_older_than = 0`, we want to delete all older entries.

There was a subtle bug there: we were using millisecond resolution when
generating the tombstone, while the provided state IDs used microsecond
resolution. On a super fast machine it could happen that we managed to
perform two schema changes in a single millisecond; this happened
sometimes in `group0_test.test_group0_history_clearing_old_entries`
on our new CI/promotion machines, causing the test to fail because the
tombstone didn't clear the entry correspodning to the previous schema
change when performing the next schema change (since they happened in
the same millisecond).

Use microsecond resolution to fix that. The consecutive state IDs used
in group 0 mutations are guaranteed to be strictly monotonic at
microsecond resolution (see `generate_group0_state_id` in
service/raft/raft_group0_client.cc).

Fixes #13594
2023-04-21 10:33:05 +02:00
Kamil Braun
218a056825 utils: UUID_gen: accept decimicroseconds in min_time_UUID
The function now accepts higher-resolution duration types, such as
microsecond resolution timestamps. Will be used by the next commit.
2023-04-21 10:33:02 +02:00
Kefu Chai
ca6ebbd1f0 cql3, db: sstable: specialize fmt::formatter<function_name>
this is a part of a series to migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `function_name` without the help of `operator<<`.

the corresponding `operator<<()` are dropped dropped in this change,
as all its callers are now using fmtlib for formatting now.

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13608
2023-04-21 10:07:28 +03:00
Botond Dénes
d74f3598f4 Merge 'dht: specialize fmt::formatter<dht::token>' from Kefu Chai
this is a part of a series to migrating from `operator<<(ostream&, ..)` based formatting to fmtlib based formatting. the goal here is to enable fmtlib to print `dht::token` without the help of `operator<<`.

the corresponding `operator<<()` is preserved in this change, as it has lots of users in this project, we will tackle them case-by-case in follow-up changes.

also, the forward declaration of `operator<<(ostream&, constdht::token&)` in `dht/i_partitioner.hh` is removed. ias it not necessary.

Refs https://github.com/scylladb/scylladb/issues/13245

Closes #13610

* github.com:scylladb/scylladb:
  dht: remove unnecessarily forward declaration
  dht: specialize fmt::formatter<dht::token>
2023-04-21 09:51:25 +03:00
Kefu Chai
c5fa1ac9f7 sstable: specialize fmt::formatter<component_type>
this is a part of a series to migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `component_type` without the help of `operator<<`.

the corresponding `operator<<()` are dropped dropped in this change,
as all its callers are now using fmtlib for formatting now.

also, please note, to enable fmtlib to format `std::set<component_type>`
in `test/boost/sstable_3_x_test.cc` , we need to include
`<fmt/ranges.h>` in that source file.

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13598
2023-04-21 09:49:24 +03:00
Kefu Chai
9215adee46 streaming: specialize fmt::formatter<stream_reason>
this is a part of a series to migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `stream_reason` without the help of `operator<<`.

please note, because we still cannot use the generic formatter for
std::unordered_map provided by fmtlib, so in order to drop `operator<<`
for `stream_reason`, and to print `unordered_map<stream_reason>`,
`fmt::join()` is used as a temporary solution. we will audit all
`fmt::join()` calls, after removing the homebrew formatter of
`std::unordered_map`.

the corresponding `operator<<()` are dropped dropped in this change,
as all its callers are now using fmtlib for formatting now.

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13609
2023-04-21 09:44:23 +03:00
Kefu Chai
ecb5380638 treewide: s/boost::lexical_cast<std::string>/fmt::to_string()/
this change replaces all occurrences of `boost::lexical_cast<std::string>`
in the source tree with `fmt::to_string()`. for couple reasons:

* `boost::lexical_cast<std::string>` is longer than `fmt::to_string()`,
  so the latter is easier to parse and read.
* `boost::lexical_cast<std::string>` creates a stringstream under the
  hood, so it can use the `operator<<` to stringify the given object.
  but stringstream is known to be less performant than fmtlib.
* we are migrating to fmtlib based formatting, see #13245. so
  using `fmt::to_string()` helps us to remove yet another dependency
  on `operator<<`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13611
2023-04-21 09:43:53 +03:00
Benny Halevy
3f1ac846d8 gms: get rid of unused failure_detector
The legacy failure_detector is now unused and can be removed.

TODO: integare direct_failure_detector with failure_detector api.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-21 09:08:27 +03:00