In many cases we trigger offstrategy compaction opportunistically
also when there's nothing to do. In this case we still print
to the log lots of info-level message and call
`run_offstrategy_compaction` that wastes more cpu cycles
on learning that it has nothing to do.
This change bails out early if the maintenance set is empty
and prints a "Skipping off-strategy compaction" message in debug
level instead.
Fixes#13466
Also, add an group_id class and return it from compaction_group and table_state.
Use that to identify the compaction_group / table_state by "ks_name.cf_name compaction_group=idx/total" in log messages.
Fixes#13467Closes#13520
* github.com:scylladb/scylladb:
compaction_manager: print compaction_group id
compaction_group, table_state: add group_id member
compaction_manager: offstrategy compaction: skip compaction if no candidates are found
This series fixes a few issues caused by f1bbf705f9
(f1bbf705f9):
- table, compaction_manager: prevent cross shard access to owned_ranges_ptr
- Fixes#13631
- distributed_loader: distribute_reshard_jobs: pick one of the sstable shard owners
- compaction: make_partition_filter: do not assert shard ownership
- allow the filtering reader now used during resharding to process tokens owned by other shards
Closes#13635
* github.com:scylladb/scylladb:
compaction: make_partition_filter: do not assert shard ownership
distributed_loader: distribute_reshard_jobs: pick one of the sstable shard owners
table, compaction_manager: prevent cross shard access to owned_ranges_ptr
This PR introduces an experimental feature called "tablets". Tablets are
a way to distribute data in the cluster, which is an alternative to the
current vnode-based replication. Vnode-based replication strategy tries
to evenly distribute the global token space shared by all tables among
nodes and shards. With tablets, the aim is to start from a different
side. Divide resources of replica-shard into tablets, with a goal of
having a fixed target tablet size, and then assign those tablets to
serve fragments of tables (also called tablets). This will allow us to
balance the load in a more flexible manner, by moving individual tablets
around. Also, unlike with vnode ranges, tablet replicas live on a
particular shard on a given node, which will allow us to bind raft
groups to tablets. Those goals are not yet achieved with this PR, but it
lays the ground for this.
Things achieved in this PR:
- You can start a cluster and create a keyspace whose tables will use
tablet-based replication. This is done by setting `initial_tablets`
option:
```
CREATE KEYSPACE test WITH replication = {'class': 'NetworkTopologyStrategy',
'replication_factor': 3,
'initial_tablets': 8};
```
All tables created in such a keyspace will be tablet-based.
Tablet-based replication is a trait, not a separate replication
strategy. Tablets don't change the spirit of replication strategy, it
just alters the way in which data ownership is managed. In theory, we
could use it for other strategies as well like
EverywhereReplicationStrategy. Currently, only NetworkTopologyStrategy
is augmented to support tablets.
- You can create and drop tablet-based tables (no DDL language changes)
- DML / DQL work with tablet-based tables
Replicas for tablet-based tables are chosen from tablet metadata
instead of token metadata
Things which are not yet implemented:
- handling of views, indexes, CDC created on tablet-based tables
- sharding is done using the old method, it ignores the shard allocated in tablet metadata
- node operations (topology changes, repair, rebuild) are not handling tablet-based tables
- not integrated with compaction groups
- tablet allocator piggy-backs on tokens to choose replicas.
Eventually we want to allocate based on current load, not statically
Closes#13387
* github.com:scylladb/scylladb:
test: topology: Introduce test_tablets.py
raft: Introduce 'raft_server_force_snapshot' error injection
locator: network_topology_strategy: Support tablet replication
service: Introduce tablet_allocator
locator: Introduce tablet_aware_replication_strategy
locator: Extract maybe_remove_node_being_replaced()
dht: token_metadata: Introduce get_my_id()
migration_manager: Send tablet metadata as part of schema pull
storage_service: Load tablet metadata when reloading topology state
storage_service: Load tablet metadata on boot and from group0 changes
db, migration_manager: Notify about tablet metadata changes via migration_listener::on_update_tablet_metadata()
migration_notifier: Introduce before_drop_keyspace()
migration_manager: Make prepare_keyspace_drop_announcement() return a future<>
test: perf: Introduce perf-tablets
test: Introduce tablets_test
test: lib: Do not override table id in create_table()
utils, tablets: Introduce external_memory_usage()
db: tablets: Add printers
db: tablets: Add persistence layer
dht: Use last_token_of_compaction_group() in split_token_range_msb()
locator: Introduce tablet_metadata
dht: Introduce first_token()
dht: Introduce next_token()
storage_proxy: Improve trace-level logging
locator: token_metadata: Fix confusing comment on ring_range()
dht, storage_proxy: Abstract token space splitting
Revert "query_ranges_to_vnodes_generator: fix for exclusive boundaries"
db: Exclude keyspace with per-table replication in get_non_local_strategy_keyspaces_erms()
db: Introduce get_non_local_vnode_based_strategy_keyspaces()
service: storage_proxy: Avoid copying keyspace name in write handler
locator: Introduce per-table replication strategy
treewide: Use replication_strategy_ptr as a shorter name for abstract_replication_strategy::ptr_type
locator: Introduce effective_replication_map
locator: Rename effective_replication_map to vnode_effective_replication_map
locator: effective_replication_map: Abstract get_pending_endpoints()
db: Propagate feature_service to abstract_replication_strategy::validate_options()
db: config: Introduce experimental "TABLETS" feature
db: Log replication strategy for debugging purposes
db: Log full exception on error in do_parse_schema_tables()
db: keyspace: Remove non-const replication strategy getter
config: Reformat
This allows update_pending_ranges(), invoked on keyspace creation, to
succeed in the presence of keyspaces with per-table replication
strategy. It will update only vnode-based erms, which is intended
behavior, since only those need pending ranges updated.
This change will also make node operations like bootstrap, repair,
etc. to work (not fail) in the presence of keyspaces with per-table
erms, they will just not be replicated using those algorithms.
Before, these would fail inside get_effective_replication_map(), which
is forbidden for keyspaces with per-table replication.
It's meant to be used in places where currently
get_non_local_strategy_keyspaces() is used, but work only with
keyspaces which use vnode-based replication strategy.
Will be used by tablet-based replication strategies, for which
effective replication map is different per table.
Also, this patch adapts existing users of effective replication map to
use the per-table effective replication map.
For simplicity, every table has an effective replication map, even if
the erm is per keyspace. This way the client code can be uniform and
doesn't have to check whether replication strategy is per table.
Not all users of per-keyspace get_effective_replication_map() are
adapted yet to work per-table. Those algorithms will throw an
exception when invoked on a keyspace which uses per-table replication
strategy.
this is the first step to the uuid-based generation identifier. the goal is to encapsulate the generation related logic in generator, so its consumers do not have to understand the difference between the int64_t based generation and UUID v1 based generation.
this commit should not change the behavior of existing scylla. it just allows us to derive from `generation_generator` so we can have another generator which generates UUID based generation identifier.
Closes#13073
* github.com:scylladb/scylladb:
replica, test: create generation id using generator
sstables: add generation_generator
test: sstables: use generate_n for generating ids for testing
This series cleans up the generation and value types used in gms / gossiper.
Currently we use a blend of int, int32_t, and int64_t around messaging.
This change defines gms::generation_type and gms::version_type as int32_t
and add check in non-release modes that the respective int64 value passed over messaging do not overflow 32 bits.
Closes#12966
* github.com:scylladb/scylladb:
gossiper: version_generator: add {debug_,}validate_gossip_generation
gms: gossip_digest: use generation_type and version_type
gms: heart_beat_state: use generation_type and version_type
gms: versioned_value: use version_type
gms: version_generator: define version_type and generation_type strong types
utils: move generation-number to gms
utils: add tagged_integer
gms: versioned_value: make members private
scylla-gdb: add get_gms_versioned_value
gms: versioned_value: delete unused compare_to function
gms: gossip_digest: delete unused compare_to function
The only reason why it's there (right next to compaction_fwd.hh) is
because the database::table_truncate_state subclass needs the definition
of compaction_manager::compaction_reenabler subclass.
However, the former sub is not used outside of database.cc and can be
defined in .cc. Keeping it outside of the header allows dropping the
compaction_manager.hh from database.hh thus greatly reducing its fanout
over the code (from ~180 indirect inclusions down to ~20).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#13622
When distributing the resharding jobs, prefer one of
the sstable shard owners based on foreign_sstable_open_info.
This is particularly important for uploaded sstables
that are resharded since they require cleanup.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Seen after f1bbf705f9 in debug mode
distributed_loader collect_all_shared_sstables copies
compaction::owned_ranges_ptr (lw_shared_ptr<const
dht::token_range_vector>)
across shards.
Since update_sstable_cleanup_state is synchronous, it can
be passed a const refrence to the token_range_vector instead.
It is ok to access the memory read-only across shards
and since this happens on start-up, there are no special
performance requirements.
Fixes#13631
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
and provide accessor functions to get them.
1. So they can't be modified by mistake, as the versioned value is
immutable. A new value must have a higher version.
2. Before making the version a strong gms::version_type.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
reuse generation_generator for generating generation identifiers for
less repeatings. also, add allow update generator to update its
lastest known generation id.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
There's a bunch of functions in view.{hh|cc} that don't belong to any
class and perform view-related claculations for view updates. Lots of
them eventually call view_info::select_statement() which will later need
the dictionary.
By now all those methods' callers have data dictionary at hand and can
share it via argument.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The caller is table with view-update-generator at hand (it calls
mutate_MV on). Builder here is used as a temporary object that destroys
once the caller coroutine co_return-s, so keeping the database obtained
from the view-update-generator is safe.
Later the v.u.b. object will propagate its data dictionary down the
callstacks.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Caller already has it to call mutate_MV() on. The method in question
will need the generator in one of the next patches.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a legacy safety check in view code that needs to find a base table from its schema ID. To do it it calls for global storage proxy instance. The comment says that this code can be removed once computes_column feature is known by everyone. I'm not sure if that's the case, so here's more complicated yet less incompatible way to stop using global proxy instance.
Closes#13504
* github.com:scylladb/scylladb:
view: Remove unused view_ptr reference
view: Carry backing-secondary-index bit via view builder
view: Keep backing-seconday-index bool on value_getter
table: Add const index manager sgetter
enable_optimized_twcs_queries is specific to TWCS, therefore it
belongs to TWCS, not replica::table.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#13489
Load-and-stream reads the entire content from SSTables, therefore it can
afford to discard the bloom filter that might otherwise consume a significant
amount of memory. Bloom filters are only needed by compaction and other
replica::table operations that might want to check the presence of keys
in the SSTable files, like single-partition reads.
It's not uncommon to see Data:Filter ratio of less than 100:1, meaning
that for ~300G of data, filters will take ~3G.
In addition to saving memory footprint, it also reduces operation time
as load-and-stream no longer have to read, parse and build the filters
from disk into memory.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
this is a part of a series to migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print classes fulfill the requirement of `FragmentedView` concept
without the help of template function of `to_hex()`, this function is
dropped in this change, as all its callers are now using fmtlib
for formatting now. the helper of `fragment_to_hex()` is dropped
as well, its only caller is `to_hex()`.
Refs scylladb#13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13471
This series extends sstable cleanup to resharding and other (offstrategy, major, and regular) compaction types so to:
* cleanup uploaded sstables (#11933)
* cleanup staging sstables after they are moved back to the main directory and become eligible for compaction (#9559)
When perform_cleanup is called, all sstables are scanned, and those that require cleanup are marked as such, and are added for tracking to table_state::cleanup_sstable_set. They are removed from that set once released by compaction.
Along with that sstables set, we keep the owned_ranges_ptr used by cleanup in the table_state to allow other compaction types (offstrategy, major, or regular) to cleanup those sstables that are marked as require_cleanup and that were skipped by cleanup compaction for either being in the maintenance set (requiring offstrategy compaction) or in staging.
Resharding is using a more straightforward mechanism of passing the owned token ranges when resharding uploaded sstables and using it to detect sstable that require cleanup, now done as piggybacked on resharding compaction.
Closes#12422
* github.com:scylladb/scylladb:
table: discard_sstables: update_sstable_cleanup_state when deleting sstables
compaction_manager: compact_sstables: retrieve owned ranges if required
sstables: add a printer for shared_sstable
compaction_manager: keep owned_ranges_ptr in compaction_state
compaction_manager: perform_cleanup: keep sstables in compaction_state::sstables_requiring_cleanup
compaction: refactor compaction_state out of compaction_manager
compaction: refactor compaction_fwd.hh out of compaction_descriptor.hh
compaction_manager: compacting_sstable_registration: keep a ref to the compaction_state
compaction_manager: refactor get_candidates
compaction_manager: get_candidates: mark as const
table, compaction_manager: add requires_cleanup
sstable_set: add for_each_sstable_until
distributed_loader: reshard: update sstable cleanup state
table, compaction_manager: add update_sstable_cleanup_state
compaction_manager: needs_cleanup: delete unused schema param
compaction_manager: perform_cleanup: disallow empty sorted_owened_ranges
distributed_loader: reshard: consider sstables for cleanup
distributed_loader: process_upload_dir: pass owned_ranges_ptr to reshard
distributed_loader: reshard: add optional owned_ranges_ptr param
distributed_loader: reshard: get a ref to table_state
distributed_loader: reshard: capture creator by ref
distributed_loader: reshard: reserve num_jobs buckets
compaction: move owned ranges filtering to base class
compaction: move owned_ranges into descriptor
We need to remove the deleted sstables from
update_sstable_cleanup_state otherwise their data and index
files will remain opened and their storage space won't be reclaimed.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
So it can be used in the next patch that will refactor
compaction_state out of class compaction_manager.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Since the sstables are loaded from foreign open info
we should mark them for cleanup if needed (and owned_ranges_ptr is provided).
This will allow a later patch to enable filtering
for cleanup only for sstable sets containing
sstables that require cleanup.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
update_sstable_cleanup_state calls needs_cleanup and
inserts (or erases) the sstable into the respective
compaction_state.sstables_requiring_cleanup set.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When called from `process_upload_dir` we pass a list
of owned tokens to `reshard`. When they are available,
run resharding, with implicit cleanup, also on unshared
sstables that need cleanup.
Fixes#11933
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that reshard is a coroutine, creator is preserved
in the coroutine frame until completion so we can
simply capture it by reference now.
Note that previously it was moved into the compaction
descriptor, but the capture wasn't mutable so it was
copied anyhow and this change doesn't introduced a
regression.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
On boot it's very useful to know which storage a table comes from, so
add the respective info to existing log messages.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The sharded<sys_ks> instances are plugged to large data handler and
compaction manager to maintain the circular dependency between these
components via the interposing database instance. Do the same for user
sstables manager, because S3 driver will need to update the local
ownership table.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This patch adds storage options lw-ptr to sstables_manager::make_sstable
and makes the storage instance creation depend on the options. For local
it just creates the filesystem storage instance, for S3 -- throws, but
next patch will fix that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The class in question will need to know the table's storage it will need
to list sstables from. For that -- construct it with the storage options
taken from table.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>