The log-structured allocator collects allocation statistics (which it
uses to manage memory reserves) in some objects kept in
memtable_table_shared_data. Right now, this object is local to memtable_list,
which itself is local to a tablet replica. Move it to table scope so
different tablets in the shard share the statistics. This helps a
newly-migrated tablet adjust more quickly.
Those two members are passed from memtable_list to memtable. Since we
wish to pass them from table, it becomes awkward to pass them as two
separate variables as their contents are specific to memtable internals.
Wrap them in a name that indicates their role (being table-wide shared
data for memtables) and pass them as a unit.
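A minimal sketch of the wrapping idea, using stand-in types rather than the real logalloc classes. The member names `read_section` and `allocating_section`, and everything suffixed `_sketch`, are assumptions made for illustration:

```cpp
#include <cstddef>

// Stand-in for logalloc::allocating_section: it only remembers the largest
// allocation it has seen, which plays the role of the learned statistic.
struct allocating_section_sketch {
    std::size_t learned_reserve = 0;
    void note_allocation(std::size_t bytes) {
        if (bytes > learned_reserve) {
            learned_reserve = bytes;
        }
    }
};

// Table-wide data shared by all memtables of a table.
// The member names are hypothetical.
struct memtable_table_shared_data {
    allocating_section_sketch read_section;
    allocating_section_sketch allocating_section;
};

// A memtable borrows the shared data instead of owning its own sections.
class memtable_sketch {
    memtable_table_shared_data& _shared;
public:
    explicit memtable_sketch(memtable_table_shared_data& shared) : _shared(shared) {}
    void apply_write(std::size_t bytes) { _shared.allocating_section.note_allocation(bytes); }
    std::size_t write_reserve() const { return _shared.allocating_section.learned_reserve; }
};
```

In this sketch a memtable created later (for example, for a newly-migrated tablet) starts with whatever the earlier memtables already learned, because both refer to the same shared object.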
Consider this:
1) file streaming takes storage snapshot = list of sstables
2) concurrent compaction unlinks some of those sstables from the file system
3) file streaming tries to send unlinked sstables, but files other
than data and index cannot be read as only data and index have file
descriptors opened
To fix it, the snapshot now returns a set of files, one per sstable
component, for each sstable.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#16476
This introduces the ability to split a storage group.
The main compaction group is split into left and right groups.
set_split() is used to set the storage group to splitting mode, which
will create left and right compaction groups. Incoming writes will
now be placed into memtable of either left or right groups.
split() is used to complete the splitting of a group. It only
returns when all preexisting data is split. That means main
compaction group will be empty and all the data will be stored
in either left or right group.
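The two-step protocol can be sketched roughly as follows. This is not the real storage group code: the types are stand-ins, and routing by a single split point (tokens at or below it go left) is an assumption of the sketch.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a compaction group: just a bag of tokens.
struct group_sketch {
    std::vector<std::int64_t> tokens;
};

class storage_group_sketch {
    group_sketch _main, _left, _right;
    bool _splitting = false;
    std::int64_t _split_point;   // tokens <= split point go left (assumption)

    group_sketch& side_for(std::int64_t token) {
        return token <= _split_point ? _left : _right;
    }
public:
    explicit storage_group_sketch(std::int64_t split_point) : _split_point(split_point) {}

    // Enter splitting mode: from now on new writes bypass the main group.
    void set_split() { _splitting = true; }

    void write(std::int64_t token) {
        (_splitting ? side_for(token) : _main).tokens.push_back(token);
    }

    // Complete the split: move all preexisting data out of the main group.
    void split() {
        set_split();
        for (std::int64_t t : _main.tokens) {
            side_for(t).tokens.push_back(t);
        }
        _main.tokens.clear();
    }

    const group_sketch& main_group() const { return _main; }
    const group_sketch& left() const { return _left; }
    const group_sketch& right() const { return _right; }
};
```

After `split()` returns in this sketch, the main group is empty and every token lives in either the left or the right group, which mirrors the guarantee described above.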
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Storage group is the storage of tablets. This new concept is helpful
for tablet splitting, where the storage of tablet will be split
in multiple compaction groups, where each can be compacted
independently.
The reason for not going with arena concept is that it added
complexity, and it felt much more elegant to keep compaction
group unchanged which at the end of the day abstracts the concept
of a set of sstables that can be compacted and operated
independently.
When splitting, the storage group for a tablet may therefore own
multiple compaction groups, left, right, and main, where main
keeps the data that needs splitting. When splitting completes,
only left and right compaction groups will be populated.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Using consistent cluster management without schema commitlog
results in a bad configuration error thrown during bootstrap. Soon, we
will make consistent cluster management mandatory. This forces us
to also make schema commitlog mandatory, which we do in this patch.
A booting node decides to use schema commitlog if at least one of
the two statements below is true:
- the node has `force_schema_commitlog=true` config,
- the node knows that the cluster supports the `SCHEMA_COMMITLOG`
cluster feature.
The `SCHEMA_COMMITLOG` cluster feature has been added in version
5.1. This patch is supposed to be a part of version 6.0. We don't
support a direct upgrade from 5.1 to 6.0 because it skips two
versions - 5.2 and 5.4. So, in a supported upgrade we can assume
that the version which we upgrade from has schema commitlog. This
means that we don't need to check the `SCHEMA_COMMITLOG` feature
during an upgrade.
The reasoning above also applies to Scylla Enterprise. Version
2024.2 will be based on 6.0. Probably, we will only support
an upgrade to 2024.2 from 2024.1, which is based on 5.4. But even
if we support an upgrade from 2023.x, this patch won't break
anything because 2023.1 is based on 5.2, which has schema
commitlog. Upgrades from 2022.x definitely won't be supported.
When we populate a new cluster, we can use the
`force_schema_commitlog=true` config to use schema commitlog
unconditionally. Then, the cluster feature check is irrelevant.
This check could fail because we initiate schema commitlog before
we learn about the features. The `force_schema_commitlog=true`
config is especially useful when we want to use consistent cluster
management. Failing feature checks would lead to crashes during
initial bootstraps. Moreover, there is no point in creating a new
cluster with `consistent_cluster_management=true` and
`force_schema_commitlog=false`. It would just cause some initial
bootstraps to fail, and after successful restarts, the result would
be the same as if we used `force_schema_commitlog=true` from the
start.
In conclusion, we can unconditionally use schema commitlog without
any checks in 6.0 because we can always safely upgrade a cluster
and start a new cluster.
Apart from making schema commitlog mandatory, this patch adds two
changes that are its consequences:
- making the unneeded `force_schema_commitlog` config unused,
- deprecating the `SCHEMA_COMMITLOG` feature, which is always
assumed to be true.
Closes scylladb/scylladb#16254
Fixes some typos as found by codespell run on the code.
In this commit, I was hoping to fix only comments, not user-visible alerts, output, etc.
Follow-up commits will take care of them.
Refs: https://github.com/scylladb/scylladb/issues/16255
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Major compaction already flushes each table to make
sure it considers any mutations that are present in the
memtable for the purpose of tombstone purging.
See 64ec1c6ec6
However, tombstone purging may be inhibited by data
in commitlog segments based on `gc_time_min` in the
`tombstone_gc_state` (See f42eb4d1ce).
Flushing all tables in the database releases
all references to commitlog segments and thereby
maximizes the potential for tombstone purging,
which is typically the reason for running major compaction.
However, flushing all tables too frequently might
result in tiny sstables. Since, when flushing all
keyspaces using `nodetool flush`, the `force_keyspace_compaction`
api is invoked for each keyspace successively, we need a mechanism
to prevent too frequent flushes by major compaction.
Hence a `compaction_flush_all_tables_before_major_seconds` interval
configuration option is added (defaults to 24 hours).
In the case that not all tables are flushed prior
to major compaction, we revert to the old behavior of
flushing each table in the keyspace before major-compacting it.
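The decision could be sketched like this. It is an illustration only: `choose_flush_mode`, `flush_mode` and the clock alias are hypothetical names, and the real code may track the last database-wide flush differently.

```cpp
#include <chrono>
#include <optional>

using sketch_clock = std::chrono::steady_clock;

enum class flush_mode { all_tables, each_table_in_keyspace };

// Decide how major compaction should flush. `last_flush_all` is disengaged
// if no database-wide flush has happened yet. `interval` plays the role of
// compaction_flush_all_tables_before_major_seconds.
inline flush_mode choose_flush_mode(std::optional<sketch_clock::time_point> last_flush_all,
                                    sketch_clock::time_point now,
                                    std::chrono::seconds interval) {
    if (!last_flush_all || now - *last_flush_all >= interval) {
        return flush_mode::all_tables;
    }
    // Flushed everything recently: fall back to the old per-table behavior.
    return flush_mode::each_table_in_keyspace;
}
```

With a 24 hour interval, a second major compaction issued one hour after a database-wide flush would only flush the tables of the keyspace being compacted.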
Fixes scylladb/scylladb#15777
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Flushes all tables after forcing force_new_active_segment
of the commitlog to make sure all commitlog segments can
get recycled.
Otherwise, due to "false sharing", rarely-written tables
might inhibit recycling of the commitlog segments they reference.
After f42eb4d1ce,
that won't allow compaction to purge some tombstones based on
the min_gc_time.
To be used in the next patch by major compaction.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When flushing is done externally, e.g. by running
`nodetool flush` prior to `nodetool compact`,
flush_memtables=false can be passed to skip flushing
of tables right before they are major-compacted.
This is useful to prevent creation of small sstables
due to excessive memtable flushing.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
this change is a cleanup.
marking a return type `const` when the value is returned by value has no
effect, so these `const` specifiers are useless. let's drop them.
and, if we compile the tree with `-Wignored-qualifiers`, the compiler
would warn like:
```
/home/kefu/dev/scylladb/schema/schema.hh:245:5: error: 'const' type qualifier on return type has no effect [-Werror,-Wignored-qualifiers]
245 | const index_metadata_kind kind() const;
| ^~~~~
```
so this change also silences the above warnings.
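a small standalone sketch of why the qualifier is redundant, using a made-up enum instead of the real `index_metadata_kind`. the pragma is only there so the sketch itself builds when the warning is promoted to an error:

```cpp
#include <type_traits>

// Hypothetical enum standing in for index_metadata_kind.
enum class index_kind_sketch { keys, custom };

// Silence the diagnostic quoted above so this sketch builds under strict flags.
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wignored-qualifiers"
inline const index_kind_sketch kind_with_const() { return index_kind_sketch::keys; }
#pragma GCC diagnostic pop

inline index_kind_sketch kind_without_const() { return index_kind_sketch::keys; }

// Both call expressions have the same type: the top-level const is dropped
// from a prvalue of non-class type.
static_assert(std::is_same<decltype(kind_with_const()),
                           decltype(kind_without_const())>::value,
              "const on a by-value return of an enum changes nothing");
```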
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
There's one place that does this selection, soon there will appear
another, so it's worth having a convenience helper getter.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The options parameter is redundant. We always use
`_metadata->strategy_options()` and
`keyspace::create_replication_strategy` already assumes that
`_metadata` is set by using its other fields.
Closes scylladb/scylladb#15776
A memtable object contains two logalloc::allocating_section members
that track memory allocation requirements during reads and writes.
Because these are local to the memtable, each time we seal a memtable
and create a new one, these statistics are forgotten. As a result
we may have to re-learn the typical size of reads and writes, incurring
a small performance penalty.
The solution is to move the allocating_section object to the memtable_list
container. The workload is the same across all memtables of the same
table, so we don't lose discrimination here.
The performance penalty may increase later if we log changes to
memory reserve thresholds including a backtrace, so this reduces the
odds of incurring such a penalty.
Closes scylladb/scylladb#15737
This PR contains several refactorings related to truncation records handling in the `system_keyspace`, `commitlog_replayer` and `table` classes:
* drop map_reduce from `commitlog_replayer`, it's sufficient to load truncation records from the null shard;
* add a check that `table::_truncated_at` is properly initialized before it's accessed;
* move its initialization after `init_non_system_keyspaces`
Closes scylladb/scylladb#15583
* github.com:scylladb/scylladb:
system_keyspace: drop truncation_record
system_keyspace: remove get_truncated_at method
table: get_truncation_time: check _truncated_at is initialized
database: add_column_family: initialize truncation_time for new tables
database: add_column_family: rename readonly parameter to is_new
system_keyspace: move load_truncation_times into distributed_loader::populate_keyspace
commitlog_replayer: refactor commitlog_replayer::impl::init
system_keyspace: drop redundant typedef
system_keyspace: drop redundant save_truncation_record overload
table: rename cache_truncation_record -> set_truncation_time
system_keyspace: get_truncated_position -> get_truncated_positions
We want to make table::_truncated_at optional, so that in
get_truncation_time we can assert that it is initialized.
For existing tables this initialisation will happen in
load_truncation_times function, and for new tables we
want to initialize it in add_column_family like we do
with mark_ready_for_writes.
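A rough sketch of the intended shape, with a stand-in class rather than the real `table`. Throwing `std::logic_error` is an assumption of the sketch; the text above only says the getter asserts that the field is initialized.

```cpp
#include <chrono>
#include <optional>
#include <stdexcept>

using time_point_sketch = std::chrono::system_clock::time_point;

class table_sketch {
    // Disengaged until either load_truncation_times (existing tables) or
    // add_column_family (new tables) sets it.
    std::optional<time_point_sketch> _truncated_at;
public:
    void set_truncation_time(time_point_sketch t) { _truncated_at = t; }

    // Reading before initialization is a bug, so fail loudly instead of
    // returning a default value.
    time_point_sketch get_truncation_time() const {
        if (!_truncated_at) {
            throw std::logic_error("truncation time read before initialization");
        }
        return *_truncated_at;
    }
};
```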
Now add_column_family function has parameter 'readonly', which is
set by the callers to false if we are creating a fresh new table
and not loading it from sstables. In this commit we rename this
parameter to is_new and invert the passed values.
This will allow us in the next commit to initialize _truncated_at field
for new tables.
When a tablet is migrated into a new home, we need to clean its
storage (i.e. the compaction group) in the old home.
This includes its presence in row cache, which can be shared by
multiple tablets living in the same shard.
For exception safety, the following is done first in a "prepare
phase" during cache invalidation.
1) take a compaction guard, to stop and disable compaction
2) flush memtable(s).
3) build a list of all sstables, which represents all the
storage of the tablet.
Then, once the cache is invalidated successfully, we clear
the sstable sets of the group in the "execution phase",
to prevent any background op from incorrectly picking them
and also to allow for their deletion.
All the sstables of a tablet are deleted atomically, in order
to guarantee that a failure midway won't cause data resurrection
if the tablet happens to be migrated back into the old home.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This is a refactoring commit without observable
changes in behaviour.
There is a truncation_record struct, but in this method we
only care about time, so rename it (and other related methods)
appropriately to avoid confusion.
Load balancer will recognize decommissioning nodes and will
move tablet replicas away from such nodes with highest priority.
Topology changes now have an extra step called "tablet draining" which
calls the load balancer. The step will execute tablet migration track
as long as there are nodes which require draining. It will not do regular
load balancing.
If load balancer is unable to find new tablet replicas, because RF
cannot be met or availability is at risk due to insufficient node
distribution in racks, it will throw an exception. Currently, topology
change will retry in a loop. We should make this error cause topology
change to be aborted. There is no infrastructure for
aborts yet, so this is not implemented.
Closes #15197
* github.com:scylladb/scylladb:
tablets, raft topology: Add support for decommission with tablets
tablet_allocator: Compute load sketch lazily
tablet_allocator: Set node id correctly
tablet_allocator: Make migration_plan a class
tablets: Implement cleanup step
storage_service, tablets: Prevent stale RPCs from running beyond their stage
locator: Introduce tablet_metadata_guard
locator, replica: Add a way to wait for table's effective_replication_map change
storage_service, tablets: Extract do_tablet_operation() from stream_tablet()
raft topology: Add break in the final case clause
raft topology: Fix SIGSEGV when trace-level logging is enabled
raft topology: Set node state in topology
raft topology: Always set host id in topology
This change adds a stub for tablet cleanup on the replica side and wires
it into the tablet migration process.
The handling on replica side is incomplete because it doesn't remove
the actual data yet. It only flushes the memtables, so that all data
is in sstables and none requires a memtable flush.
This patch is necessary to make decommission work. Otherwise, a
memtable flush would happen when the decommissioned node is put in the
drained state (as in nodetool drain) and it would fail on missing host
id mapping (node is no longer in topology), which is examined by the
tablet sharder when producing sstable sharding metadata. This would lead
to an abort due to the failed memtable flush.
This field on the null shard is properly initialized
in maybe_init_schema_commitlog function, until then
we can't make decisions based on its value. This problem
can happen e.g. if add_column_family function is called
with readonly=false before maybe_init_schema_commitlog.
It will call commitlog_for to pass the commitlog to
mark_ready_for_writes and commitlog_for reads _uses_schema_commitlog.
In this commit we add protection against this case - we
trigger internal_error if _uses_schema_commitlog is read
before it is initialized.
maybe_init_schema_commitlog() was added to cql_test_env
to make boost tests work with the new invariant.
We want to switch system.scylla_local table to the
schema commitlog, but the load phases get in the way - schema
commitlog is initialized after phase1,
so a table which is using it should be moved to phase2,
but system.scylla_local contains features, and we need
them before schema commitlog initialization for
SCHEMA_COMMITLOG feature.
In this commit we are taking a different approach to
loading system tables. First, we load them all in
one pass in 'readonly' mode. In this mode, the table
cannot be written to and has not yet been assigned
a commit log. To achieve this we've added _readonly bool field
to the table class, it's initialized to true in table's
constructor. In addition, we changed the table constructor
to always assign nullptr to commitlog, and we trigger
an internal error if table.commitlog() property is accessed
while the table is in readonly mode. Then, after
triggering on_system_tables_loaded notifications on
feature_service and sstable_format_selector, we call
system_keyspace::mark_writable and eventually
table::mark_ready_for_writes which selects the
proper commitlog and marks the table as writable.
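The read-only mode can be pictured with the following sketch. The class and the `commitlog_sketch` type are stand-ins, and a thrown `std::logic_error` substitutes for the internal error mentioned above.

```cpp
#include <stdexcept>

struct commitlog_sketch {};   // stand-in for the real commitlog type

class readonly_table_sketch {
    bool _readonly = true;                   // tables start out read-only
    commitlog_sketch* _commitlog = nullptr;  // no commitlog assigned at construction
public:
    bool readonly() const { return _readonly; }

    // Selects the commitlog and makes the table writable in one step.
    void mark_ready_for_writes(commitlog_sketch* cl) {
        _commitlog = cl;
        _readonly = false;
    }

    // Accessing the commitlog while read-only is treated as an internal error.
    commitlog_sketch* commitlog() const {
        if (_readonly) {
            throw std::logic_error("commitlog accessed while table is readonly");
        }
        return _commitlog;
    }
};
```

Because the flag and the commitlog pointer change together in `mark_ready_for_writes`, no caller in this sketch can observe a writable table without a commitlog decision having been made.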
In sstable_compaction_test we drop several
mark_ready_for_writes calls since they are redundant,
the table has already been made writable in
env.make_table_for_tests call.
The table::commitlog function either returns the current
commitlog or causes an error if the table is readonly. This
didn't work for virtual tables, since they never called
mark_ready_for_writes. In this commit we add this
call to initialize_virtual_tables.
Previously, creating a table or view in
schema_tables.cc/merge_tables_and_views was a two-step process:
first adding a column family (add_column_family function) and
then marking it as ready for writes (mark_table_as_writable).
There is a yield between these stages, which means
someone could see a table or view for which the
mark_table_as_writable method had not yet been called,
and start writing to it.
This problem was demonstrated by materialised view dtests.
A view is created on all nodes. On some nodes it will be created
earlier than on others and the view rebuild process will start
writing data to that view on other nodes, where mark_table_as_writable
has not yet been called.
In this patch we solve this problem by adding a readonly parameter
to the add_column_family method. When loading tables from disk,
this flag is set to true and the mark_table_as_writable
is called only after all sstables have been loaded.
When creating a new table, this flag is set to false,
mark_table_as_writable is called from inside add_column_family
and the new table becomes visible already as writable.
Currently, the API call recalculates only per-node schema version. To
work around issues like #4485 we want to recalculate per-table
digests. One way to do that is to restart the node, but that's slow
and has impact on availability.
Use like this:
curl -X POST http://127.0.0.1:10000/storage_service/relocal_schema
Fixes #15380
Closes #15381
New file streaming for tablets will require integration with compaction
groups. So this patch introduces a way for streaming to take a storage
snapshot of a given tablet using its token range. Memtable is flushed
first, so all data of a tablet can be streamed through its sstables.
The interface is compaction group / tablet agnostic, but user can
easily pick data from a single tablet by using the range in tablet
metadata for a given tablet.
E.g.:
auto erm = table.get_effective_replication_map();
auto& tm = erm->get_token_metadata();
auto tablet_map = tm.tablets().get_tablet_map(table.schema()->id());
for (auto tid : tablet_map.tablet_ids()) {
auto tr = tablet_map.get_token_range(tid);
auto ssts = co_await table.take_storage_snapshot(tr);
...
}
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #15128
Compaction group is the data plane for tablets, so this integration
allows each tablet to have its own storage (memtable + sstables).
A crucial step for dynamic tablets, where each tablet can be worked
on independently.
There are still some inefficiencies to be worked on, but as it is,
it already unlocks further development.
```
INFO 2023-07-27 22:43:38,331 [shard 0] init - loading tablet metadata
INFO 2023-07-27 22:43:38,333 [shard 0] init - loading non-system sstables
INFO 2023-07-27 22:43:38,354 [shard 0] table - Tablet with id 0 present for ks.cf
INFO 2023-07-27 22:43:38,354 [shard 0] table - Tablet with id 2 present for ks.cf
INFO 2023-07-27 22:43:38,354 [shard 0] table - Tablet with id 4 present for ks.cf
INFO 2023-07-27 22:43:38,354 [shard 0] table - Tablet with id 6 present for ks.cf
INFO 2023-07-27 22:43:38,428 [shard 1] table - Tablet with id 1 present for ks.cf
INFO 2023-07-27 22:43:38,428 [shard 1] table - Tablet with id 3 present for ks.cf
INFO 2023-07-27 22:43:38,428 [shard 1] table - Tablet with id 5 present for ks.cf
INFO 2023-07-27 22:43:38,428 [shard 1] table - Tablet with id 7 present for ks.cf
```
Closes #14863
* github.com:scylladb/scylladb:
Kill scylla option to configure number of compaction groups
replica: Wire tablet into compaction group
token_metadata: Add this_host_id to topology config
replica: Switch to chunked_vector for storing compaction groups
replica: Generate group_id for compaction_group on demand
There's a need for compaction_group_manager, as table will still support
"tabletless" mode, and we don't want to sprinkle ifs here and there,
to support both modes. It's not really a manager (it's not even supposed
to store a state), but I couldn't find a better name.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
We aim for a large number of tablets, therefore let's switch
to chunked_vector to avoid large contiguous allocs.
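To illustrate the allocation pattern, here is a simplified chunked container. It is not the real chunked_vector; the chunk size and the class name are assumptions chosen for the sketch.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Minimal chunked container: elements live in fixed-size chunks, so growing
// never needs one allocation proportional to the total element count.
// Only the small vector of chunk pointers is contiguous.
template <typename T, std::size_t ChunkSize = 128>
class chunked_vector_sketch {
    std::vector<std::unique_ptr<std::vector<T>>> _chunks;
    std::size_t _size = 0;
public:
    void push_back(T value) {
        if (_size % ChunkSize == 0) {
            // Current chunk is full (or there is none yet): start a new one.
            auto chunk = std::make_unique<std::vector<T>>();
            chunk->reserve(ChunkSize);
            _chunks.push_back(std::move(chunk));
        }
        _chunks.back()->push_back(std::move(value));
        ++_size;
    }
    T& operator[](std::size_t i) { return (*_chunks[i / ChunkSize])[i % ChunkSize]; }
    const T& operator[](std::size_t i) const { return (*_chunks[i / ChunkSize])[i % ChunkSize]; }
    std::size_t size() const { return _size; }
    std::size_t chunk_count() const { return _chunks.size(); }
};
```

Indexing costs one extra division and pointer hop compared to a flat vector, which is the trade made for bounded allocation sizes.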
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
By default it's created with normal state, but there are some places
that need to put it into staging. Do it with the new state enum.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two of those that call each other to end up calling plain
make_sstable() one. It's simpler to patch both if they just call the
latter directly.
While at it -- drop the unused default argument.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The dependency is needed by db::schema_tables to get wasm manager for
its needs. This patch prepares the ground. Now the wasm::manager is
shared between replica::database and cql3::query_processor
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All compaction task executors, except for regular compaction one,
become task manager compaction tasks.
Creating and starting of major_compaction_task_executor is modified
to be consistent with other compaction task executors.
Closes #14505
* github.com:scylladb/scylladb:
test: extend test_compaction_task.py to cover compaction group tasks
compaction: turn custom_task_executor into compaction_task_impl
compaction: turn sstables_task_executor into sstables_compaction_task_impl
compaction: change sstables compaction tasks type
compaction: move table_upgrade_sstables_compaction_task_impl
compaction: pass task_info through sstables compaction
compaction: turn offstrategy_compaction_task_executor into offstrategy_compaction_task_impl
compaction: turn cleanup_compaction_task_executor into cleanup_compaction_task_impl
comapction: use optional task info in major compaction
compaction: use perform_compaction in compaction_manager::perform_major_compaction
By default, per-table-per-shard metrics reporting is turned off, and the
aggregated version of the metrics (per-table-per-node) will be turned
on.
There could be a situation where a user with an excessive number of
tables would suffer from performance issues, both from the network and
the metrics collection server.
This patch adds a config option, enable_node_aggregated_table_metrics, which allows
users to turn off per-table metrics reporting altogether.
For example, when running Scylla with the command line argument
'--enable-node-aggregated-table_metrics 0' per-table metrics will not be reported.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Maps related to column families in database are extracted
to a column_families_data class. Access to them is possible only
through methods. All methods which may preempt hold rwlock
in relevant mode, so that the iterators can't become invalid.
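The encapsulation can be sketched as below. Note the substitution: `std::shared_mutex` is a thread-level lock used here only as a stand-in for the rwlock that guards against preemption in the real code, and the `int` key and method names are assumptions.

```cpp
#include <map>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <utility>

class tables_metadata_sketch {
    mutable std::shared_mutex _lock;      // stand-in for the rwlock
    std::map<int, std::string> _tables;   // table id -> name (key type is an assumption)
public:
    void add_table(int id, std::string name) {
        std::unique_lock<std::shared_mutex> guard(_lock);   // write mode
        _tables.emplace(id, std::move(name));
    }
    void remove_table(int id) {
        std::unique_lock<std::shared_mutex> guard(_lock);   // write mode
        _tables.erase(id);
    }
    bool contains(int id) const {
        std::shared_lock<std::shared_mutex> guard(_lock);   // read mode
        return _tables.count(id) != 0;
    }
    // Iteration happens under the read lock, so the map cannot be modified
    // (and iterators cannot be invalidated) while the callback runs.
    template <typename Func>
    void for_each_table(Func f) const {
        std::shared_lock<std::shared_mutex> guard(_lock);
        for (const auto& [id, name] : _tables) {
            f(id, name);
        }
    }
};
```

Since the maps are private, the only way to reach them is through methods that take the lock in the appropriate mode.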
Fixes: #13290
Closes #13349
* github.com:scylladb/scylladb:
replica: make tables_metadata's attributes private
replica: add methods to get a filtered copy of tables map
replica: add methods to check if given table exists
replica: add methods to get table or table id
replica: api: return table_id instead of const table_id&
replica: iterate safely over tables related maps
replica: pass tables_metadata to phased_barrier_top_10_counts
replica: add methods to safely add and remove table
replica: wrap column families related maps into tables_metadata
replica: futurize database::add_column_family and database::remove
cleanup_compaction_task_executor inherits both from compaction_task_executor
and cleanup_compaction_task_impl.
Add a new version of compaction_manager::perform_task_on_all_files
which accepts only the tasks that are derived from compaction_task_impl.
After all task executors' conversions are done, the new version replaces
the original one.
To make it consistent with the upcoming methods, methods triggering
major compaction get std::optional<tasks::task_info> as an argument.
Thanks to that we can distinguish between a task that has no parent
and the task which won't be registered in task manager.
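The distinction can be sketched as follows. `task_info_sketch` is a stand-in for `tasks::task_info`, and representing "no parent" as a zero id is an assumption of this sketch.

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Stand-in for tasks::task_info; a zero id means "no parent" here.
struct task_info_sketch {
    std::uint64_t parent_id = 0;
};

inline std::string describe_registration(const std::optional<task_info_sketch>& info) {
    if (!info) {
        // Disengaged optional: the task is not registered at all.
        return "not registered in task manager";
    }
    if (info->parent_id == 0) {
        // Engaged optional with an empty parent: a top-level task.
        return "registered, no parent";
    }
    return "registered, child of task " + std::to_string(info->parent_id);
}
```

A plain `task_info` argument could only express the last two cases, which is why the optional wrapper is needed.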
Now that all users have opted in unconditionally, there is no point in
keeping this optional. Make it mandatory to make sure nobody
opts out by mistake.
The global override via enable_compacting_data_for_streaming_and_repair
config item still remains, allowing compaction to be forcibly turned off.
Doing to make_multishard_streaming_reader() what the previous commit did
to make_streaming_reader(). In fact, the new compaction_time parameter
is simply forwarded to the make_streaming_reader() on the shard readers.
Call sites are updated, but none opt in just yet.
Opt-in is possible by passing an engaged `compaction_time`
(gc_clock::time_point) to the method. When this new parameter is
disengaged, no compaction happens.
Note that there is a global override, via the
enable_compacting_data_for_streaming_and_repair config item, which can
force-disable this compaction.
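The combined effect of the per-call opt-in and the global override can be sketched as a single predicate. The function name and the time point alias are hypothetical; only the parameter names come from the text above.

```cpp
#include <chrono>
#include <optional>

// Stand-in for gc_clock::time_point.
using gc_time_point_sketch = std::chrono::system_clock::time_point;

// Compaction runs only if the caller opted in (engaged compaction_time)
// and the global config item has not force-disabled it.
inline bool should_compact_stream(const std::optional<gc_time_point_sketch>& compaction_time,
                                  bool enable_compacting_data_for_streaming_and_repair) {
    return enable_compacting_data_for_streaming_and_repair && compaction_time.has_value();
}
```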
Compaction done on the output of the streaming reader does *not*
garbage-collect tombstones!
All call-sites are adjusted (the new parameter is not defaulted), but
none opt in yet. This will be done in a separate commit per user.
The method is called by db::truncate_table_on_all_shards(), its call-chain, in turn, starts from
- proxy::remote::handle_truncate()
- schema_tables::merge_schema()
- legacy_schema_migrator
- tests
All of the above are easy to get system_keyspace reference from. This, in turn, allows making the method non-static and using the query_processor reference from the system_keyspace object instead of the global qctx
Closes #14778
* github.com:scylladb/scylladb:
system_keyspace: Make save_truncation_record() non-static
code: Pass sharded<db::system_keyspace>& to database::truncate()
db: Add sharded<system_keyspace>& to legacy_schema_migrator
Make _column_families and _ks_cf_to_uuid private to prevent unsafe
access. The maps can be accessed only through methods which use locks
if preemption is possible.