timestamp_func
Since in the drop_table case we want to discard ALL
sstables in the table, not only those with `max_data_age()`
up until drop started.
Fixes#11232
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Before we change drop_table_on_all_shards to always
pass db_clock::time_point::max() in the next patch,
let it pass a unique snapshot name, otherwise
the snapshot name will always be based on the constant, max
time_point.
Refs #11232
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Introduce a `remote` class that handles all remote communication in `storage_proxy`: sending and receiving RPCs, checking the state of other nodes by accessing the gossiper, and fetching schema.
The `remote` object lives inside `storage_proxy` and right now it's initialized and destroyed together with `storage_proxy`.
The long game here is to split the initialization of `storage_proxy` into two steps:
- the first step, which constructs `storage_proxy`, initializes it "locally" and does not require references to `messaging_service` and `gossiper`.
- the second step will take those references and add the `remote` part to `storage_proxy`.
This will allow us to remove some cycles from the service (de)initialization order and in general clean it up a bit. We'll be able to start `storage_proxy` right after the `database` (without messaging/gossiper). Similar refactors are planned for `query_processor`.
Closes#11088
* github.com:scylladb/scylladb:
service: storage_proxy: pass `migration_manager*` to `init_messaging_service`
service: storage_proxy: `remote`: make `_gossiper` a const reference
gms: gossiper: mark some member functions const
db: consistency_level: `filter_for_query`: take `const gossiper&`
replica: table: `get_hit_rate`: take `const gossiper&`
gms: gossiper: move `endpoint_filter` to `storage_proxy` module
service: storage_proxy: pass `shared_ptr<gossiper>` to `start_hints_manager`
service: storage_proxy: establish private section in `remote`
service: storage_proxy: remove `migration_manager` pointer
service: storage_proxy: remove calls to `storage_proxy::remote()` from `remote`
service: storage_proxy: remove `_gossiper` field
alternator: ttl: pass `gossiper&` to `expiration_service`
service: storage_proxy: move `truncate_blocking` implementation to `remote`
service: storage_proxy: introduce `is_alive` helper
service: storage_proxy: remove `_messaging` reference
service: storage_proxy: move `connection_dropped` to `remote`
service: storage_proxy: make `encode_replica_exception_for_rpc` a static function
service: storage_proxy: move `handle_write` to `remote`
service: storage_proxy: move `handle_paxos_prune` to `remote`
service: storage_proxy: move `handle_paxos_accept` to `remote`
service: storage_proxy: move `handle_paxos_prepare` to `remote`
service: storage_proxy: move `handle_truncate` to `remote`
service: storage_proxy: move `handle_read_digest` to `remote`
service: storage_proxy: move `handle_read_mutation_data` to `remote`
service: storage_proxy: move `handle_read_data` to `remote`
service: storage_proxy: move `handle_mutation_failed` to `remote`
service: storage_proxy: move `handle_mutation_done` to `remote`
service: storage_proxy: move `handle_paxos_learn` to `remote`
service: storage_proxy: move `receive_mutation_handler` to `remote`
service: storage_proxy: move `handle_counter_mutation` to `remote`
service: storage_proxy: remove `get_local_shared_storage_proxy`
service: storage_proxy: (de)register RPC handlers in `remote`
service: storage_proxy: introduce `remote`
And pass a reference to it to the database rather
than having the database construct its own compaction_manager.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This series is the first step in the effort to reduce the number of metrics reported by Scylla.
The series focuses on the per-table metrics.
The combination of histograms, per-tables, and per shard makes the number of metrics in a cluster explode.
The following series uses multiple tools to reduce the number of metrics.
1. Multiple metrics should only be reported for the user tables and the condition that checked it was not updated when more non-user keyspaces were added.
2. Second, instead of a histogram, per table, per shard, it will report a summary per table, per shard, and a single histogram per node.
3. Histograms, summaries, and counters will be reported only if they are used (for example, the cas-related metrics will not be reported for tables that are not using cas).
Closes#11058
* github.com:scylladb/scylla:
Add summary_test
database: Reduce the number of per-table metrics
replica/table.cc: Do not register per-table metrics for system
histogram_metrics_helper.hh: Add to_metrics_summary function
Unified histogram, estimated_histogram, rates, and summaries
Split the timed_rate_moving_average into data and timer
utils/histogram.hh: should_sample should use a bitmask
estimated_histogram: add missing getter method
This patch reduces the number of metrics that is reported per table, when
the per-table flag is on.
When possible, it moves from time_estimated_histogram and
timed_rate_moving_average_and_histogram to use the unified timer.
Instead of a histogram per shard, it will now report a summary per shard
and a histogram per node.
Counters, histograms, and summaries will not be reported if they were
never used.
The API was updated accordingly so it would not break.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
And let seal_active_memtable decide about how to handle them
as now all flush error handling logic is implemented there.
In particular, unlike today, sstable write errors will
cause internal error rather than loop forever.
Also, check for shutdown earlier to ignore errors
like semaphore_broken that might happen when
the table is stopped.
Refs #10498
(The issue will be considered fixed when going
into maintenance mode on write errors rather than
throwing internal error and potentially retrying forever)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently flush is retried both by dirty_memory_manager::flush_when_needed
and table::seal_active_memtable, which may be called by other paths
like table::flush.
Unify the retry logic into seal_active_memtable so that
we have similar error handling semantics on all paths.
Refs #4174
Refs #10498
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that everything prior to flush_one is noexcept
make table::seal_active_memtable and the paths that call it
noexcept, making sure that any errors are returned only
as exceptional futures, and handle them in flush_when_needed().
The original handle_exception had a broader scope than now needed,
so this change is mostly technical, to show that we can narrow down
the error handling to the continuation of flush_one - and verify that
the unit test is not broken.
A later patch moves this error handling logic away to seal_active_memtable.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
region_group is an abstraction that allows accounting for groups of
regions, but the cost/benefit ratio of maintaining the abstraction
is poor. Each time we need to change decision algorithm of memtable
flushing (admittedly rarely), we need to distill that into an abstraction
for region_groups and then use it. An example is virtual regions groups;
we wanted to account for the partially flushed memtables and had to
invent region groups to stand in their place.
Rather than continuing to invest in the abstraction, break it now
and move it to the memtable dirty memory manager which is responsible
for making those decisions. The relevant code is moved to
dirty_memory_manager.hh and dirty_memory_manager.cc (new file), and
a new unit test file is added as well.
A downside of the change is that unit testing will be more difficult.
This patch makes memtable_flush_static_shares liveupdateable
to avoid having to restart the cluster after updating
this config.
Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>
This work gets us a step closer to compaction groups.
Everything in compaction layer but compaction_manager was converted to table_state.
After this work, we can start implementing compaction groups, as each group will be represented by its own table_state. User-triggered operations that span the entire table, not only a group, can be done by calling the manager operation on behalf of each group and then merging the results, if any.
Closes#11028
* github.com:scylladb/scylla:
compaction: remove forward declaration of replica::table
compaction_manager: make add() and remove() switch to table_state
compaction_manager: make run_custom_job() switch to table_state
compaction_manager: major: switch to table_state
compaction_manager: scrub: switch to table_state
compaction_manager: upgrade: switch to table_state
compaction: table_state: add get_sstables_manager()
compaction_manager: cleanup: switch to table_state
compaction_manager: offstrategy: switch to table_state()
compaction_manager: rewrite_sstables(): switch to table_state
compaction_manager: make run_with_compaction_disabled() switch to table_state
compaction_manager: compaction_reenabler: switch to table_state
compaction_manager: make submit(T) switch to table_state
compaction_manager: task: switch to table_state
compaction: table_state: Add is_auto_compaction_disabled_by_user()
compaction: table_state: Add on_compaction_completion()
compaction: table_state: Add make_sstable()
compaction_manager: make can_proceed switch to table_state
compaction_manager: make stop compaction procedures switch to table_state
compaction_manager: make get_compactions() switch to table_state
compaction_manager: change task::update_history() to use table_state instead
compaction_manager: make can_register_compaction() switch to table_state
compaction_manager: make get_candidates() switch to table_state
compaction_manager: make propagate_replacement() switch to table_state
compaction: Move table::in_strategy_sstables() and switch to table_state
compaction: table_state: Add maintenance sstable set
compaction_manager: make has_table_ongoing_compaction() switch to table_state
compaction_manager: make compaction_disabled() switch to table_state
compaction_manager: switch to table_state for mapping of compaction_state
compaction_manager: move task ctor into source
Runs drop_column_family on all database shards.
Will be extended later to consider removing the table directory.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The idea is that we'll have a single on-completion interface for both
"in-strategy" and off-strategy compactions, so not to pollute table_state
with one interface for each.
replica::table::on_compaction_completion is being moved into private namespace.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
in_strategy_sstables() doesn't have to be implemented in table, as it's
simply about main set with maintenance and staging files filtered out.
Also, let's make it switch to table_state as part of ongoing work.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
storage_service/keyspaces?type=user along with user keyspaces returned
the keyspaces that were internal but non-system.
The list of the keyspaces for the user option
(storage_service/keyspaces?type=user) contains neither system nor
internal but only user keyspaces.
Fixes: #11042Closes#11049
Now update_sstable_lists_on_off_strategy_completion() and
on_compaction_completion() can be called from the same unified
interface.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
To make it possible to add a single interface in table_state for
updating sstable list on behalf of both off-strategy and in-strategy
compactions, update_sstable_lists_on_off_strategy_completion() will
work with compaction_completion_desc too for describing sstable set
changes.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Currently, applying schema mutations involves flushing all schema
tables so that on restart commit log replay is performed on top of
latest schema (for correctness). The downside is that schema merge is
very sensitive to fdatasync latency. Flushing a single memtable
involves many syncs, and we flush several of them. It was observed to
take as long as 30 seconds on GCE disks under some conditions.
This patch changes the schema merge to rely on a separate commit log
to replay the mutations on restart. This way it doesn't have to wait
for memtables to be flushed. It has to wait for the commitlog to be
synced, but this cost is well amortized.
We put the mutations into a separate commit log so that schema can be
recovered before replaying user mutations. This is necessary because
regular writes have a dependency on schema version, and replaying on
top of latest schema satisfies all dependencies. Without this, we
could get loss of writes if we replay a write which depends on the
latest schema on top of old schema.
Also, if we have a separate commit log for schema we can delay schema
parsing for after the replay and avoid complexity of recognizing
schema transactions in the log and invoking the schema merge logic.
One complication with this change is that replay_position markers are
commitlog-domain specific and cannot cross domains. They are recorded
in various places which survive node restart: sstables are annotated
with the maximum replay position, and they are present inside
truncation records. The former annotation is used by "truncate"
operation to drop sstables. To prevent old replay positions from being
interpreted in the context in the new schema commitlog domain, the
change refuses to boot if there are truncation records, and also
prohibits truncation of schema tables.
The boot sequence needs to know whether the cluster feature associated
with this change was enabled on all nodes. Fetaures are stored in
system.scylla_local. Because we need to read it before initializing
schema tables, the initialization of tables now has to be split into
two phases. The first phase initializes all system tables except
schema tables, and later we initialize schema tables, after reading
stored cluster features.
The commitlog domain is switched only when all nodes are upgraded, and
only after new node is restarted. This is so that we don't have to add
risky code to deal with hot-switching of the commitlog domain. Cold
switching is safer. This means that after upgrade there is a need for
yet another rolling restart round.
Fixes#8272Fixes#8309Fixes#1459
This reverts commit aa8f135f64, reversing
changes made to 9a88bc260c. The patch
causes hangs during flush.
Also reverts parts of 411231da75 that impacted the unit test.
Fixes#10897.
- Use `sstables::generation_type` in more places
- Enforce conceptual separation of `sstables::generation_type` and `int64_t`
- Fix `extremum_tracker` so that `sstables::generation_type` can be non-default-constructible
Fixes#10796.
Closes#10844
* github.com:scylladb/scylla:
sstables: make generation_type an actual separate type
sstables: use generation_type more soundly
extremum_tracker: do not require default-constructible value types
This series decouples the staging sstables from the table's sstable set.
The current behavior keeps the sstables in the staging directory until view building is done. They are readable as any other sstable, but fenced off from compaction, so they don't go away in the meanwhile.
Currently, when views are built, the sstables are moved into the main table directory where they will then be compacted normally.
The problem with this design is that the staging sstables are never compacted, in particular they won't get cleaned up or scrubbed.
The cleanup scenario open a backdoor for data resurrection when the staging sstables are moved after view building while possibly containing stale partitions (#9559) which will not be cleaned up until next time cleanup compaction is performed.
With this series, SSTables that are created in or moved to the staging sub-directory are "cloned" into the base table directory by hard-linking the components there and creating a new sstable object which loads the cloned files.
The former, in the staging directory is used solely for view building and is not added to the table's sstable set, while the latter, its clone, behaves like any other sstable and is added either to the regular or maintenance set and is read and compacted normally.
When view building is done, instead of moving the staging sstable into the table's base directory, it is simply unlinked.
If its "clone" wasn't compacted away yet, then it will just remain where it is, exactly like it would be after it was moved there in the present state of things. If it was already compacted and no longer exists, then unlinking will then free its storage.
Note that snapshot is based on the sstables listed by the table, which do not include the staging sstables with this change.
But that shouldn't matter since even today, the sstables in the snapshot has no notion of "staging" directory and it is expected that the MV's are either updated view `nodetool refresh` if restoring sstables from snapshot using the uploads dir, or if restoring the whole table from backup - MV's are effectively expected to be rebuilt from scratch (they are not included in automatic snapshots anyway since we don't have snapshot-coherency across tables).
A fundamental infrastructure change was done to achieve that which is to change the sstable_list which was a std::unordered_set<shared_sstable> into a std::unordered_map<generation_type, shared_sstable> that keeps the shared_sstable objects indexed by generation number (that must be unique). With this model, sstables are supposed to be searched by the generation number, not by their pointer, since when the staging sstable is clones, there will be 2 shared_sstable objects with the same generation (and different `dir()`) and we must distinguish between them.
Special care was taken to throw a runtime_error exception if when looking up a shared sstable and finding another one with the same generation, since they must never exist in the same sstable_map.
Fixes#9559Closes#10657
* github.com:scylladb/scylla:
table: clone staging sstables into table dir
view_update_generator: discover_staging_sstables: reindent
table: add get_staging_sstables
view_update_generator: discover_staging_sstables: get shared table ptr earlier
distributed_loader: populate table directory first
sstables: time_series_sstable_set: insert: make exception safe
sstables: move_to_new_dir: fix debug log message
clone staging sstables so their content may be compacted while
views are built. When done, the hard-linked copy in the staging
subdirectory will be simply unlinked.
Fixes#9559
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We don't have to go over all sstables in the table to select the
staging sstables out of them, we can get it directly from the
_sstables_staging map.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Adds statistics which count how many times a replica has decided to
reject a write ("total_writes_rate_limited") or a read
("total_reads_rate_limited").
Adds the `db::rate_limiter` to the `database` class and modifies the
`query` and `apply` methods so that they account the read/write
operations in the rate limiter and optionally reject them.
`generation_type` is (supposed to be) conceptually different from
`int64_t` (even if physically they are the same), but at present
Scylla code still largely treats them interchangeably.
In addition to using `generation_type` in more places, we
provide (no-op) `generation_value()` and `generation_from_value()`
operations to make the smoke-and-mirrors more believable.
The churn is considerable, but all mechanical. To avoid even
more (way, way more) churn, unit test code is left untreated for
now, except where it uses the affected core APIs directly.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Make struct scheduling_group be sub-class of the backlog controller. Its
new meaning is now -- the group under controller maintenance. Both
database and compaction manager derive their sched groups from this one.
This makes backlog controller construction simpler, prepares the ground
for sched groups unification in seastar and facilitates next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Similar to previous patch that made the same for compaction manager. The
newly introduced private scheduling_group class is temporary and will go
away in next patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
If we reach a situation where flush rate exceeds compaction rate, we may
end up with arbitrarily large number of sstables on disk. If a read is
executed in such case, the amount of memory required is proportional to
the number of sstables for the given shard, which in extreme cases can
lead to OOM.
In the wild, this was observed in 2 scenarios:
- A node with >10 shards creates a keyspace with thousands of tables,
drops the keyspace and shuts down before compaction finishes. Dropping
keyspace drops tables, and each dropped table is smp::count writes to
system.local table with flush after write, which creates tens of
thousands of sstables. Bootstrap read from system.local will run OOM.
- A failure to agree on table schema (due to a code bug) between nodes
during repair resulted in excessive flushing of small sstables which
compaction couldn't keep up with.
In the unit test introduced in this patch series it can be proved that
even hard setting maximum shares for compaction and minimum shares for
flushing doesn't tilt the balance towards compaction enough to prevent
the problem. Since it's a fast producer, slow consumer problem, the
remaining solution is to block producer until the consumer catches up.
If there are too many table runs originating from memtable, we block the
current flush until the number of sstables is reduced (via ongoing
compaction or a truncate operation).
The manager reference is already available in constructor and thus
can be copied to on-table member.
The code that chooses the manager (user/system one) should be moved
from make_column_family_config() into add_column_family() method.
Once this happens, the get_sstables_manager() should be fixed to
return the reference from its new location. While at it -- mark the
method in question noexcept and add it's mutable overload.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In core code there's only one place that constructs table -- in
database.cc -- and this place currently has the sstables_manager pointer
sitting on table config (despite it's a pointer, it's always non-null).
All the tests always use the manager from one of _env's out there.
For now the new contructor arg is unused.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
No functional changes intended - this series is quite verbose,
but after it's in, it should be considerably easier to change
the type of SSTable generations to something else - e.g. a string
or timeUUID.
Closes#10533
And move the logic from snapshot-ctl down to the
replica::database layer.
A following patch will move the flush phase
from the replica::table::snapshot layer
out to the caller.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
"
There's a cql_type_parser::parse() method that needs to get user
types for a keyspace by its name. For this it uses the global
storage proxy instance as a place to get database from. This set
introduces an abstract user_types_storage helper object that's
responsible in providing the user types for the caller.
This helper, in turn, is provided to the parse() method by the
database itself or by the schema_ctxt object that needs parse()
to unfreeze schemas and doesn't have database at those times.
This removes one more get_storage_proxy() call.
"
* 'br-user-types-storage' of https://github.com/xemul/scylla:
cql_type_parser: Require user_types_storage& in parse()
schame_tables: Add db/ctxt args here and there
user_types: Carry storage on database and schema_ctxt
data_dictionary: Introduce user types storage