"
The view_info object, which is attached to the schema object of the
view, contains a data structure called
"base_non_pk_columns_in_view_pk". This data structure contains column
ids of the base table so is valid only for a particular version of the
base table schema. This data structure is used by materialized view
code to interpret mutations of the base table, those coming from base
table writes, or reads of the base table done as part of view updates
or view building.
The base table schema version of that data structure must match the
schema version of the mutation fragments, otherwise we hit undefined
behavior. This may include aborts, exceptions, segfaults, or data
corruption (e.g. writes landing in the wrong column in the view).
Before this patch, we could get schema version mismatch here after the
base table was altered. That's because the view schema did not change
when the base table was altered.
Another problem was that view building was using the current table's schema
to interpret the fragments and invoke view building. That's incorrect for two
reasons. First, fragments generated by a reader must be accessed only using
the reader's schema. Second, base_non_pk_columns_in_view_pk of the recorded
view ptrs may not longer match the current base table schema, which is used
to generate the view updates.
Part of the fix is to extract base_non_pk_columns_in_view_pk into a
third entity called base_dependent_view_info, which changes both on
base table schema changes and view schema changes.
It is managed by a shared pointer so that we can take immutable
snapshots of it, just like with schema_ptr. When starting the view
update, the base table schema_ptr and the corresponding
base_dependent_view_info have to match. So we must obtain them
atomically, and base_dependent_view_info cannot change during update.
Also, whenever the base table schema changes, we must update
base_dependent_view_infos of all attached views (atomically) so that
it matches the base table schema.
Fixes#7061.
Tests:
- unit (dev)
- [v1] manual (reproduced using scylla binary and cqlsh)
"
* tag 'mv-schema-mismatch-fix-v2' of github.com:tgrabiec/scylla:
db: view: Refactor view_info::initialize_base_dependent_fields()
tests: mv: Test dropping columns from base table
db: view: Fix incorrect schema access during view building after base table schema changes
schema: Call on_internal_error() when out of range id is passed to column_at()
db: views: Fix undefined behavior on base table schema changes
db: views: Introduce has_base_non_pk_columns_in_view_pk()
(cherry picked from commit 3daa49f098)
After 8014c7124, cleanup can potentially pick a compacting SSTable.
Upgrade and scrub can also pick a compacting SSTable.
The problem is that table::candidates_for_compaction() was badly named.
It misleads the user into thinking that the SSTables returned are perfect
candidates for compaction, but manager still need to filter out the
compacting SSTables from the returned set. So it's being renamed.
When the same SSTable is compacted in parallel, the strategy invariant
can be broken like overlapping being introduced in LCS, and also
some deletion failures as more than one compaction process would try
to delete the same files.
Let's fix scrub, cleanup and ugprade by calling the manager function
which gets the correct candidates for compaction.
Fixes#6938.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200811200135.25421-1-raphaelsc@scylladb.com>
(cherry picked from commit 11df96718a)
The patch implements:
- /storage_service/auto_compaction API endpoint
- /column_family/autocompaction/{name} API endpoint
Those APIs allow to control and request the status of background
compaction jobs for the existing tables.
The implementation introduces the table::_compaction_disabled_by_user.
Then the CompactionManager checks if it can push the background
compaction job for the corresponding table.
New members
===
table::enable_auto_compaction();
table::disable_auto_compaction();
bool table::is_auto_compaction_disabled_by_user() const
Test
===
Tests: unit(sstable_datafile_test autocompaction_control_test), manual
$ ninja build/dev/test/boost/sstable_datafile_test
$ ./build/dev/test/boost/sstable_datafile_test --run_test=autocompaction_control_test -- -c1 -m2G --overprovisioned --unsafe-bypass-fsync 1 --blocked-reactor-notify-ms 2000000
The test tries to submit a compaction job after playing
with autocompaction control table switch. However, there is
no reliable way to hook pending compaction task. The code
assumed that with_scheduling_group() closure will never
preempt execution of the stats check.
Revert
===
Reverts commit c8247ac. In previous version the execution
sometimes resulted into the following error:
test/boost/sstable_datafile_test.cc(1076): fatal error: in "autocompaction_control_test":
critical check cm->get_stats().pending_tasks == 1 || cm->get_stats().active_tasks == 1 has failed
This version adds a few sstables to the cf, starts
the compaction and awaits until it is finished.
API change
===
- `/column_family/autocompaction/` always returned `true` while answering to the question: if the autocompaction disabled (see https://github.com/scylladb/scylla-jmx/blob/master/src/main/java/org/apache/cassandra/db/ColumnFamilyStore.java#L321). now it answers to the question: if the autocompaction for specific table is enabled. The question logic is inverted. The patch to the JMX is required. However, the change is decent because all old values were invalid (it always reported all compactions are disabled).
- `/column_family/autocompaction/` got support for POST/DELETE per table
Fixes
===
Fixes#1488Fixes#1808Fixes#440
Signed-off-by: Ivan Prisyazhnyy <ivan@scylladb.com>
Reviewed-by: Glauber Costa <glauber@scylladb.com>
The shutdown process of compaction manager starts with an explicit call
from the database object. However that can only happen everything is
already initialized. This works well today, but I am soon to change
the resharding process to operate before the node is fully ready.
One can still stop the database in this case, but reshardings will
have to finish before the abort signal is processed.
This patch passes the existing abort source to the construction of the
compaction_manager and subscribes to it. If the abort source is
triggered, the compaction manager will react to it firing and all
compactions it manages will be stopped.
We still want the database object to be able to wait for the compaction
manager, since the database is the object that owns the lifetime of
the compaction manager. To make that possible we'll use a future
that is return from stop(): no matter what triggered the abort, either
an early abort during initial resharding or a database-level event like
drain, everything will shut down in the right order.
The abort source is passed to the database, who is responsible from
constructing the compaction manager.
Tests: unit (dev), manual start+stop, manual drain + stop
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20200506184749.98288-1-glauber@scylladb.com>
There's no indication that data needed for generating view updates
from staging sstables is going to be immediately useful for the
user, and a large amount of it can push hot rows out of the cache,
thus deteriorating performance.
Fixes#6233
Tests: unit(dev)
There is no reason to read a single SSTable at a time from the staging
directory. Moving SSTables from staging directory essentially involves
scanning input SSTables and creating new SSTables (albeit in a different
directory).
We have a mechanism that does that: compactions. In a follow up patch, I
will introduce a new specialization of compaction that moves SSTables
from staging (potentially compacting them if there are plenty).
In preparation for that, some signatures have to be changed and the
view_updating_consumer has to be more compaction friendly. Meaning:
- Operating with an sstable vector
- taking a table reference, not a database
Because this code is a bit fragile and the reviewer set is fundamentally
different from anything compaction related, I am sending this separately
* glommer-view_build:
staging: potentially read many SSTables at the same time
view_build_test: make sure it works with smp > 1
There is no reason to read a single SSTable at a time from the staging
directory. Moving SSTables from staging directory essentially involves
scanning input SSTables and creating new SSTables (albeit in a different
directory).
We have a mechanism that does that: compactions. In a follow up patch, I
will introduce a new specialization of compaction that moves SSTables
from staging (potentially compacting them if there are plenty).
In preparation for that, some signatures have to be changed and the
view_updating_consumer has to be more compaction friendly. Meaning:
- Operating with an sstable vector
- taking a table reference, not a database
Because this code is a bit fragile and the reviewer set is fundamentally
different from anything compaction related, I am sending this separately
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Rename inherited metrics cas_propose and cas_commit
to cas_accept and cas_learn respectively.
A while ago we made a decision to stick to widely accepted
terms for Paxos rounds: prepare, accept, learn. The rest
of the code is using these terms, so rename the metrics
to avoid confusion/technical debt.
While at it, rename a few internal methods and functions.
Fixes#6169
Message-Id: <20200414213537.129547-1-kostja@scylladb.com>
This reverts commit 1c444b7e1e. The test
it adds sometimes fails as follows:
test/boost/sstable_datafile_test.cc(1076): fatal error: in "autocompaction_control_test":
critical check cm->get_stats().pending_tasks == 1 || cm->get_stats().active_tasks == 1 has failed
Ivan is working on a fix, but let's revert this commit to avoid blocking
next promotion failing from time to time.
This patch adds API endpoint /column_family/autocompaction/{name}
that listen to GET and POST requests to pick and control table
background compactions.
To implement that the patch introduces "_compaction_disabled_by_user"
flag that affects if CompactionManager is allowed to push background
compactions jobs into the work.
It introduces
table::enable_auto_compaction();
table::disable_auto_compaction();
bool table::is_auto_compaction_disabled_by_user() const
to control auto compaction state.
Fixes#1488Fixes#1808Fixes#440
Tests: unit(sstable_datafile_test autocompaction_control_test), manual
There is no reason why the table code has to be aware of the efforts of
rewriting (cleanup, scrub, upgrade) an SSTable versus compacting it.
Rewrite is special, because we need to do it one SSTable at a time,
without lumping it together. However, the compaction manager is totally
capable of doing that itself. If we do that, the special
"table::rewrite_sstables" can be killed.
This code would maybe be better off as a thread, where we wouldn't need
to keep state. However there are some methods like maybe_stop_on_error()
that expect a future so I am leaving this be for now. This is a cleanup
that can be done later.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20200401162722.28780-2-glauber@scylladb.com>
During bootstrap and replace operations the node can't take reads and
we'd like to see the process ending ASAP. This is because until the
process ends, we keep having to duplicate writes to an extended set. Not
to mention, in the case of a cluster expansion users want to use the
added capacity sooner rather than later.
Streaming generates a lot of compaction activity, that competes with the
bootstrap itself, slowing it down.
Long term, we are moving to treat those compactions differently and
maybe postpone them altogether. However for now we can reduce the amount
of compactions by increasing the minimum threshold of SSTables that have
to accumulate before they are selected for compactions. The default is
2, meaning we will trigger a compaction every time 2 SSTables of about
the same size are found (for STCS, others follow a similar pattern).
Until we have offstrategy infrastructure we don't want the compactions
to stop happening altogether so the reads, when they start, don't
suffer. This patch sets the minimum threshold to 16 (for the default
max_threshold of 32), meaning we will generate a lot less compaction
activity during streaming. Once streaming is done we revert it to its
original.
Unfortunately there isn't much we can do at the moment about decommission.
During decommission the nodes receiving data are also taking reads and
we don't want SSTables to accumulate.
Fixes#5109
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Before this patch, when db/view/view.hh was modified, 89 source files had to
be recompiled. After this patch, this number is down to 5.
Most of the irrelevant source files got view.hh by including database.hh,
which included view.hh just for the definition of statistics. So in this
patch we split the view statistics to a separate header file, view_stats.hh,
and database.hh only includes that. A few source files which included
only database.hh and also needed view.hh (for materialized-view related
functions) now need to include view.hh explicitly.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200319121031.540-1-nyh@scylladb.com>
The function already takes schema so there's no need
for it to take partitioner. It can be obtained using
schema::get_partitioner
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
The header sits in many other headers, but there's a handy
schema_fwd.hh that's tiny and contains needed declarations
for other headers. So replace shema.hh with schema_fwd.hh
in most of the headers (and remove completely from some).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200303102050.18462-1-xemul@scylladb.com>
"
Timeouts defaulted to `db::no_timeout` are dangerous. They allow any
modifications to the code to drop timeouts and introduce a source of
unbounded request queue to the system.
This series removes the last such default timeouts from the code. No
problems were found, only test code had to be updated.
tests: unit(dev)
"
* 'no-default-timeouts/v1' of https://github.com/denesb/scylla:
database: database::query*(), database::apply*(): remove default timeouts
database: table::query(): remove default timeout
mutation_query: data_query(): remove default timeout
mutation_query: mutation_query(): remove default timeout
multishard_mutation_query: query_mutations_on_all_shards(): remove default timeout
reader_concurrency_semaphore: wait_admission(): remove default timeout
utils/logallog: run_when_memory_available(): remove default timeout
We have a few kind of queries whose memory consumption is not limited at
all. One of these is reverse queries, which reads entire partitions into
memory, before reversing them. These partitions can be larger than
memory and thus such a query can single-handedly cause OOM.
This patch introduces a configuration for a memory limit for such
queries. This will serve as a hard limit and queries which attempt to
use more memory than this, will be aborted.
The limit is propagated to table objects, with the intention of keeping
system tables unlimited. These tables are usually small and initiators
of system queries are not prepared for failures.
This series adds an option to the API that supports deleting
a specific table from a snapshot.
The implementation works in a similar way to the option
to specify specific keyspaces when deleting a snapshot.
The motivation is to allow reducing disk-space when using
the snapshot for backup. A dtest PR is sent to the dtest
repository.
Fixes#5658
Original PR #5805
Tests: (database_test) (dtest snapshot_test.py:TestSnapshot.test_cleaning_snapshot_by_cf)
* amnonh/delete_table_snapshot:
test/boost/database_test: adopt new clear_snapshot signature
api/storage_service: Support specifying a table when deleting a snapshot
storage_service: Add optional table name to clear snapshot
* amnonh/delete_table_snapshot:
test/boost/database_test: adopt new clear_snapshot signature
api/storage_service: Support specifying a table when deleting a snapshot
storage_service: Add optional table name to clear snapshot
Row cache needs to be invalidated whenever data in sstables
changes. Cleanup removes data from sstables which doesn't belong to
the node anymore, which means cache must be invalidated on cleanup.
Currently, stale data can be returned when a node re-owns ranges which
data are still stored in the node's row cache, because cleanup didn't
invalidate the cache."
Fixes#4446.
tests:
- unit tests (dev mode)
- dtests:
update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_decommission_node_2_test
cleanup_test.py
This descriptor contain all information needed for table to be properly
updated on compaction completion. A new member will be added to it soon,
which will store ranges to be invalidated in row cache on behalf of
cleanup compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Now the database keeps reference on feature service, so we
can listen on the feature in it directly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are cases when it is useful to delete specific table from a
snapshot.
An example is when a snapshot is used for backup. Backup can take a long
period of time, during that time, each of the tables can be deleted once
it was backup without waiting for the entire backup process to
completed.
This patch adds such an option to the database and to the storage_service
wrapping method that calls it.
If a table is specified a filter function is created that filter only
the column family with that given name.
This is similar to the filtering at the keyspace level.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
When replaying a hint with a destination node that is no longer in the
cluster, it will be sent with cl=ALL to all its new replicas. Before
this patch, the MUTATION verb was used, which causes such hints to be
handled on the same connection and with the same priority as regular
writes. This can cause problems when a large number of hints is
orphaned and they are scheduled to be sent at once. Such situation
may happen when replacing a dead node - all nodes that accumulated hints
for the dead node will now send them with cl=ALL to their new replicas.
This patch changes the verb used to send such hints to HINT_MUTATION.
This verb is handled on a separate connection and with streaming
scheduling group, which gives them similar priority to non-orphaned
hints.
Refs: #4712
Tests: unit(dev)
" from Botond
Nodetool scrub rewrites all sstables, validating their data. If corrupt
data is found the scrub is aborted. If the skip-corrupted flag is set,
corrupt data is instead logged (just the keys) and skipped.
The scrubbing algorithm itself is fairly simple, especially that we
already have a mutation stream validator that we can use to validate the
data. However currently scrub is piggy-backed on top of cleanup
compaction. To implement this flag, we have to make scrub a separate
compaction type and propagate down the flag. This required some
massaging of the code:
* Add support for more than two (cleanup or not) compaction types.
* Allow passing custom options for each compaction type.
* Allow stopping a compaction without the manager retrying it later.
Additionally the validator itself needed some changes to allow different
ways to handle errors, as needed by the scrub.
Fixes: #5487
* https://github.com/denesb/nodetool-scrub-skip-corrupted/v7:
table: cleanup_sstables(): only short-circuit on actual cleanup
compaction: compaction_type: add Upgrade
compaction: introduce compaction_options
compaction: compaction_descriptor: use compaction options instead of
cleanup flag
compaction_manager: collect all cleanup related logic in
perform_cleanup()
sstables: compaction_stop_exception: add retry flag
mutation_fragment_stream_validator: split into low-level and
high-level API
compaction: introduce scrub_compaction
compaction_manager: scrub: don't piggy-back on upgrade_sstables()
test: sstable_datafile_test: add scrub unit test
Take const schema& as a parameter of shard_of and
use it to obtain partitioner instead of calling
global_partitioner().
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Currently the call chain for a cleanup collection looks like this:
compaction_manager::perform_cleanup()
compaction_manager::rewrite_sstables()
table::cleanup_sstables()
...
`perform_cleanup()` is essentially empty, immediately deferring to
`rewrite_sstables()`. Cleanup related logic is scattered between the
latter two methods on the call chain. These methods however recently
started serving as generic methods for compactions that want to
rewrite each sstable one-by-one, collecting cleanup related ifs in
various places.
The reason is historic, we first had cleanup, then bolted others on top,
trying to share the underlying code as much as possible.
It is time this is cleaned up (pun intended). Make `perform_cleanup()`
the place where all cleanup related logic is, with the rest of the stack
made truly generic.
Instead of the restrictive `cleanup` boolean flag, which allows for choosing
between only two compaction types, use `compaction_options`, which in
addition to allowing any number of compaction types to be selected,
also allows seamlessly passing specific options to them.
The goal is to have token_metadata reference intide the
keyspace_metadata.validate method. This can be acheived
by doing the validation through the database reference
which is "at hands" in migration_manager.
While at it, merge the validation with exists/not-exists
checks done in the same places.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
"
There's a lot of code around that needs storage service purely to
get the specific feature value (cluster_supports_<something> calls).
This creates several circular dependencies, e.g. storage_service <->
migration_manager one and database <-> storage_servuce. Also features
sit on storage_service, but register themselfs on the feature_service
and the former subscribes on them back which also looks strange.
I propose to keep all the features on feature_service, this keeps the
latter intependent from other components, makes it possible to break
one of the mentioned circle dependencyand heavily relax the other.
Also the set helps us fighting the globals and, after it, the
feature_service can be safely stopped at the very last moment.
Tests: unit(dev), manual debug build start-stop
"
* 'br-features-to-service-5' of https://github.com/xemul/scylla:
gossiper: Avoid string merge-split for nothing
features: Stop on shutdown
storage_service: Remove helpers
storage_service: Prepare to switch from on-board feature helpers
cql3: Check feature in .validate
database: Use feature service
storage_proxy: Use feature service
migration_manager: Use feature service
start: Pass needed feature as argument into migrate_truncation_records
features: Unfriend storage_service
features: Simplify feature registration
features: Introduce known_feature_set
features: Move disabled features set from storage_service
features: Move schema_features helper
features: Move all features from storage_service to feature_service
storage_service: Use feature_config from _feature_service
features: Add feature_config
storage_service: Kill set_disabled_features
gms: Move features stuff into own .cc file
migration_manager: Move some fns into class
Allows caller to check/wait for a given user keyspace to finish
populating on boot.
Can be called at any time, though if called before population
starts, it will wait until it either starts and we can determine
that the keyspace does not need populating, or population finishes.
tests: unit
Message-Id: <20200203151712.10003-1-calle@scylladb.com>
The disk-error-handler is purely auxiliary thing that helps
propagating IO errors to the rest of the code. It well
deserves not sitting in the root namespace.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200207112443.18475-1-xemul@scylladb.com>
Truncation time is used on each LWT request now, so reading it from
the table is too heave operation to be on a fast path. It also requires
jumping to a shard that contains corresponding data. This patch caches
the data on the table object of each shard for easy access. The cache is
initialized during boot from system.truncated table and updated on each
truncation operation.
Message-Id: <20200206163838.5220-2-gleb@scylladb.com>
This is not just a direct flip to a variable with the negated Boolean
value. When created, a large_data_handler is not considered to be
running, the user has to call start() before it can be used.
The advantaged of doing this is that if initialization fails and a
database is destructed before the large_data_handler is started, the
assert
database::stop() {
assert(!_large_data_handler->running());
is not triggered.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Keep local feature_service reference on database. This relaxes the
circular storage_service <-> database reference, but not removes it
completely.
This needs some args tossing in apply_to_builder, but it's
rather straightforward, so comes in the same patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Attempting to apply timed-out writes is a wasted effort. The coordinator
have already given up on the write and reported it as failed to the
client. Any cycles spent on this write is a waste at this point.
We currently only check the timeout if the write is blocked on memory,
otherwise, if the system is not under pressure, we will happily apply
timed out writes. If the system is under pressure we will make it worse
by wasting cycles on processing a timed out write.
Prevent this by checking the timeout as early as possible in
`database::apply()` and `database::apply_counter_update()`.
This patch doesn't solve all our problems related to timed out writes.
They can still sit and accumulate in various queues without expiring, a
prominent example being the smp queues. It is however a good first step
towards reducing wasted effort spent on them.
Refs: #5055
Ref #5251
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200129093007.550250-1-bdenes@scylladb.com>
The former was never really more than a reader_permit with one
additional method. Currently using it doesn't even save one from any
includes. Now that readers will be using reader_permit we would have to
pass down both to mutation_source. Instead get rid of
reader_resource_tracker and just use reader_permit. Instead of making it
a last and optional parameter that is easy to ignore, make it a
first class parameter, right after schema, to signify that permits are
now a prominent part of the reader API.
This -- mostly mechanical -- patch essentially refactors mutation_source
to ask for the reader_permit instead of reader_resource_tracking and
updates all usage sites.
The set make dependencies between mm and other services cleaner,
in particular, after the set:
- the query processor no longer needs migration manager
(which doesn't need query processor either)
- the database no longer needs migration manager, thus the mutual
dependency between these two is dropped, only migration manager
-> database is left
- the migration manager -> storage_service dependency is relaxed,
one more patchset will be needed to remove it, thus dropping one
more mutual dependency between them, only the storage_service
-> migration manager will be left
- the migration manager is stopped on drain, but several more
services need it on stop, thus causing use after free problems,
in particular there's a caught bug when view builder crashes
when unregistering from notifier list on stop. Fixed.
Tests: unit(dev)
Fixes: #5404
Merged pull request https://github.com/scylladb/scylla/pull/5294 from
Amnon Heiman:
To use a snapshot we need a schema file that is similar to the result of
running cql DESCRIBE command.
The DESCRIBE is implemented in the cql driver so the functionality needs
to be re-implemented inside scylla.
This series adds a describe method to the schema file and use it when doing
a snapshot.
There are different approach of how to handle materialize views and
secondary indexes.
This implementation creates each schema.cql file in its own relevant
directory, so the schema for materializing view, for example, will be
placed in the snapshot directory of the table of that view.
Fixes#4192