We want to start tracking the memory consumption of mutation fragments.
For this we need schema and permit during construction, and on each
modification, so the memory consumption can be recalculated and pass to
the permit.
In this patch we just add the new parameters and go through the insane
churn of updating all call sites. They will be used in the next patch.
Not used yet, this patch does all the churn of propagating a permit
to each impl.
In the next patch we will use it to track to track the memory
consumption of `_buffer`.
sstables_manager will soon be closed asynhronously, with a future-returning
close() function. To prepare for that, make the following changes
- replace on-stack test_env with test_env::do_with()
- use the variant of column_family_for_tests that accepts an sstables_manager
- replace test_sstables_manager with an sstables_manager obtained from test_env
These changes allow lifetime management of the sstables_manager used
in the tests to be centralized in test_env.
Since test_env now calls await_background_jobs on termination, those
calls are dropped.
The pattern
return function_returning_a_future().get();
is legal, but confusing. It returns an unexpected std::tuple<>. Here,
it doesn't do any harm, but if we try to coerce the surrounding code
into a signature (void ()), then that will fail.
Remove the unneeded and unexpected return.
Stopping a table will soon close its sstables; so the next check will fail
as the number of sstables for the table will be zero.
Reorder the stop() call to make it safe.
We don't need the stop() for the check, since the previous loop made sure
compactions completed.
The database is available at both places that create the options --
tests and API perform_cleanup call.
Options object doesn't over-survive the returned future, so it's
safe to keep the reference on it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
After 8014c7124, cleanup can potentially pick a compacting SSTable.
Upgrade and scrub can also pick a compacting SSTable.
The problem is that table::candidates_for_compaction() was badly named.
It misleads the user into thinking that the SSTables returned are perfect
candidates for compaction, but manager still need to filter out the
compacting SSTables from the returned set. So it's being renamed.
When the same SSTable is compacted in parallel, the strategy invariant
can be broken like overlapping being introduced in LCS, and also
some deletion failures as more than one compaction process would try
to delete the same files.
Let's fix scrub, cleanup and ugprade by calling the manager function
which gets the correct candidates for compaction.
Fixes#6938.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200811200135.25421-1-raphaelsc@scylladb.com>
C++20 introduced `contains` member functions for maps and sets for
checking whether an element is present in the collection. Previously
`count` function was often used in various ways.
`contains` does not only express the intend of the code better but also
does it in more unified way.
This commit replaces all the occurences of the `count` with the
`contains`.
Tests: unit(dev)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <b4ef3b4bc24f49abe04a2aba0ddd946009c9fcb2.1597314640.git.piotr@scylladb.com>
"
This series adds support for the "md" sstable format.
Support is based on the following:
* do not use clustering based filtering in the presence
of static row, tombstones.
* Disabling min/max column names in the metadata for
formats older than "md".
* When updating the metadata, reset and disable min/max
in the presence of range tombstones (like Cassandra does
and until we process them accurately).
* Fix the way we maintain min/max column names by:
keeping whole clustering key prefixes as min/max
rather than calculating min/max independently for
each component, like Cassandra does in the "md" format.
Fixes#4442
Tests: unit(dev), cql_query_test -t test_clustering_filtering* (debug)
md migration_test dtest from git@github.com:bhalevy/scylla-dtest.git migration_test-md-v1
"
* tag 'md-format-v4' of github.com:bhalevy/scylla: (27 commits)
config: enable_sstables_md_format by default
test: cql_query_test: add test_clustering_filtering unit tests
table: filter_sstable_for_reader: allow clustering filtering md-format sstables
table: create_single_key_sstable_reader: emit partition_start/end for empty filtered results
table: filter_sstable_for_reader: adjust to md-format
table: filter_sstable_for_reader: include non-scylla sstables with tombstones
table: filter_sstable_for_reader: do not filter if static column is requested
table: filter_sstable_for_reader: refactor clustering filtering conditional expression
features: add MD_SSTABLE_FORMAT cluster feature
config: add enable_sstables_md_format
database: add set_format_by_config
test: sstable_3_x_test: test both mc and md versions
test: Add support for the "md" format
sstables: mx/writer: use version from sstable for write calls
sstables: mx/writer: update_min_max_components for partition tombstone
sstables: metadata_collector: support min_max_components for range tombstones
sstable: validate_min_max_metadata: drop outdated logic
sstables: rename mc folder to mx
sstables: may_contain_rows: always true for old formats
sstables: add may_contain_rows
...
C++20 introduced `contains` member functions for maps and sets for
checking whether an element is present in the collection. Previously
the code pattern looked like:
<collection>.find(<element>) != <collection>.end()
In C++20 the same can be expressed with:
<collection>.contains(<element>)
This is not only more concise but also expresses the intend of the code
more clearly.
This commit replaces all the occurences of the old pattern with the new
approach.
Tests: unit(dev)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <f001bbc356224f0c38f06ee2a90fb60a6e8e1980.1597132302.git.piotr@scylladb.com>
Partition tombstones represent an implicit clustering range
that is unbound on both sides, so reflect than in min/max
column names metadata using empty clustering key prefixes.
If we don't do that, when using the sstable for filtering, we have no
other way of distinguishing range tombstones from partition tombstones
given the sstable metadata and we would need to include any sstable
with tombstones, even if those are range tombstone, for which
we can do a better filtering job, using the sstable min/max
column names metadata.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We essentially treat min/max column names as range bounds
with min as incl_start and max as incl_end.
By generating a bound_view for min/max column names on the fly,
we can correctly track and compare also short clustering
key prefixes that may be used as bounds for range tombstones.
Extend the sstable_tombstone_metadata_check unit test
to cover these cases.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add the sstable_version_types::md enum value
and logically extend sstable_version_types comparisons to cover
also the > sstable_version_types::mc cases.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently we compare each min/max component independently.
This may lead to suboptimal, inclusive clustering ranges
that do not indicate any actual key we encountered.
For example: ['a', 2], ['b', 1] will lead to min=['a', 1], max=['b', 2]
instead of the keys themselves.
This change keeps the min or max keys as a whole.
It considers shorter clustering prefixes (that are possible with compact
storage) as range tombstone bounds, so that a shorter key is considered
less than the minimum if the latter has a common prefix, and greater
than the maximum if the latter has a common prefix.
Extend the min_max_clustering_key_test to test for this case.
Previously {"a", "2"}, {"b", "1"} clustering keys would erronuously
end up with min={"a", "1"} max={"b", "2"} while we want them to be
min={"a", "2"} max={"b", "1"}.
Adjust sstable_3_x_test to ignore original mc sstables that were
previously computed with different min/max column names.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
"
While working on another patch I was getting odd compiler errors
saying that a call to ::make_shared was ambiguous. The reason was that
seastar has both:
template <typename T, typename... A>
shared_ptr<T> make_shared(A&&... a);
template <typename T>
shared_ptr<T> make_shared(T&& a);
The second variant doesn't exist in std::make_shared.
This series drops the dependency in scylla, so that a future change
can make seastar::make_shared a bit more like std::make_shared.
"
* 'espindola/make_shared' of https://github.com/espindola/scylla:
Everywhere: Explicitly instantiate make_lw_shared
Everywhere: Add a make_shared_schema helper
Everywhere: Explicitly instantiate make_shared
cql3: Add a create_multi_column_relation helper
main: Return a shared_ptr from defer_verbose_shutdown
In some cases estimated number of partitions can be 0, which is albeit a
legit estimation result, breaks many low-level sstable writer code, so
some of these have assertions to ensure estimated partitions is > 0.
To avoid hitting this assert all users of the sstable writers do the
clamping, to ensure estimated partitions is at least 1. However leaving
this to the callers is error prone as #6913 has shown it. As this
clamping is standard practice, it is better to do it in the writers
themselves, avoiding this problem altogether. This is exactly what this
patch does. It also adds two unit tests, one that reproduces the crash
in #6913, and another one that ensures all sstable writers are fine with
estimated partitions being 0 now. Call sites previously doing the
clamping are changed to not do it, it is unnecessary now as the writer
does it itself.
Fixes#6913
Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200724120227.267184-1-bdenes@scylladb.com>
seastar::make_lw_shared has a constructor taking a T&&. There is no
such constructor in std::make_shared:
https://en.cppreference.com/w/cpp/memory/shared_ptr/make_shared
This means that we have to move from
make_lw_shared(T(...)
to
make_lw_shared<T>(...)
If we don't want to depend on the idiosyncrasies of
seastar::make_lw_shared.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This replaces a lot of make_lw_shared(schema(...)) with
make_shared_schema(...).
This makes it easier to drop a dependency on the differences between
seastar::make_shared and std::make_shared.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
needs_cleanup() returns true if a sstable needs cleanup.
Turns out it's very slow because it iterates through all the local
ranges for all sstables in the set, making its complexity:
O(num_sstables * local_ranges)
We can optimize it by taking into account that abstract_replication_strategy
documents that get_ranges() will return a list of ranges that is sorted
and non-overlapping. Compaction for cleanup already takes advantage of that
when checking if a given partition can be actually purged.
So needs_cleanup() can be optimized into O(num_sstables * log(local_ranges)).
With num_sstables=1000, RF=3, then local_ranges=256(num_tokens)*3, it means
the max # of checks performed will go from 768000 to ~9584.
Fixes#6730.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200629171355.45118-2-raphaelsc@scylladb.com>
For tombstone expiration to proceed correctly without the risk of resurrecting
data, the sstable set must be present.
Regular compaction and derivatives provide the sstable set, so they're able
to expire tombstones with no resurrection risk.
Resharding, on the other hand, can run on any shard, not necessarily on the
same shard that one of the input sstables belongs to, so it currently cannot
provide a sstable set for tombstone expiration to proceed safely.
That being said, let's only do expiration based on the presence of the set.
This makes room for the sstable set to be feeded to compaction via descriptor,
allowing even resharding to do expiration. Currently, compaction thinks that
sstable set can only come from the table, and that also needs to be changed
for further flexibility.
It's theoretically possible that a given resharding job will resurrect data if
a fully expired SSTable is resharded at a shard which it doesn't belong to.
Resharding will have no way to tell that expiring all that data will lead to
resurrection because the relevant SSTables are at different shards.
This is fixed by checking for fully expired sstables only on presence of
the sstable set.
Fixes#6600.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200605200954.24696-1-raphaelsc@scylladb.com>
New SStables are only added to backlog tracker if set_unshared() was
called on their behalf. SStables created for streaming are not being
added to the tracker because make_streaming_sstable_for_write()
doesn't call set_unshared() nor does it caller. Which results in backlog
not accounting for their existence, which means backlog will be much
lower than expected.
This problem could be fixed by adding a set_unshared() call but it
turns out we don't even need set_unshared() anymore. It was introduced
when Scylla metadata didn't exist, now a SSTable has built-in knowledge
of whether or not it's shared. Relying on every SSTable creator calling
set_unshared() is bug prone. Let's get rid of it and let the SStable
itself say whether or not it's shared. If an imported SSTable has not
Scylla metadata, Scylla will still be able to compute shards using
token range metadata.
Refs #6021.
Refs #6227.
Fixes#6441.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200512220226.134481-1-raphaelsc@scylladb.com>
"
Currently we classify queries as "system" or "user" based on the table
they target. The class of a query determines how the query is treated,
currently: timeout, limits for reverse queries and the concurrency
semaphore. The catch is that users are also allowed to query system
tables and when doing so they will bypass the limits intended for user
queries. This has caused performance problems in the past, yet the
reason we decided to finally address this is that we want to introduce a
memory limit for unpaged queries. Internal (system) queries are all
unpaged and we don't want to impose the same limit on them.
This series uses scheduling groups to distinguish user and system
workloads, based on the assumption that user workloads will run in the
statement scheduling group, while system workloads will run in the main
(or default) scheduling group, or perhaps something else, but in any
case not in the statement one. Currently the scheduling group of reads
and writes is lost when going through the messaging service, so to be
able to use scheduling groups to distinguish user and system reads this
series refactors the messaging service to retain this distinction across
verb calls. Furthermore, we execute some system reads/writes as part of
user reads/writes, such as auth and schema sync. These processes are
tagged to run in the main group.
This series also centralises query classification on the replica and
moves it to a higher level. More specifically, queries are now
classified -- the scheduling group they run in is translated to the
appropriate query class specific configuration -- on the database level
and the configuration is propagated down to the lower layers.
Currently this query class specific configuration consists of the reader
concurrency semaphore and the max memory limit for otherwise unlimited
queries. A corollary of the semaphore begin selected on the database
level is that the read permit is now created before the read starts. A
valid permit is now available during all stages of the read, enabling
tracking the memory consumption of e.g. the memtable and cache readers.
This change aligns nicely with the needs of more accurate reader memory
tracking, which also wants a valid permit that is available in every layer.
The series can be divided roughly into the following distinct patch
groups:
* 01-02: Give system read concurrency a boost during startup.
* 03-06: Introduce user/system statement isolation to messaging service.
* 07-13: Various infrastructure changes to prepare for using read
permits in all stages of reads.
* 14-19: Propagate the semaphore and the permit from database to the
various table methods that currently create the permit.
* 20-23: Migrate away from using the reader concurrency semaphore for
waiting for admission, use the permit instead.
* 24: Introduce `database::make_query_config()` and switch the database
methods needing such a config to use it.
* 25-31: Get rid of all uses of `no_reader_permit()`.
* 32-33: Ban empty permits for good.
* 34: querier_cache: use the queriers' permits to obtain the semaphore.
Fixes: #5919
Tests: unit(dev, release, debug),
dtest(bootstrap_test.py:TestBootstrap.start_stop_test_node), manual
testing with a 2 node mixed cluster with extra logging.
"
* 'query-class/v6' of https://github.com/denesb/scylla: (34 commits)
querier_cache: get semaphore from querier
reader_permit: forbid empty permits
reader_permit: fix reader_resources::operator bool
treewide: remove all uses of no_reader_permit()
database: make_multishard_streaming_reader: pass valid permit to multi range reader
sstables: pass valid permits to all internal reads
compaction: pass a valid permit to sstable reads
database: add compaction read concurrency semaphore
view: use valid permits for reads from the base table
database: use valid permit for counter read-before-write
database: introduce make_query_class_config()
reader_concurrency_semaphore: remove wait_admission and consume_resources()
test: move away from reader_concurrency_semaphore::wait_admission()
reader_permit: resource_units: introduce add()
mutation_reader: restricted_reader: work in terms of reader_permit
row_cache: pass a valid permit to underlying read
memtable: pass a valid permit to the delegate reader
table: require a valid permit to be passed to most read methods
multishard_mutation_query: pass a valid permit to shard mutation sources
querier: add reader_permit parameter and forward it to the mutation_source
...
GC writer, used for incremental compaction, cannot be currently used if interposer
consumer is used. That's because compaction assumes that GC writer will be operated
only by a single compaction writer at a given point in time.
With interposer consumer, multiple writers will concurrently operate on the same
GC writer, leading to race condition which potentially result in use-after-free.
Let's disable GC writer if interposer consumer is enabled. We're not losing anything
because GC writer is currently only needed on strategies which don't implement an
interposer consumer. Resharding will always disable GC writer, which is the expected
behavior because it doesn't support incremental compaction yet.
The proper fix, which allows GC writer and interposer consumer to work together,
will require more time to implement and test, and for that reason, I am postponing
it as #6472 is a showstopper for the current release.
Fixes#6472.
tests: mode(dev).
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Reviewed-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20200526195428.230472-1-raphaelsc@scylladb.com>
And use it to obtain any query-class specific configuration that was
obtained from `table::config` before, such as the read concurrency
semaphore and the max memory limit for unlimited queries. As all users
of these items get these from the query class config now, we can remove
them from `table::config`.
Now that the most prevalent users (range scan and single partition
reads) all pass valid permits we require all users to do so and
propagate the permit down towards `make_sstable_reader()`. The plan is
to use this permit for restricting the sstable readers, instead of the
semaphore the table is configured with. The various
`make_streaming_*reader()` overloads keep using the internal semaphores
as but they also create the permit before the read starts and pass it to
`make_sstable_reader()`.
We are having many issues with the stop code in the compaction_manager.
Part of the reason is that the "stopped" state has its meaning overloaded
to indicate both "compaction manager is not accepting compactions" and
"compaction manager is not ready or destructed".
In a later step we could default to enabled-at-start, but right now we
maintain current behavior to minimize noise.
It is only possible to stop the compaction manager once.
It is possible to enable / disable the compaction manager many times.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We shouldn't assume the I/O priority class for compactions. For
instance, if we are dealing with offstrategy compactions we may want to
use the maintenance group priority for them.
For now, all compactions are put in the compaction class. rewrite
compactions (scrub, cleanup) could be maintenance, but we don't have
clear access to the database object at this time to derive the
equivalent CPU priority. This is planned to be changed in the future,
and when we do change it, we'll adjust.
Same goes for resharding: while we could at this point change it we'd
risking memory pressure since resharding is run online and sstables are
shared until resharding is done. When we move it to offline execution
we'll do it with maintenance priority.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20200512002233.306538-3-glauber@scylladb.com>
The patch implements:
- /storage_service/auto_compaction API endpoint
- /column_family/autocompaction/{name} API endpoint
Those APIs allow to control and request the status of background
compaction jobs for the existing tables.
The implementation introduces the table::_compaction_disabled_by_user.
Then the CompactionManager checks if it can push the background
compaction job for the corresponding table.
New members
===
table::enable_auto_compaction();
table::disable_auto_compaction();
bool table::is_auto_compaction_disabled_by_user() const
Test
===
Tests: unit(sstable_datafile_test autocompaction_control_test), manual
$ ninja build/dev/test/boost/sstable_datafile_test
$ ./build/dev/test/boost/sstable_datafile_test --run_test=autocompaction_control_test -- -c1 -m2G --overprovisioned --unsafe-bypass-fsync 1 --blocked-reactor-notify-ms 2000000
The test tries to submit a compaction job after playing
with autocompaction control table switch. However, there is
no reliable way to hook pending compaction task. The code
assumed that with_scheduling_group() closure will never
preempt execution of the stats check.
Revert
===
Reverts commit c8247ac. In previous version the execution
sometimes resulted into the following error:
test/boost/sstable_datafile_test.cc(1076): fatal error: in "autocompaction_control_test":
critical check cm->get_stats().pending_tasks == 1 || cm->get_stats().active_tasks == 1 has failed
This version adds a few sstables to the cf, starts
the compaction and awaits until it is finished.
API change
===
- `/column_family/autocompaction/` always returned `true` while answering to the question: if the autocompaction disabled (see https://github.com/scylladb/scylla-jmx/blob/master/src/main/java/org/apache/cassandra/db/ColumnFamilyStore.java#L321). now it answers to the question: if the autocompaction for specific table is enabled. The question logic is inverted. The patch to the JMX is required. However, the change is decent because all old values were invalid (it always reported all compactions are disabled).
- `/column_family/autocompaction/` got support for POST/DELETE per table
Fixes
===
Fixes#1488Fixes#1808Fixes#440
Signed-off-by: Ivan Prisyazhnyy <ivan@scylladb.com>
Reviewed-by: Glauber Costa <glauber@scylladb.com>
Garbage collected SSTable is incorrectly added to SSTable set with a function
that invalidates row cache. This problem is fixed by adding GC SStable
to set using mechanism which replaces old sstables with new sstables.
Also, adding GC SSTable to set in a separate call is not correct.
We should make sure that GC SSTable reaches the SSTable set at the same time
its respective old (input) SSTable is removed from the set, and that's done
using a single request call to table.
Fixes#5956.
Fixes#6275.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
input_sstables is renamed to old_sstables and is about old SSTables that should be
deleted and removed from the SSTable set.
output_sstables is renamed to new_sstables and is about new SSTable that should be
added to the SSTable set, replacing the old ones.
This will allow us, for example, to add auxiliary SSTables to SSTable set using
the same call which replaces output SSTables by input SSTables in compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The compaction_manager test lives inside a thread and it is not taking
advantage of it, with continuations all over.
One of the side effects of it is that the test is calling stop() twice
on the compaction_manager. While this works today, it is not good
practice. A change I am making is just about to break it.
This patch converts the test to fully use .get() instead of chained
continuations and in doing so also guarantees that the compaction
manager will be RAII-stopped just one, from a defer object.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20200503161420.8346-2-glauber@scylladb.com>
This reverts commit 1c444b7e1e. The test
it adds sometimes fails as follows:
test/boost/sstable_datafile_test.cc(1076): fatal error: in "autocompaction_control_test":
critical check cm->get_stats().pending_tasks == 1 || cm->get_stats().active_tasks == 1 has failed
Ivan is working on a fix, but let's revert this commit to avoid blocking
next promotion failing from time to time.
This patch adds API endpoint /column_family/autocompaction/{name}
that listen to GET and POST requests to pick and control table
background compactions.
To implement that the patch introduces "_compaction_disabled_by_user"
flag that affects if CompactionManager is allowed to push background
compactions jobs into the work.
It introduces
table::enable_auto_compaction();
table::disable_auto_compaction();
bool table::is_auto_compaction_disabled_by_user() const
to control auto compaction state.
Fixes#1488Fixes#1808Fixes#440
Tests: unit(sstable_datafile_test autocompaction_control_test), manual
"
This patch makes makes major compaction aware of time buckets
for TWCS. That means that calling a major compaction with TWCS
will not bundle all SSTables together, but rather split them
based on their timestamps.
There are two motivations for this work:
Telling users not to ever major compact is easier said than
done: in practice due to a variety of circumstances it might
end up being done in which case data will have a hard time
expiring later.
We are about to start working with offstrategy compactions,
which are compactions that work in parallel with the main
compactions. In those cases we may be converting SSTables from
one format to another and it might be necessary to split a single
big STCS SSTable into something that TWCS expects
In order to achieve that, we start by changing the way resharding works:
it will now work with a read interposer, similar to the one TWCS uses for
streaming data. Once we do that, a lot of assumptions that exist in the
compaction code can be simplified and supporting TWCS major
compactions become a matter of simply enabling its interposer in the
compaction code as well.
There are many further simplifications that this work exposes:
The compaction method create_new_sstable seems out of place. It is not
used by resharding, and it seems duplicated for normal compactions. We
could clean it up with more refactoring in a later patch.
The whole logic of the feed_writer could be part of the consumer code.
Testing details:
scylla unit tests (dev, release)
sstable_datafile_test (debug)
dtests (resharding_test.py)
manual scylla resharding
Fixes#1431
"
Reviewed-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
* 'twcs-major-v3' of github.com:glommer/scylla:
compaction: make major compaction time-aware with TWCS
compaction: do resharding through an interposer
mutation_writer: introduce shard_based splitting writer
mutation_writer: factor out part of the code for the timestamp splitter
compaction: abort if create_new_sstable is called from resharding
This removes the need to include reactor.hh, a source of compile
time bloat.
In some places, the call is qualified with seastar:: in order
to resolve ambiguities with a local name.
Includes are adjusted to make everything compile. We end up
having 14 translation units including reactor.hh, primarily for
deprecated things like reactor::at_exit().
Ref #1
This patch makes makes major compaction aware of time buckets
for TWCS. That means that calling a major compaction with TWCS
will not bundle all SSTables together, but rather split them
based on their timestamps.
There are two motivations for this work:
1. Telling users not to ever major compact is easier said than
done: in practice due to a variety of circumstances it might
end up being done in which case data will have a hard time
expiring later.
2. We are about to start working with offstrategy compactions,
which are compactions that work in parallel with the main
compactions. In those cases we may be converting SSTables from
one format to another and it might be necessary to split a single
big STCS SSTable into something that TWCS expects
With the motivation out of the way, let's talk about the implementation:
The implementation is quite simple and builds upon the previous patches.
It simply specializes the interposer implementation for regular compaction
with a table-specific interposer.
Fixes#1431
Signed-off-by: Glauber Costa <glauber@scylladb.com>
"
This patchseries is part of my effort to make resharding less special -
and hopefully less problematic. The next steps are a bit heavy, so I'd
like to, if possible, get this out of the way.
After these two patches, there is no more need to ever call
reshard_sstables: compact_sstables will do, and it will be able to
recognize resharding compactions.
To do that we need to unify the creator function, which is trivially
done by adding a shard parameter to regular compactions as well: they
can just ignore it. I have considered just making the
compaction_descriptor have a virtual create() function and specializing
it, but because we have to store the creator in the compaction object I
decided to keep the virtual function for now.
In a later cleanup step, if we can for instance store the entire
compaction_descriptor object in the compaction object we could do that.
Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Reviewed-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Reviewed-by: Botond Dénes <bdenes@scylladb.com>
Tests: unit tests (dev), dtest (resharding.py)
"
* 'resharding-through-compact-sstables' of github.com:glommer/scylla:
resharding: get rid of special reshard_sstables
compaction: enhance compaction_descriptor with creator and replace function
There are many differences between resharding and compaction that are
artificial, arising more from the way we ended up implementing it than
necessity. This patch attempts to pass the creator and replacer functions
through the compaction_descriptor.
There is a difference between the creator function for resharding and
regular compaction: resharding has to pass the shard number on behalf
of which the SSTable is created. However regular compactions can just
ignore this. No need to have a special path just for this.
After this is done, the constructor for the compaction object can be
greatly simplified. In further patches I intend to simplify it a bit
further, but some more cleanup has to happen first.
To make that happen we have to construct a compaction_descriptor object
inside the resharding function. This is temporary: resharding currently
works with a descriptor, but at some point that descriptor is lost and
broken into pieces to be passed to this function. The overarching goal
of this work is exactly to be able to keep that descriptor for as long
as possible, which should simplify things a lot.
Callers are patched, but there are plenty for sstable_datafile_test.cc.
For their benefit, a helper function is provided to keep the previous
signature (test only).
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Previously schema::get_sharding_info was obtaining
sharding_info from the partitioner but we want to remove
sharding_info from the partitioner so we need a place
in schema to store it there instead.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
We now have a utils file for SSTables. This is potentially useful for
other tests.
As a matter of fact, this function is repeated right now for the
resharding test. And to add insult to injury, the version in the
resharding test has the parameters shard and number of tokens flipped,
which although extremely confusing is the predictable outcome of
such repetition
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Remove last usage of this global outside i_partitioner.cc
and hide it inside the compilation unit.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>