We should scan all sstables in the table directory and its
subdirectories to determine the highest sstable version and generation
before using it for creating new sstables (via reshard or reshape).
Otherwise, the generations of new sstables created when populating staging (via reshard or reshape) may collide with generations in the base directory, leading to https://github.com/scylladb/scylladb/issues/11789
Refs scylladb/scylladb#11789
Fixes scylladb/scylladb#11793
Closes #11795
* github.com:scylladb/scylladb:
distributed_loader: populate_column_family: reindent
distributed_loader: coroutinize populate_column_family
distributed_loader: table_population_metadata: start: reindent
distributed_loader: table_population_metadata: coroutinize start_subdir
distributed_loader: table_population_metadata: start_subdir: reindent
distributed_loader: pre-load all sstables metadata for table before populating it
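The scan-then-allocate idea above can be sketched roughly as follows. This is a minimal illustration with invented helper names, assuming the classic `<version>-<generation>-big-<component>` sstable naming; the real pre-load walks the directory tree in distributed_loader:

```cpp
#include <algorithm>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: extract the numeric generation from an sstable file
// name such as "mc-7-big-Data.db" ("<version>-<generation>-big-<component>").
int64_t parse_generation(const std::string& name) {
    std::istringstream in(name);
    std::string version, generation;
    std::getline(in, version, '-');
    std::getline(in, generation, '-');
    return std::stoll(generation);
}

// Given every sstable name found in the table directory *and* its
// subdirectories (staging/, upload/, ...), return a generation strictly
// above all of them, so sstables created by reshard/reshape cannot
// collide with ones already sitting in the base directory.
int64_t next_safe_generation(const std::vector<std::string>& all_sstables) {
    int64_t highest = 0;
    for (const auto& name : all_sstables) {
        highest = std::max(highest, parse_generation(name));
    }
    return highest + 1;
}
```

The key point is that the maximum is taken over the base directory and all subdirectories together, not per directory.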
Allow changing the total or initial resources the semaphore has. After calling `set_resources()`, the semaphore will look as if it had been created with the specified amount of resources.
Use the new method in `replica::database::revert_initial_system_read_concurrency_boost()` so it doesn't lead to strange semaphore diagnostics output. Currently the system semaphore has 90/100 count units when there are no reads against it, which has led to some confusion.
I also plan on using the new facility in enterprise.
Closes #11772
* github.com:scylladb/scylladb:
replica/database: revert initial boost to system semaphore with set_resources()
reader_concurrency_semaphore: add set_resources()
When moving an SSTable from staging to the base dir, we reused the generation
under the assumption that no SSTable in the base dir uses that same
generation. But that's not always true.
When reshaping the staging dir, reshape compaction can pick a generation
taken by an SSTable in the base dir. That's because the staging dir is
populated first and is not yet aware of the generations in the base dir.
When that happens, view building will fail to move the SSTable in staging
that shares a generation with another in the base dir.
We could have played with the order of population, populating the base dir
first and then the staging dir, but the fragility would remain; it wouldn't
be future proof at all.
We can easily make this safe by picking a new generation for the SSTable
being moved from staging, making sure no clash will ever happen.
Fixes #11789.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #11790
We should scan all sstables in the table directory and its
subdirectories to determine the highest sstable version and generation
before using it for creating new sstables (via reshard or reshape).
Fixes scylladb/scylladb#11793
Note: table_population_metadata::start_subdir is called
in a seastar thread to facilitate backporting to old versions
that do not support coroutines yet.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Unlike the current method (which uses consume()), this also adjusts the
initial resources, making the semaphore behave as if it had been created
with the reduced amount of resources in the first place. This fixes the confusing
90/100 count resources seen in diagnostics dump outputs.
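A toy model of the difference, not the real reader_concurrency_semaphore API, just a sketch of the two behaviors described above:

```cpp
#include <cstdint>

struct resources { int64_t count; int64_t memory; };

// Simplified semaphore model: diagnostics print "available/initial".
struct semaphore_model {
    resources initial;
    resources available;

    explicit semaphore_model(resources r) : initial(r), available(r) {}

    // Old approach: available shrinks, initial stays, so diagnostics
    // show the confusing "90/100" even with no reads in flight.
    void consume(resources r) {
        available.count -= r.count;
        available.memory -= r.memory;
    }

    // New approach: both change, as if the semaphore had been
    // constructed with r in the first place.
    void set_resources(resources r) {
        available.count += r.count - initial.count;
        available.memory += r.memory - initial.memory;
        initial = r;
    }
};
```

With consume(10) the model reports 90/100; with set_resources({90, ...}) it reports 90/90, matching the diagnostics fix described above.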
There's a virtual method on table_state to update the entry in the system
keyspace. It's overkill, existing to facilitate tests that don't want this.
With the new system_keyspace weak referencing, it can be made simpler by
moving the updating call into the compaction_manager itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a circular dependency between system_keyspace and database. The
former needs the latter because it needs to execute local requests via
query_processor. The latter needs the former via the compaction manager and
the large data handler: database depends on both, and these two need to
insert their entries into the system keyspace.
To cut this loop, the compaction manager and large data handler both get
a weak reference on the system keyspace. Once system keyspace starts, it
activates this reference via a database call. When system keyspace is
shut down on stop, it deactivates the reference.
Technically, the weak reference is implemented by marking the system_keyspace
object as async_sharded_service, and the "reference" in question is the
shared_from_this() pointer. When the compaction manager or large data
handler needs to update a system keyspace table, it holds an
extra reference on the system keyspace until the entry is committed,
thus making sure that system_keyspace doesn't stop from under its feet. At
the same time, unplugging the reference on shutdown makes sure that no
new entry updates will appear and the system_keyspace will eventually be
released.
It's not a classical C++ reference, because system_keyspace starts after
and stops before database.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
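The weak-reference scheme above can be sketched with standard smart pointers. The types below are illustrative stand-ins, not the actual Scylla classes (which use Seastar's async_sharded_service machinery):

```cpp
#include <memory>

// Stand-in for system_keyspace; publishes shared_from_this() on start.
struct system_keyspace_model
    : std::enable_shared_from_this<system_keyspace_model> {
};

// Stand-in for compaction manager / large data handler: holds only a
// weak reference, "plugged" when system_keyspace starts.
struct updater_model {
    std::weak_ptr<system_keyspace_model> sys_ks;

    // Returns false if system_keyspace has already been shut down:
    // no new entry updates appear after the reference is unplugged.
    bool update_entry() {
        auto ks = sys_ks.lock();   // extra reference held while committing
        if (!ks) {
            return false;
        }
        // ... commit the entry; ks keeps system_keyspace alive meanwhile ...
        return true;
    }
};
```

Dropping the last owning shared_ptr plays the role of shutdown: subsequent updates observe an expired reference and bail out.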
Yet another user of the global qctx object. Making the method(s) non-static requires pushing the system_keyspace all the way down to size_estimates_virtual_reader, plus a small update of cql_test_env.
Closes #11738
* github.com:scylladb/scylladb:
system_keyspace: Make get_{local|saved}_tokens non static
size_estimates_virtual_reader: Pass sys_ks argument to get_local_ranges()
cql_test_env: Keep sharded<system_keyspace> reference
size_estimate_virtual_reader: Keep system_keyspace reference
system_keyspace: Pass sys_ks argument to install_virtual_readers()
system_keyspace: Make make() non-static
distributed_loader: Pass sys_ks argument to init_system_keyspace()
system_keyspace: Remove dangling forward declaration
This series adds support for detecting collections that have too many items
and recording them in `system.large_cells`.
A configuration variable was added to db/config: `compaction_collection_items_count_warning_threshold` set by default to 10000.
Collections that have more items than this threshold will be warned about and recorded as a large cell in the `system.large_cells` table. Documentation has been updated accordingly.
A new column was added to system.large_cells: `collection_items`.
Similar to the `rows` column in system.large_partitions, `collection_items` holds the number of items in a collection when the large cell is a collection, or 0 if it isn't. Note that the collection may be recorded in system.large_cells either due to its size, like any other cell, and/or due to the number of items in it, if it crosses the said threshold.
Note that #11449 called for a new system.large_collections table, but extending system.large_cells follows the logic of system.large_partitions and is a smaller change overall, hence it was preferred.
Since the system keyspace schema is hard coded, the schema version of system.large_cells was bumped, and since the change is not backward compatible, we added a cluster feature - `LARGE_COLLECTION_DETECTION` - to enable using it.
The large_data_handler large cell detection record function will populate the new column only when the new cluster feature is enabled.
In addition, unit tests were added in sstable_3_x_test for testing large cell detection by cell size, and large collection detection by the number of items.
Closes #11449
Closes #11674
* github.com:scylladb/scylladb:
sstables: mx/writer: optimize large data stats members order
sstables: mx/writer: keep large data stats entry as members
db: large_data_handler: dynamically update config thresholds
utils/updateable_value: add transforming_value_updater
db/large_data_handler: cql_table_large_data_handler: record large_collections
db/large_data_handler: pass ref to feature_service to cql_table_large_data_handler
db/large_data_handler: cql_table_large_data_handler: move ctor out of line
docs: large-rows-large-cells-tables: fix typos
db/system_keyspace: add collection_elements column to system.large_cells
gms/feature_service: add large_collection_detection cluster feature
test: sstable_3_x_test: add test_sstable_too_many_collection_elements
test: lib: simple_schema: add support for optional collection column
test: lib: simple_schema: build schema in ctor body
test: lib: simple_schema: cql: define s1 as static only if built this way
db/large_data_handler: maybe_record_large_cells: consider collection_elements
db/large_data_handler: debug cql_table_large_data_handler::delete_large_data_entries
sstables: mx/writer: pass collection_elements to writer::maybe_record_large_cells
sstables: mx/writer: add large_data_type::elements_in_collection
db/large_data_handler: get the collection_elements_count_threshold
db/config: add compaction_collection_elements_count_warning_threshold
test: sstable_3_x_test: add test_sstable_write_large_cell
test: sstable_3_x_test: pass cell_threshold_bytes to large_data_handler
test: sstable_3_x_test: large_data_handler: prepare callback for testing large_cells
test: sstable_3_x_test: large_data tests: use BOOST_REQUIRE_[GL]T
test: sstable_3_x_test: test_sstable_log_too_many_rows: use tests::random
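The recording rule described in this series can be sketched as follows. Helper and parameter names are invented; the real logic lives in large_data_handler and the mx writer:

```cpp
#include <cstdint>

struct large_cell_entry {
    bool record;                // goes into system.large_cells?
    uint64_t collection_items;  // element count for collections, else 0
};

// A cell is recorded if its size exceeds the cell-size threshold, and/or,
// when it is a collection, if its element count crosses
// compaction_collection_items_count_warning_threshold (default 10000).
// The item count is only populated when the cluster feature is enabled.
large_cell_entry classify_cell(uint64_t cell_size, bool is_collection,
                               uint64_t elements,
                               uint64_t size_threshold,
                               uint64_t items_threshold,
                               bool feature_enabled) {
    bool too_big = cell_size > size_threshold;
    bool too_many = feature_enabled && is_collection
                    && elements > items_threshold;
    uint64_t items = (feature_enabled && is_collection) ? elements : 0;
    return { too_big || too_many, (too_big || too_many) ? items : 0 };
}
```

So a 100-byte map with 20000 entries is recorded with `collection_items = 20000`, while an oversized non-collection cell is recorded with `collection_items = 0`.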
What's contained in this series:
- Refactored compaction tests (and utilities) for integration with multiple groups
- The idea is to write a new class of tests that will stress multiple groups, whereas the existing ones will still stress a single group.
- Fixed a problem when cloning compound sstable set (cannot be triggered today so I didn't open a GH issue)
- Many changes in replica::table for allowing integration with multiple groups
Next:
- Introduce for_each_compaction_group() for iterating over groups wherever needed.
- Use for_each_compaction_group() in replica::table operations spanning all groups (API, readers, etc).
- Decouple backlog tracker from compaction strategy, to allow for backlog isolation across groups
- Introduce static option for defining number of compaction groups and implement function to map a token to its respective group.
- Testing infrastructure for multiple compaction groups (helpful when testing the dynamic behavior: i.e. merging / splitting).
Closes #11592
* github.com:scylladb/scylladb:
sstable_resharding_test: Switch to table_for_tests
replica: Move compacted_undeleted_sstables into compaction group
replica: Use correct compaction_group in try_flush_memtable_to_sstable()
replica: Make move_sstables_from_staging() robust and compaction group friendly
test: Rename column_family_for_tests to table_for_tests
sstable_compaction_test: Use column_family_for_tests::as_table_state() instead
test: Don't expose compound set in column_family_for_tests
test: Implement column_family_for_tests::table_state::is_auto_compaction_disabled_by_user()
sstable_compaction_test: Merge table_state_for_test into column_family_for_tests
sstable_compaction_test: use table_state_for_test itself in fully_expired_sstables()
sstable_compaction_test: Switch to table_state in compact_sstables()
sstable_compaction_test: Reduce boilerplate by switching to column_family_for_tests
The size-estimate-virtual-reader will need it; it's now available as
"this" from the system_keyspace::make() method.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This helper needs a system_keyspace reference, and using "this" for it
looks natural. Also, this de-static-ification makes it possible to put
some sense into the invoke_on_all() call from init_system_keyspace().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Its final destination is the virtual tables registration code, eventually
called from init_system_keyspace().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This series undoes some recent damage to clarity, then
goes further by renaming terms around dirty_memory_manager
to be clearer. Documentation is added.
Closes #11705
* github.com:scylladb/scylladb:
dirty_memory_manager: re-term "virtual dirty" to "unspooled dirty"
dirty_memory_manager: rename _virtual_region_group
api: column_family: fix memtable off-heap memory reporting
dirty_memory_manager: unscramble terminology
This is a continuation of a980510654; it tries to catch ENOSPCs reported via storage_io_error, similarly to how defer_verbose_shutdown() does on stop.
Closes #11664
* github.com:scylladb/scylladb:
table: Handle storage_io_error's ENOSPC when flushing
table: Rewrap retry loop
Compacted undeleted sstables are relevant for avoiding data resurrection
in the purge path. As the token ranges of groups won't overlap, it's
better to isolate this data, so as to prevent one group from interfering
with another.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
We need to pass the compaction_group received as a param, not the one
retrieved via as_table_state(). Needed for supporting multiple
groups.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Off-strategy compaction can happen in parallel with view building.
A semaphore is used to ensure they don't step on each other's
toes.
If off-strategy completes first, then move_sstables_from_staging()
won't find the SSTable alive and won't reach code to add
the file to the backlog tracker.
If view building completes first, the SSTable exists, but it's
not reshaped yet (has repair origin) and shouldn't be
added to the backlog tracker.
Off-strategy completion code will make sure new sstables added
to the main set are accounted for by the backlog tracker, so
move_sstables_from_staging() only needs to add to the tracker files
which are certainly not going through a reshape compaction.
So let's take these facts into account to make the procedure
more robust and compaction group friendly. Very welcome change
for when multiple groups are supported.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Make the various large data thresholds live-updateable,
and construct the observers and updaters in
cql_table_large_data_handler to dynamically update
the base large_data_handler class threshold members.
Fixes #11685
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
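The observer wiring can be sketched with a simplified stand-in for utils::updateable_value and the handler members (names below are invented for illustration):

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Toy live-updateable config value: observers fire on every change,
// and receive the current value immediately upon registration.
template <typename T>
struct updateable_value_model {
    T value;
    std::vector<std::function<void(const T&)>> observers;

    void set(T v) {                         // e.g. on config reload
        value = v;
        for (auto& o : observers) o(value);
    }
    void observe(std::function<void(const T&)> f) {
        f(value);                           // deliver current value first
        observers.push_back(std::move(f));
    }
};

// Stand-in for the base large_data_handler threshold member.
struct large_data_handler_model {
    uint64_t collection_elements_threshold = 0;
};
```

An observer constructed in the derived handler pushes each new config value into the base-class member, so the thresholds take effect without a restart.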
This fixes a regression introduced by 1e7a444, where table::get_sstable_set() isn't exposing all sstables, but rather only the ones in the main set. That causes users of the interface, such as get_sstables_by_partition_key() (used by the API to return the sstable name list which contains a particular key), to miss files in the maintenance set.
Fixes https://github.com/scylladb/scylladb/issues/11681.
Closes #11682
* github.com:scylladb/scylladb:
replica: Return all sstables in table::get_sstable_set()
sstables: Fix cloning of compound_sstable_set
get_sstable_set(), as its name implies, is not confined to the main
or maintenance set, nor to a specific compaction group, so let's
make it return the compound set which spans all groups, meaning
all sstables tracked by a table will be returned.
This is a regression introduced in 1e7a444. It affects the API
that returns the list of sstables containing a partition key:
sstables in the maintenance set would be missed, fooling users
of the API, like tools that trust the output.
Each compaction group is returning the main and maintenance set
in table_state's main_sstable_set() and maintenance_sstable_set(),
respectively.
Fixes #11681.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The "virtual dirty" term is not very informative. "Virtual" means
"not real", but it doesn't say in which way it isn't real.
In this case, virtual dirty refers to real dirty memory, minus
the portion of memtables that has been written to disk (but not
yet sealed - in that case it would not be dirty in the first
place).
I chose to call "the portion of memtables that has been written
to disk" as "spooled memory". At least the unique term will cause
people to look it up and may be easier to remember. From that
we have "unspooled memory".
I plan to further change the accounting to account for spooled memory
rather than unspooled, as that is a more natural term, but that is left
for later.
The documentation, config item, and metrics are adjusted. The config
item is practically unused so it isn't worth keeping compatibility here.
Before 95f31f37c1 ("Merge 'dirty_memory_manager: simplify
region_group' from Avi Kivity"), we had two region_group
objects, one _real_region_group and another _virtual_region_group,
each with a set of "soft" and "hard" limits and related functions
and members.
In 95f31f37c1, we merged _real_region_group into _virtual_region_group,
but unfortunately the _real_region_group members received the "hard"
prefix when they got merged. This overloads the meaning of "hard" -
is it related to soft/hard limit or is it related to the real/virtual
distinction?
This patch applies some renaming to restore consistency. Anything
that came from _virtual_region_group now has "virtual" in its name.
Anything that came from _real_region_group now has "real" in its name.
The terms are still pretty bad but at least they are consistent.
For recording collection_elements of large_collections when
the large_collection_detection feature is enabled.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
region_group evolved as a tree, each node of which contains some
regions (memtables). Each node has some constraints on memory, and
can start flushing and/or stop allocation into its memtables and those
below it when those constraints are violated.
Today, the tree has exactly two nodes, only one of which can hold memtables.
However, all the complexity of the tree remains.
This series applies some mechanical code transformations that remove
the tree structure and all the excess functionality, leaving a much simpler
structure behind.
Before:
- a tree of region_group objects
- each with two parameters: soft limit and hard limit
- but only two instances ever instantiated
After:
- a single region_group object
- with three parameters - two from the bottom instance, one from the top instance
Closes #11570
* github.com:scylladb/scylladb:
dirty_memory_manager: move third memory threshold parameter of region_group constructor to reclaim_config
dirty_memory_manager: simplify region_group::update()
dirty_memory_manager: fold region_group::notify_hard_pressure_relieved into its callers
dirty_memory_manager: clean up region_group::do_update_hard_and_check_relief()
dirty_memory_manager: make do_update_hard_and_check_relief() a member of region_group
dirty_memory_manager: remove accessors around region_group::_under_hard_pressure
dirty_memory_manager: merge memory_hard_limit into region_group
dirty_memory_manager: rename members in memory_hard_limit
dirty_memory_manager: fold do_update() into region_group::update()
dirty_memory_manager: simplify memory_hard_limit's do_update
dirty_memory_manager: drop soft limit / soft pressure members in memory_hard_limit
dirty_memory_manager: de-template do_update(region_group_or_memory_hard_limit)
dirty_memory_manager: adjust soft_limit threshold check
dirty_memory_manager: drop memory_hard_limit::_name
dirty_memory_manager: simplify memory_hard_limit configuration
dirty_memory_manager: fold region_group_reclaimer into {memory_hard_limit,region_group}
dirty_memory_manager: stop inheriting from region_group_reclaimer
dirty_memory_manager: test: unwrap region_group_reclaimer
dirty_memory_manager: change region_group_reclaimer configuration to a struct
dirty_memory_manager: convert region_group_reclaimer to callbacks
dirty_memory_manager: consolidate region_group_reclaimer constructors
dirty_memory_manager: rename {memory_hard_limit,region_group}::notify_relief
dirty_memory_manager: drop unused parameter to memory_hard_limit constructor
dirty_memory_manager: drop memory_hard_limit::shutdown()
dirty_memory_manager: split region_group hierarchy into separate classes
dirty_memory_manager: extract code block from region_group::update
dirty_memory_manager: move more allocation_queue functions out of region_group
dirty_memory_manager: move some allocation queue related function definitions outside class scope
dirty_memory_manager: move region_group::allocating_function and related classes to new class allocation_queue
dirty_memory_manager: remove support for multiple subgroups
The two classes always have a 1:1 or 0:1 relationship, and
so we can just move all the members of memory_hard_limit
into region_group, with the functions that track the relationship
(memory_hard_limit::{add,del}()) removed.
The 0:1 relationship is maintained by initializing the
hard limit parameter with std::numeric_limits<size_t>::max().
The _hard_total_memory variable must be greater than this parameter
for anything to happen, and with this default it never can be.
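The 0:1 sentinel trick can be sketched like this (an illustrative model, not the actual dirty_memory_manager code):

```cpp
#include <cstddef>
#include <limits>

// Merged region_group: when there is no memory_hard_limit, the hard
// limit defaults to SIZE_MAX, so "total > hard_limit" can never fire
// and the merged code behaves as if the hard-limit half didn't exist.
struct merged_region_group_model {
    size_t hard_limit = std::numeric_limits<size_t>::max();  // "no limit"
    size_t total = 0;

    bool under_hard_pressure() const { return total > hard_limit; }
};
```

Even maximal usage cannot exceed SIZE_MAX, so the default keeps the check permanently false; setting a real limit restores the original behavior.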
Commit a9805106 (table: seal_active_memtable: handle ENOSPC error)
made the memtable flushing code withstand ENOSPC and retry flushing
in the hope that the node administrator would provide some free space.
However, it looks like the IO code may report ENOSPC with an
exception type this code doesn't expect. This patch tries to fix that.
Refs: #11245
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The existing loop is very branchy in its attempts to find out whether or
not to abort. The "allowed_retries" count can be a good indicator of the
decision taken. This makes the code notably shorter and easier to extend.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
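The allowed_retries-driven loop can be sketched like this (an illustrative model, not the actual table flush code):

```cpp
#include <cerrno>
#include <functional>
#include <system_error>

// The remaining "allowed_retries" count is the single signal for whether
// an ENOSPC failure is retried or rethrown, replacing a branchy
// should-we-abort decision. Returns the number of attempts made.
int flush_with_retries(int allowed_retries,
                       const std::function<bool()>& try_flush) {
    int attempts = 0;
    for (;;) {
        ++attempts;
        if (try_flush()) {
            return attempts;               // flushed successfully
        }
        if (allowed_retries-- <= 0) {
            throw std::system_error(ENOSPC, std::generic_category());
        }
        // ... wait for the administrator to free some disk space ...
    }
}
```

All abort-or-retry reasoning collapses into the single decrement-and-compare on the counter.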
The logic to reject explicit snapshot of views/indexes
was improved in aa127a2dbb.
However, we never implemented auto-snapshot of
view/indexes when taking a snapshot of the base table.
This is implemented in this patch.
The implementation is built on top of
ba42852b0e
so it would be hard to backport to 5.1 or earlier
releases.
Fixes #11612
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When the querier reads a page with more tombstones than the `tombstone_warn_threshold` limit, a warning message is logged.
Setting `tombstone_warn_threshold: 0` disables the feature.
Refs scylladb#11410
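The check described above amounts to something like the following (invented helper name, a sketch of the behavior rather than the querier code):

```cpp
#include <cstdint>

// Warn after a page read if it accumulated more tombstones than
// tombstone_warn_threshold; a threshold of 0 disables the check.
bool should_warn_about_tombstones(uint64_t tombstones_in_page,
                                  uint64_t tombstone_warn_threshold) {
    if (tombstone_warn_threshold == 0) {
        return false;                      // feature disabled
    }
    return tombstones_in_page > tombstone_warn_threshold;
}
```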
We observe that memory_hard_limit's reclaim_config is only ever
initialized as default, or with just the hard_limit parameter.
Since soft_limit defaults to hard_limit, we can collapse the two
into a limit. The reclaim callbacks are always left as the default
no-op functions, so we can eliminate them too.
This fits with memory_hard_limit only being responsible for the hard
limit, and for it not having any memtables to reclaim on its own.
region_group_reclaimer is used to initialize (by reference) instances of
memory_hard_limit and region_group. Now that it is a final class, we
can fold it into its users by pasting its contents into those users,
and using the initializer (reclaim_config) to initialize the users. Note
there is a 1:1 relationship between a region_group_reclaimer instance
and a {memory_hard_limit,region_group} instance.
It may seem like code duplication to paste the contents of one class into
two, but the two classes use region_group_reclaimer differently, and most
of the code is just used to glue different classes together, so the
next patches will be able to get rid of much of it.
Some notes:
- no_reclaimer was replaced by a default reclaim_config, as that's how
no_reclaimer was initialized
- all members were added as private, except when a caller required one
to be public
- an under_pressure() member already existed, forwarding to the reclaimer;
this was just removed.
This inheritance makes it harder to get rid of the class. Since
there are no longer any virtual functions in the class (apart from
the destructor), we can just convert it to a data member. In a few
places, we need forwarding functions to make formerly-inherited functions
visible to outside callers.
The virtual destructor is removed and the class is marked final to
verify it is no longer a base class anywhere.
It's just so much nicer.
The "threshold" limit was renamed to "hard_limit" to contrast it with
"soft_limit" (in fact threshold is a good name for soft_limit, since
it's a point where the behavior begins to change, but that's too much
of a change).
region_group_reclaimer is partially policy (deciding when to reclaim)
and partially mechanism (implementing reclaim via virtual functions).
Move the mechanism to callbacks. This will make it easy to fold the
policy part into region_group and memory_hard_limit. This folding is
expected to simplify things since most of region_group_reclaimer is
cross-class communication.
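The policy/mechanism split can be sketched as follows, with illustrative names; the point is that the reclaim *mechanism* moves from virtual functions into plain callbacks carried by a config struct, while the *policy* (when to invoke them) stays with the group that owns the limits:

```cpp
#include <cstddef>
#include <functional>

// Mechanism: plain callbacks, defaulting to no-ops (like no_reclaimer).
struct reclaim_config_model {
    size_t hard_limit;
    std::function<void()> start_reclaiming = [] {};
    std::function<void()> stop_reclaiming  = [] {};
};

// Policy: the group decides *when* to reclaim, based on its own limits.
struct region_group_model {
    reclaim_config_model cfg;
    size_t total = 0;
    bool reclaiming = false;

    void add(size_t bytes) {
        total += bytes;
        if (!reclaiming && total > cfg.hard_limit) {  // policy lives here
            reclaiming = true;
            cfg.start_reclaiming();                   // mechanism is a callback
        }
    }
};
```

With no virtual dispatch left, the former reclaimer class can be folded straight into its users, as the series goes on to do.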
Introduces support to split large partitions during compaction. Today, compaction can only split input data at partition boundaries, so a large partition is stored in a single file. But that can cause many problems, like memory pressure (e.g.: https://github.com/scylladb/scylladb/issues/4217), and incremental compaction also cannot fulfill its promise, as the file storing the large partition can only be released once exhausted.
The first step was to add clustering range metadata for first and last partition keys (retrieved from promoted index), which is crucial to determine disjointness at clustering level, and also the order at which the disjoint files should be opened for incremental reading.
The second step was to extend sstable_run to look at clustering dimension, so a set of files storing disjoint ranges for the same partition can live in the same sstable run.
The final step was to introduce the option for compaction to split large partition being written if it has exceeded the size threshold.
What's next? Following this series, a reader will be implemented for sstable_run that will incrementally open the readers. It can be safely built on the assumption of the disjoint invariant after the second step aforementioned.
Closes #11233
* github.com:scylladb/scylladb:
test: Add test for large partition splitting on compaction
compaction: Add support to split large partitions
sstable: Extend sstable_run to allow disjointness on the clustering level
sstables: simplify will_introduce_overlapping()
test: move sstable_run_disjoint_invariant_test into sstable_datafile_test
test: lib: Fix inefficient merging of mutations in make_sstable_containing()
sstables: Keep track of first partition's first pos and last partition's last pos
sstables: Rename min/max position_range to a descriptive name
sstables_manager: Add sstable metadata reader concurrency semaphore
sstables: Add ability to find first or last position in a partition
Let's introduce a reader_concurrency_semaphore for reading sstable
metadata, to avoid an OOM due to unlimited concurrency.
The concurrency on startup is not controlled, so it's important
to enforce a limit on the amount of memory used by the parallel
readers.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
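A toy model of why admission control bounds memory here (not the real reader_concurrency_semaphore; just the idea that each metadata read must obtain memory units before starting):

```cpp
#include <cstdint>

// Each sstable-metadata read admits itself against a fixed memory budget,
// so unbounded startup parallelism turns into bounded concurrency
// instead of an OOM.
struct metadata_read_semaphore_model {
    int64_t available_memory;

    bool try_admit(int64_t estimated_memory) {
        if (estimated_memory > available_memory) {
            return false;                  // reader waits rather than allocate
        }
        available_memory -= estimated_memory;
        return true;
    }
    void release(int64_t memory) { available_memory += memory; }
};
```

In the real semaphore a rejected reader queues and is resumed when units are released; the model above only shows the admit/release accounting.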
The intention was for these logs to be printed during the
database shutdown sequence, but it was overlooked that it's not
the only place where commitlog::shutdown is called.
Commitlogs are started and shut down periodically by hinted handoff.
When that happens, these messages spam the log.
Fix that by adding INFO commitlog shutdown logs to database::stop,
and change the level of the commitlog::shutdown log call to DEBUG.
Fixes #11508
Closes #11536