This series is a step towards non-LRU cache algorithms.
Our cache items are able to unlink themselves from the LRU list. (In other words, they can be unlinked solely via a pointer to the item, without access to the containing list head). Some places in the code make use of that, e.g. by relying on auto-unlink of items in their destructor.
However, to implement algorithms smarter than LRU, we might want to update some cache-wide metadata on item removal. But any cache-wide structures are unreachable through an item pointer, since items only have access to themselves and their immediate neighbours. Therefore, we don't want items to unlink themselves — we want `cache.remove(item)`, rather than `item.remove_self()`, because the former can update the metadata in `cache`.
This series inserts explicit item unlink calls in places that were previously relying on destructors, gets rid of other self-unlinks, and adds an assert which ensures that every item is explicitly unlinked before destruction.
Closes#11716
* github.com:scylladb/scylladb:
utils: lru: assert that evictables are unlinked before destruction
utils: lru: remove unlink_from_lru()
cache: make all cache unlinks explicit
unlink_from_lru() allows for unlinking elements from cache without notifying
the cache. This messes up any potential cache bookkeeping.
Improved that by replacing all uses of unlink_from_lru() with calls to
lru::remove(), which does have access to cache's metadata.
Our LSA cache is implemented as an auto_unlink Boost intrusive list, meaning
that elements of the list unlink themselves from the list automatically on
destruction. Some parts of the code rely on that, and don't unlink them
manually.
However, this precludes accurate bookkeeping about the cache. Elements only have
access to themselves and their neighbours, not to any bookkeeping context.
Therefore, a destructor cannot update the relevant metadata.
In this patch, we fix this by adding explicit unlink calls to places where it
would be done by a destructor. In a following patch, we will add an assert to
the destructor to check that every element is unlinked before destruction.
This series adds support for detecting collections that have too many items
and recording them in `system.large_cells`.
A configuration variable was added to db/config: `compaction_collection_items_count_warning_threshold` set by default to 10000.
Collections that have more items than this threshold will be warned about and will be recorded as a large cell in the `system.large_cells` table. Documentation has been updated respectively.
A new column was added to system.large_cells: `collection_items`.
Similar to the `rows` column in system.large_partition, `collection_items` holds the number of items in a collection when the large cell is a collection, or 0 if it isn't. Note that the collection may be recorded in system.large_cells either due to its size, like any other cell, and/or due to the number of items in it, if it cross the said threshold.
Note that #11449 called for a new system.large_collections table, but extending system.large_cells follows the logic of system.large_partitions is a smaller change overall, hence it was preferred.
Since the system keyspace schema is hard coded, the schema version of system.large_cells was bumped, and since the change is not backward compatible, we added a cluster feature - `LARGE_COLLECTION_DETECTION` - to enable using it.
The large_data_handler large cell detection record function will populate the new column only when the new cluster feature is enabled.
In addition, unit tests were added in sstable_3_x_test for testing large cells detection by cell size, and large_collection detection by the number of items.
Closes#11449Closes#11674
* github.com:scylladb/scylladb:
sstables: mx/writer: optimize large data stats members order
sstables: mx/writer: keep large data stats entry as members
db: large_data_handler: dynamically update config thresholds
utils/updateable_value: add transforming_value_updater
db/large_data_handler: cql_table_large_data_handler: record large_collections
db/large_data_handler: pass ref to feature_service to cql_table_large_data_handler
db/large_data_handler: cql_table_large_data_handler: move ctor out of line
docs: large-rows-large-cells-tables: fix typos
db/system_keyspace: add collection_elements column to system.large_cells
gms/feature_service: add large_collection_detection cluster feature
test: sstable_3_x_test: add test_sstable_too_many_collection_elements
test: lib: simple_schema: add support for optional collection column
test: lib: simple_schema: build schema in ctor body
test: lib: simple_schema: cql: define s1 as static only if built this way
db/large_data_handler: maybe_record_large_cells: consider collection_elements
db/large_data_handler: debug cql_table_large_data_handler::delete_large_data_entries
sstables: mx/writer: pass collection_elements to writer::maybe_record_large_cells
sstables: mx/writer: add large_data_type::elements_in_collection
db/large_data_handler: get the collection_elements_count_threshold
db/config: add compaction_collection_elements_count_warning_threshold
test: sstable_3_x_test: add test_sstable_write_large_cell
test: sstable_3_x_test: pass cell_threshold_bytes to large_data_handler
test: sstable_3_x_test: large_data_handler: prepare callback for testing large_cells
test: sstable_3_x_test: large_data tests: use BOOST_REQUIRE_[GL]T
test: sstable_3_x_test: test_sstable_log_too_many_rows: use tests::random
Since `_partition_size_entry` and `_rows_in_partition_entry`
are accessed at the same time when updated, and similarly
`_cell_size_entry` and `_elements_in_collection_entry`,
place the member pairs closely together to improve data
cache locality.
Follow the same order when preparing the
`scylla_metadata::large_data_stats` map.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To save the map lookup on the hot write path,
keep each large data stats entry as a member in the writer
object and build a map for storing the disk_hash in the
scylla metadata only when finalizing it in consume_end_of_stream.
Fixes#11686
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This fixes a regression introduced by 1e7a444, where table::get_sstable_set() isn't exposing all sstables, but rather only the ones in the main set. That causes user of the interface, such as get_sstables_by_partition_key() (used by API to return sstable name list which contains a particular key), to miss files in the maintenance set.
Fixes https://github.com/scylladb/scylladb/issues/11681.
Closes#11682
* github.com:scylladb/scylladb:
replica: Return all sstables in table::get_sstable_set()
sstables: Fix cloning of compound_sstable_set
The intention was that its clone() would actually clone the content
of an existing set into a new one, but the current impl is actually
moving the sets instead of copying them. So the original set
becomes invalid. Luckily, this problem isn't triggered as we're
not exposing the compound set in the table's interface, so the
compound_sstable_set::clone() method isn't being called.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Detect large_collections when the number of collection_elements
is above the configured threshold.
Next step would be to record the number of collection_elements
in the system.large_cells table, when the respective
cluster feature is enabled.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And update the sstable elements_in_collection
stats entry.
Next step would be to forward it to
large_data_handler().maybe_record_large_cells().
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add a new large_data_stats type and entry for keeping
the collection_elements_count_threshold and the maximum value
of collection_elements.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The low-level `mutation_fragment_stream_validator` gets `reset()` methods that until now only the high-level `mutation_fragment_stream_validating_filter` had.
Active tombstone validation is pushed down to the low level validator.
The low level validator, which was a pain to use until now due to being very fussy on which subset of its API one used, is made much more robust, not requiring the user to stick to a subset of its API anymore.
Closes#11614
* github.com:scylladb/scylladb:
mutation_fragment_stream_validator: make interface more robust
mutation_fragment_stream_validator: add reset() to validating filter
mutation_fragment_stream_validator: move active tomsbtone validation into low level validator
Currently the active range tombstone change is validated in the high
level `mutation_fragment_stream_validating_stream`, meaning that users of
the low-level `mutation_fragment_stream_validator` don't benefit from
checking that tombstones are properly closed.
This patch moves the validation down to the low-level validator (which
is what the high-level one uses under the hood too), and requires all
users to pass information about changes to the active tombstone for each
fragment.
Long-term index caching in the global cache, as introduced in 4.6, is a major
pessimization for workloads where accesses to the index are (spacially) sparse.
We want to have a way to disable it for the affected workloads.
There is already infrastructure in place for disabling it for BYPASS CACHE
queries. One way of solving the issue is hijacking that infrastructure.
This patch adds a global flag (and a corresponding CLI option) which controls
index caching. Setting the flag to `false` causes all index reads to behave
like they would in BYPASS CACHE queries.
Consequences of this choice:
- The per-SSTable partition_index_cache is unused. Every index_reader has
its own, and they die together. Independent reads can no longer reuse the
work of other reads which hit the same index pages. This is not crucial,
since partition accesses have no (natural) spatial locality. Note that
the original reason for partition_index_cache -- the ability to share
reads for the lower and upper bound of the query -- is unaffected.
- The per-SSTable cached_file is unused. Every index_reader has its own
(uncached) input stream from the index file, and every
bsearch_clustered_cursor has its own cached_file, which dies together with
the cursor. Note that the cursor still can perform its binary search with
caching. However, it won't be able to reuse the file pages read by
index_reader. In particular, if the promoted index is small, and fits inside
the same file page as its index_entry, that page will be re-read.
It can also happen that index_reader will read the same index file page
multiple times. When the summary is so dense that multiple index pages fit in
one index file page, advancing the upper bound, which reads the next index
page, will read the same index file page. Since summary:disk ratio is 1:2000,
this is expected to happen for partitions with size greater than 2000
partition keys.
Fixes#11202
Adds support for splitting large partitions during compaction.
Large partitions introduce many problems, like memory overhead and
breaks incremental compaction promise. We want to split large
partitions across fixed-size fragments. We'll allow a partition
to exceed size limit by 10%, as we don't want to unnecessarily split
partitions that just crossed the limit boundary.
To avoid having to open a minimal of 2 fragments in a read, partition
tombstone will be replicated to every fragment storing the
partition.
The splitting isn't enabled by default, and can be used by
strategies that are run aware like ICS. LCS still cannot support
it as it's still using physical level metadata, not run id.
An incremental reader for sstable runs will follow soon.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
After commit 0796b8c97a, sstable_run won't accept a fragment
that introduces key overlapping. But once we split large partitions,
fragments in the same run may store disjoint clustering ranges
of the same partition. So we're extending sstable_run to look
at clustering dimension, so fragments storing disjoint clustering
ranges of the same large partition can co-exist in the same run.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
An element S1 is completely ordered before S2, if S1's last key is
lower than S2's first key.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
With first partition's first position and last partition's last
partition, we'll be able to determine which fragments composing a
sstable run store a large partition that was split.
Then sstable run will be able to detect if all fragments storing
a given large partition are disjoint in the clustering level.
Fixes#10637.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The new descriptive name is important to make a distinction when
sstable stores position range for first and last rows instead
of min and max.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Let's introduce a reader_concurrency_semaphore for reading sstable
metadata, to avoid an OOM due to unlimited concurrency.
The concurrency on startup is not controlled, so it's important
to enforce a limit on the amount of memory used by the parallel
readers.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This new method allows sstable to load the first row of the first
partition and last row of last partition.
That's useful for incremental reading of sstable run which will
be split at clustering boundary.
To get the first row, it consumes the first row (which can be
either a clustering row or range tombstone change) and returns
its position_in_partition.
To get the last row, it does the same as above but in reverse
mode instead.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
sstable::get_filename() constructs the filename from components, which
takes some work. It happens to be called on every
index_reader::index_reader() call even though it's only used for TRACE
logs. That's 1700 instructions (~1% of a full query) wasted on every
SSTable read. Fix that.
Closes#11485
Pass the tombstone_gc_state from the compaction_strategy
to sstables get_gc_before_* functions using the table state
to get to the tombstone_gc_state.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
and use it to access the repair history maps.
At this introductory patch, we use default-constructed
tombstone_gc_state to access the thread-local maps
temporarily and those use sites will be replaced
in following patches that will gradually pass
the tombstone_gc_state down from the compaction_manager
to where it's used.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Said method currently emits a partition-end. This method is only called
when the last fragment in the stream is a range tombstone change with a
position after all clustered rows. The problem is that
consume_partition_end() is also called unconditionally, resulting in two
partition-end fragments being emitted. The fix is simple: make this
method a no-op, there is nothing to do there.
Also add two tests: one targeted to this bug and another one testing the
crawling reader with random mutations generated for random schema.
Fixes: #11421Closes#11422
Define table_schema_version as a distinct tagged_uuid class,
So it can be differentiated from other uuid-class types,
in particular table_id.
Added reversed(table_schema_version) for convenience
and uniformity since the same logic is currently open coded
in several places.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that scylla requries c++20 there's no
need to define our own implementation in utils/bit_cast.hh
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We know that sstable_run is supposed to contain disjoint files only,
but this assumption can temporarily break when switching strategies
as TWCS, for example, can incorrectly pick the same run id for
sstables in different windows during segregation. So when switching
from TWCS to ICS, it could happen a sstable_run won't contain disjoint
files. We should definitely fix TWCS and any other strategy doing
that, but sstable_run should have disjointness as actual invariant,
not be relaxed on it. Otherwise, we cannot build readers on this
assumption, so more complicated logic have to be added to merge
overlapping files.
After this patch, sstable_run will reject insertion of a file that
will cause the invariant to break, so caller will have to check
that and push that file into a different sstable run.
Closes#11116
Current code accepts priotity class as an argument to various functions
that need it and all its callers use streaming class. Next patches will
needs to sometimes use default class, but it will require heavy patching
of the distributed loader. Things get simpler if the priority class is
kept on sstable_directory on start.
This change also simplifies the ongoing effort on unification of sched
and IO classes.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This work gets us a step closer to compaction groups.
Everything in compaction layer but compaction_manager was converted to table_state.
After this work, we can start implementing compaction groups, as each group will be represented by its own table_state. User-triggered operations that span the entire table, not only a group, can be done by calling the manager operation on behalf of each group and then merging the results, if any.
Closes#11028
* github.com:scylladb/scylla:
compaction: remove forward declaration of replica::table
compaction_manager: make add() and remove() switch to table_state
compaction_manager: make run_custom_job() switch to table_state
compaction_manager: major: switch to table_state
compaction_manager: scrub: switch to table_state
compaction_manager: upgrade: switch to table_state
compaction: table_state: add get_sstables_manager()
compaction_manager: cleanup: switch to table_state
compaction_manager: offstrategy: switch to table_state()
compaction_manager: rewrite_sstables(): switch to table_state
compaction_manager: make run_with_compaction_disabled() switch to table_state
compaction_manager: compaction_reenabler: switch to table_state
compaction_manager: make submit(T) switch to table_state
compaction_manager: task: switch to table_state
compaction: table_state: Add is_auto_compaction_disabled_by_user()
compaction: table_state: Add on_compaction_completion()
compaction: table_state: Add make_sstable()
compaction_manager: make can_proceed switch to table_state
compaction_manager: make stop compaction procedures switch to table_state
compaction_manager: make get_compactions() switch to table_state
compaction_manager: change task::update_history() to use table_state instead
compaction_manager: make can_register_compaction() switch to table_state
compaction_manager: make get_candidates() switch to table_state
compaction_manager: make propagate_replacement() switch to table_state
compaction: Move table::in_strategy_sstables() and switch to table_state
compaction: table_state: Add maintenance sstable set
compaction_manager: make has_table_ongoing_compaction() switch to table_state
compaction_manager: make compaction_disabled() switch to table_state
compaction_manager: switch to table_state for mapping of compaction_state
compaction_manager: move task ctor into source
If the table to remove has no snapshots then
completely remove its directory on storage
as the left-over directory slows down operations on the keyspace
and makes searching for live tables harder.
Fixes#10896
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than using the "pending_delete" string
in `pending_delete_dir_basename()`, so it can
be orderly removed in the next patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Scylla's coding standard requires that each header is self-sufficient,
i.e., it includes whatever other headers it needs - so it can be included
without having to include any other header before it.
We have a test for this, "ninja dev-headers", but it isn't run very
frequently, and it turns out our code deviated from this requirement
in a few places. This patch fixes those places, and after it
"ninja dev-headers" succeeds again.
Fixes#10995
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#10997
In order to allow our Scylla OSS customers the ability to select a version for their documentation, we are migrating the Scylla docs content to the Scylla OSS repository. This PR covers the following points of the [Migration Plan](https://docs.google.com/document/d/15yBf39j15hgUVvjeuGR4MCbYeArqZrO1ir-z_1Urc6A/edit#):
1. Creates a subdirectory for dev docs: /docs/dev
2. Moves the existing dev doc content in the scylla repo to /docs/dev, but keep Alternator docs in /docs.
3. Flattens the structure in /docs/dev (remove the subfolders).
4. Adds redirects from `scylla.docs.scylladb.com/<version>/<document>` to `https://github.com/scylladb/scylla/blob/master/docs/dev/<document>.md`
5. Excludes publishing docs for /docs/devs.
1. Enter the docs folder with `cd docs`.
2. Run `make redirects`.
3. Enter the docs folder and run `make preview`. The docs should build without warnings.
4. Open http://127.0.0.1:5500 in your browser. You shoul donly see the alternator docs.
5. Open http://127.0.0.1:5500/stable/design-notes/IDL.html in your browser. It should redirect you to https://github.com/scylladb/scylla/blob/master/docs/dev/IDL.md and raise a 404 error since this PR is not merged yet.
6. Surf the `docs/dev` folder. It should have all the scylla project internal docs without subdirectories.
Closes#10873
* github.com:scylladb/scylla:
Update docs/conf.py
Update docs/dev/protocols.md
Update docs/dev/README.md
Update docs/dev/README.md
Update docs/conf.py
Fix broken links
Remove source folder
Add redirections
Move dev docs to docs/dev
- Use `sstables::generation_type` in more places
- Enforce conceptual separation of `sstables::generation_type` and `int64_t`
- Fix `extremum_tracker` so that `sstables::generation_type` can be non-default-constructible
Fixes#10796.
Closes#10844
* github.com:scylladb/scylla:
sstables: make generation_type an actual separate type
sstables: use generation_type more soundly
extremum_tracker: do not require default-constructible value types
clone staging sstables so their content may be compacted while
views are built. When done, the hard-linked copy in the staging
subdirectory will be simply unlinked.
Fixes#9559
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Need to erase the shared sstable from _sstables
if insertion to _sstables_reversed fails.
Fixes#10787
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that `generation_type` is used properly (at least in some places),
we turn to the compiler to help keep the generation/value separation
intact.
Fixes#10796.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
The sstable set param isn't being used anywhere, and it's also buggy
as sstable run list isn't being updated accordingly. so it could happen
that set contains sstables but run list is empty, introducing
inconsistency.
we're fortunate that the bug wasn't activated as it would've been
a hard one to catch. found this while auditting the code.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220617203438.74336-1-raphaelsc@scylladb.com>