This change reduces the cost of partition index page construction and
LSA migration. This is achieved by several things working together:
- Index entries don't store keys as separate small objects (managed_bytes).
They are written into one managed_bytes fragmented storage, and entries
hold an offset into it.
Before, we paid 16 bytes for managed_bytes plus LSA descriptor for
the storage (1 byte) plus back-reference in the storage (8 bytes),
so 25 bytes. Now we only pay 4 bytes for the offset. If keys are 16
bytes, that's a reduction from 41 bytes to 20 bytes per key.
- Index entries and key storage are now trivially moveable, so LSA
migration can use memcpy(), which amortizes the cost per key.
- LSA eviction is now trivial and constant time for the whole page
regardless of the number of entries. Page eviction dropped from
14 us to 1 us.
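The layout described above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: `index_entry`, `index_page`, `position` and `migrate_entries` are hypothetical names, and a plain `std::vector<char>` stands in for the managed_bytes fragmented storage. It only shows the two ideas from the list: entries refer to keys by a 4-byte offset into shared storage, and since entries are trivially copyable, a whole page can be relocated with one memcpy().

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string_view>
#include <type_traits>
#include <vector>

// Hypothetical fixed-size entry: it refers to its key by offset and owns nothing.
struct index_entry {
    uint32_t key_offset;  // start of the key in the shared key storage (4 bytes)
    uint64_t position;    // hypothetical payload, e.g. a position in the data file
};
static_assert(std::is_trivially_copyable_v<index_entry>);

// Hypothetical page: all keys are stored back to back in one buffer.
// (Assumption: std::vector<char> stands in for the fragmented key storage.)
class index_page {
    std::vector<index_entry> _entries;
    std::vector<char> _keys;
public:
    void add(std::string_view key, uint64_t position) {
        _entries.push_back({static_cast<uint32_t>(_keys.size()), position});
        _keys.insert(_keys.end(), key.begin(), key.end());
    }
    // A key ends where the next key starts, or at the end of the storage.
    std::string_view key(size_t i) const {
        size_t begin = _entries[i].key_offset;
        size_t end = i + 1 < _entries.size() ? _entries[i + 1].key_offset : _keys.size();
        return std::string_view(_keys.data() + begin, end - begin);
    }
    const std::vector<index_entry>& entries() const { return _entries; }
};

// Relocating all entries is one memcpy; there is no per-key object to fix up.
inline std::vector<index_entry> migrate_entries(const std::vector<index_entry>& src) {
    std::vector<index_entry> dst(src.size());
    if (!src.empty()) {
        std::memcpy(dst.data(), src.data(), src.size() * sizeof(index_entry));
    }
    return dst;
}
```

Because no entry holds a pointer into the key storage, copying the entry array and the key buffer independently keeps every offset valid.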
This improves throughput in a CPU-bound miss-heavy read workload where
the partition index doesn't fit in memory.
scylla perf-simple-query -c1 -m200M --partitions=1000000
Before:
15328.25 tps (150.0 allocs/op, 14.1 logallocs/op, 45.4 tasks/op, 286769 insns/op, 218134 cycles/op, 0 errors)
15279.01 tps (149.9 allocs/op, 14.1 logallocs/op, 45.3 tasks/op, 287696 insns/op, 218637 cycles/op, 0 errors)
15347.78 tps (149.7 allocs/op, 14.1 logallocs/op, 45.3 tasks/op, 285851 insns/op, 217795 cycles/op, 0 errors)
15403.68 tps (149.6 allocs/op, 14.1 logallocs/op, 45.2 tasks/op, 285111 insns/op, 216984 cycles/op, 0 errors)
15189.47 tps (150.0 allocs/op, 14.1 logallocs/op, 45.5 tasks/op, 289509 insns/op, 219602 cycles/op, 0 errors)
15295.04 tps (149.8 allocs/op, 14.1 logallocs/op, 45.3 tasks/op, 288021 insns/op, 218545 cycles/op, 0 errors)
15162.01 tps (149.8 allocs/op, 14.1 logallocs/op, 45.4 tasks/op, 291265 insns/op, 220451 cycles/op, 0 errors)
After:
21620.18 tps (148.4 allocs/op, 13.4 logallocs/op, 43.7 tasks/op, 176817 insns/op, 153183 cycles/op, 0 errors)
20644.03 tps (149.8 allocs/op, 13.5 logallocs/op, 44.3 tasks/op, 187941 insns/op, 160409 cycles/op, 0 errors)
20588.06 tps (150.1 allocs/op, 13.5 logallocs/op, 44.5 tasks/op, 188090 insns/op, 160818 cycles/op, 0 errors)
20789.29 tps (149.5 allocs/op, 13.5 logallocs/op, 44.2 tasks/op, 186495 insns/op, 159382 cycles/op, 0 errors)
20977.89 tps (149.5 allocs/op, 13.4 logallocs/op, 44.2 tasks/op, 183969 insns/op, 158140 cycles/op, 0 errors)
21125.34 tps (149.1 allocs/op, 13.4 logallocs/op, 44.1 tasks/op, 183204 insns/op, 156925 cycles/op, 0 errors)
21244.42 tps (148.6 allocs/op, 13.4 logallocs/op, 43.8 tasks/op, 181276 insns/op, 155973 cycles/op, 0 errors)
The improvement comes mostly from the index now fitting in memory.
When it doesn't fit, the benefits are still visible due to lower LSA overhead.
get0() dates back to the days when Seastar futures carried tuples, and
get0() was a way to get the first (and usually only) element. Now
it's a distraction, and Seastar is likely to deprecate and remove it.
Replace with seastar::future::get(), which does the same thing.
Move partition_index_cache stats from a thread_local variable
to cache_tracker. After the change, partition_index_cache
receives a reference to the stats via constructor, instead of
referencing a global.
This is needed so that cache_tracker can know the memory usage
of index caches (for cache eviction purposes) without relying on
globals.
But it also makes sense even without that motive.
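The shape of this change can be sketched as below. It is only an illustration under stated assumptions: `index_cache_stats`, `tracker`, `index_cache` and all member names are hypothetical stand-ins for the real stats struct, cache_tracker and partition_index_cache, whose actual fields are not given in this message. The point is that the owner of the stats can read the memory usage of every cache that was handed a reference, with no global involved.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical counters; the real stats fields are not listed in this message.
struct index_cache_stats {
    uint64_t hits = 0;
    uint64_t misses = 0;
    uint64_t used_bytes = 0;
};

// Stand-in for cache_tracker: it owns the stats, so it can observe
// the memory used by index caches without consulting a global.
class tracker {
    index_cache_stats _index_stats;
public:
    index_cache_stats& index_stats() { return _index_stats; }
    uint64_t index_memory_usage() const { return _index_stats.used_bytes; }
};

// Stand-in for partition_index_cache: it receives the stats via its
// constructor instead of referencing a thread_local variable.
class index_cache {
    index_cache_stats& _stats;
public:
    explicit index_cache(index_cache_stats& stats) : _stats(stats) {}
    void on_page_loaded(uint64_t bytes) {
        ++_stats.misses;
        _stats.used_bytes += bytes;
    }
    void on_page_hit() { ++_stats.hits; }
    void on_page_evicted(uint64_t bytes) { _stats.used_bytes -= bytes; }
};
```

Several caches constructed with `tracker::index_stats()` then share one set of counters, and tests can construct a private `tracker` instead of resetting global state.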
We have enabled the command line options without changing a
single line of code; we only had to replace the old include
with scylla_test_case.hh.
The next step is to add the x-log-compaction-groups option, which will
determine the number of compaction groups to be used by all
instantiations of replica::table.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
When entry loading fails and there is another request blocked on the
same page, an attempt to erase the failed entry will abort because that
would violate the entry_ptr guarantee, which is supposed to keep the
entry alive.
The fix in 92727ac36c was incomplete. It
only helped for the case of a single loader. This patch takes a more
general approach by relaxing the assert.
The assert manifested like this:
scylla: ./sstables/partition_index_cache.hh:71: sstables::partition_index_cache::entry::~entry(): Assertion `!is_referenced()' failed.
Fixes #10617
Closes #10653
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.
Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.
The changes were applied mechanically with a script, except for
licenses/README.md.
Closes #9937
seastar::later() was recently deprecated and replaced with two
alternatives: a cheap seastar::yield() and an expensive (but more
powerful) seastar::check_for_io_immediately(), which corresponds to
the original later().
This patch replaces all later() calls with the weaker yield(). In
all cases except one, it's unambiguously correct. In one case
(test/perf scheduling_latency_measurer::stop()) it's not so clear-cut,
since check_for_io_immediately() will additionally force a poll and
so will cause more work to be done (but no additional tasks to be
executed). However, I think that any measurement that relies on
measuring the work on the last tick is inaccurate anyway (you need
thousands of ticks to get any amount of confidence in the
measurement), so in the end it doesn't matter what we pick.
Tests: unit (dev)
Closes #9904
As part of this change, the container for partition index pages was
changed from utils::loading_shared_values to intrusive_btree. This is
to avoid reactor stalls which the former induces with a large number
of elements (pages) due to its use of a hashtable under the hood,
which reallocates contiguous storage.
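The reallocation behaviour mentioned above can be observed with standard containers. This is only a rough illustration: `std::unordered_map` stands in for the hashtable behind utils::loading_shared_values, and `count_rehashes` is a hypothetical helper, not part of the code base. Each time the bucket count changes, the table allocates a new contiguous bucket array and relinks every element in one step, which is the kind of work that can stall the reactor when there are many pages. A tree-based container such as a B-tree allocates per node and has no comparable whole-container step on insert.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>

// Counts how many times the hash table's bucket array was replaced
// while inserting n distinct keys. Each replacement is a rehash that
// touches all elements currently in the table.
inline size_t count_rehashes(size_t n) {
    std::unordered_map<size_t, int> table;
    size_t rehashes = 0;
    size_t buckets = table.bucket_count();
    for (size_t i = 0; i < n; ++i) {
        table.emplace(i, 0);
        if (table.bucket_count() != buckets) {
            ++rehashes;
            buckets = table.bucket_count();
        }
    }
    return rehashes;
}
```

The exact number of rehashes depends on the standard library's growth policy, so the sketch only relies on it being nonzero for a large enough n.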