Supplying a convenience semaphore wrapper, which stops the contained
semaphore when destroyed. It also provides a more convenient
`make_permit()`. This class is intended to make the migration to local
semaphores less painful.
The factory method doesn't match the signature of
`reader_lifecycle_policy::make_reader()`, notably the permit is missing.
Add it as it is important that the wrapping evictable reader and
underlying reader share the permits.
Permits were designed such that there is one permit per read, being
shared by all readers in that read. Make sure readers created by tests
adhere to this.
Naming the concurrency semaphore is currently optional, unnamed
semaphores defaulting to "Unnamed semaphore". Although the most
important semaphores are named, many still aren't, which makes for a
poor debugging experience when one of these times out.
To prevent this, remove the name parameter defaults from those
constructors that have it and require a unique name to be passed in.
Also update all sites creating a semaphore and make sure they use a
unique name.
Since compaction is layered on top of sstables, let's move all compaction code
into a new top-level directory.
This change will give me extra motivation to remove all layer violations, like
sstable calling compaction-specific code, and compaction entanglement with
other components like table and storage service.
Next steps:
- remove all layer violations
- move compaction code in sstables namespace into a new one for compaction.
- move compaction unit tests into its own file
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210707194058.87060-1-raphaelsc@scylladb.com>
Cassandra 3.0 deprecated the 'sstable_compression' attribute and added
'class' as a replacement. Follow by supporting both.
The SSTABLE_COMPRESSION variable is renamed to SSTABLE_COMPRESSION_DEPRECATED
to detect all uses and prevent future misuse.
To prevent old-version nodes from seeing the new name, the
compression_parameters class preserves the key name when it is
constructed from an options map, and emits the same key name when
asked to generate an options map.
Existing unit tests are modified to use the new name, and a test
is added to ensure the old name is still supported.
Fixes#8948.
Closes#8949
"
The main goal of this series is to improve efficiency of reads from large partitions by
reducing amount of I/O needed to read the sstable index. This is achieved by caching
index file pages and partition index entries in memory.
Currently, the pages are cached by individual reads only for the duration of the read.
This was done to facilitate binary search in the promoted index (intra-partition index).
After this series, all reads share the index file page cache, which stays around even after reads stop.
The page cache is subject to eviction. It uses the same region as the current row cache and shares
the LRU with row cache entries. This means that LRU objects need to be virtualized. This series takes
an easy approach and does this by introducing a virtual base class. This adds an overhead to row cache
entry to store the vtable pointer.
SStable indexes have a hierarchy. There is a summary, which is a sparse partition key index into the
full partition index. This one is already kept in memory. The partition index is divided by the summary
into pages. Each entry in the partition index contains promoted index, which is a sparse index into atoms
identified by the clustering key (rows, tombstones).
In order to read the promoted index, the reader needs to read the partition index entry first.
To speed this up, this series also adds caching of partition index entries. This cache survives
reads and is subject to eviction, just like the index file page cache. The unit of caching is
the partition index page. Without this cache, each access to promoted index would have to be
preceded with the parsing of the partition index page containing the partition key.
Performance testing results follow.
1) scylla-bench large partition reads
Populated with:
perf_fast_forward --run-tests=large-partition-skips --datasets=sb-large-part-ds1 \
-c1 -m1G --populate --value-size=1024 --rows=10000000
Single partition, 9G data file, 4MB index file
Test execution:
build/release/scylla -c1 -m4G
scylla-bench -workload uniform -mode read -limit 1 -concurrency 100 -partition-count 1 \
-clustering-row-count 10000000 -duration 60m
TL;DR: after: 2x throughput, 0.5 median latency
Before (c1daf2bb24):
Results
Time (avg): 5m21.033180213s
Total ops: 966951
Total rows: 966951
Operations/s: 3011.997048812112
Rows/s: 3011.997048812112
Latency:
max: 74.055679ms
99.9th: 63.569919ms
99th: 41.320447ms
95th: 38.076415ms
90th: 37.158911ms
median: 34.537471ms
mean: 33.195994ms
After:
Results
Time (avg): 5m14.706669345s
Total ops: 2042831
Total rows: 2042831
Operations/s: 6491.22243800942
Rows/s: 6491.22243800942
Latency:
max: 60.096511ms
99.9th: 35.520511ms
99th: 27.000831ms
95th: 23.986175ms
90th: 21.659647ms
median: 15.040511ms
mean: 15.402076ms
2) scylla-bench small partitions
I tested several scenarios with a varying data set size, e.g. data fully fitting in memory,
half fitting, and being much larger. The improvement varied a bit but in all cases the "after"
code performed slightly better.
Below is a representative run over data set which does not fit in memory.
scylla -c1 -m4G
scylla-bench -workload uniform -mode read -concurrency 400 -partition-count 10000000 \
-clustering-row-count 1 -duration 60m -no-lower-bound
Before:
Time (avg): 51.072411913s
Total ops: 3165885
Total rows: 3165885
Operations/s: 61988.164024260645
Rows/s: 61988.164024260645
Latency:
max: 34.045951ms
99.9th: 25.985023ms
99th: 23.298047ms
95th: 19.070975ms
90th: 17.530879ms
median: 3.899391ms
mean: 6.450616ms
After:
Time (avg): 50.232410679s
Total ops: 3778863
Total rows: 3778863
Operations/s: 75227.58014424688
Rows/s: 75227.58014424688
Latency:
max: 37.027839ms
99.9th: 24.805375ms
99th: 18.219007ms
95th: 14.090239ms
90th: 12.124159ms
median: 4.030463ms
mean: 5.315111ms
The results include the warmup phase which populates the partition index cache, so the hot-cache effect
is dampened in the statistics. See the 99th percentile. Latency gets better after the cache warms up which
moves it lower.
3) perf_fast_forward --run-tests=large-partition-skips
Caching is not used here, included to show there are no regressions for the cold cache case.
TL;DR: No significant change
perf_fast_forward --run-tests=large-partition-skips --datasets=large-part-ds1 -c1 -m1G
Config: rows: 10000000, value size: 2000
Before:
read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu
1 0 36.429822 4 10000000 274500 62 274521 274429 153889.2 153883 19696986 153853 0 0 0 0 0 0 0 22.5%
1 1 36.856236 4 5000000 135662 7 135670 135650 155652.0 155652 19704117 139326 1 0 1 1 0 0 0 38.1%
1 8 36.347667 4 1111112 30569 0 30570 30569 155652.0 155652 19704117 139071 1 0 1 1 0 0 0 19.5%
1 16 36.278866 4 588236 16214 1 16215 16213 155652.0 155652 19704117 139073 1 0 1 1 0 0 0 16.6%
1 32 36.174784 4 303031 8377 0 8377 8376 155652.0 155652 19704117 139056 1 0 1 1 0 0 0 12.3%
1 64 36.147104 4 153847 4256 0 4256 4256 155652.0 155652 19704117 139109 1 0 1 1 0 0 0 11.1%
1 256 9.895288 4 38911 3932 1 3933 3930 100869.2 100868 3178298 59944 38912 0 1 1 0 0 0 14.3%
1 1024 2.599921 4 9757 3753 0 3753 3753 26604.0 26604 801850 15071 9758 0 1 1 0 0 0 14.6%
1 4096 0.784568 4 2441 3111 1 3111 3109 7982.0 7982 205946 3772 2442 0 1 1 0 0 0 13.8%
64 1 36.553975 4 9846154 269359 10 269369 269337 155663.8 155652 19704117 139230 1 0 1 1 0 0 0 28.2%
64 8 36.509694 4 8888896 243467 8 243475 243449 155652.0 155652 19704117 139120 1 0 1 1 0 0 0 26.5%
64 16 36.466282 4 8000000 219381 4 219385 219374 155652.0 155652 19704117 139232 1 0 1 1 0 0 0 24.8%
64 32 36.395926 4 6666688 183171 6 183180 183165 155652.0 155652 19704117 139158 1 0 1 1 0 0 0 21.8%
64 64 36.296856 4 5000000 137753 4 137757 137737 155652.0 155652 19704117 139105 1 0 1 1 0 0 0 17.7%
64 256 20.590392 4 2000000 97133 18 97151 94996 135248.8 131395 7877402 98335 31282 0 1 1 0 0 0 15.7%
64 1024 6.225773 4 588288 94492 1436 95434 88748 46066.5 41321 2324378 30360 9193 0 1 1 0 0 0 15.8%
64 4096 1.856069 4 153856 82893 54 82948 82721 16115.0 16043 583674 11574 2675 0 1 1 0 0 0 16.3%
After:
read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu
1 0 36.429240 4 10000000 274505 38 274515 274417 153887.8 153883 19696986 153849 0 0 0 0 0 0 0 22.4%
1 1 36.933806 4 5000000 135377 15 135385 135354 155658.0 155658 19704085 139398 1 0 1 1 0 0 0 40.0%
1 8 36.419187 4 1111112 30509 2 30510 30507 155658.0 155658 19704085 139233 1 0 1 1 0 0 0 22.0%
1 16 36.353475 4 588236 16181 0 16182 16181 155658.0 155658 19704085 139183 1 0 1 1 0 0 0 19.2%
1 32 36.251356 4 303031 8359 0 8359 8359 155658.0 155658 19704085 139120 1 0 1 1 0 0 0 14.8%
1 64 36.203692 4 153847 4249 0 4250 4249 155658.0 155658 19704085 139071 1 0 1 1 0 0 0 13.0%
1 256 9.965876 4 38911 3904 0 3906 3904 100875.2 100874 3178266 60108 38912 0 1 1 0 0 0 17.9%
1 1024 2.637501 4 9757 3699 1 3700 3697 26610.0 26610 801818 15071 9758 0 1 1 0 0 0 19.5%
1 4096 0.806745 4 2441 3026 1 3027 3024 7988.0 7988 205914 3773 2442 0 1 1 0 0 0 18.3%
64 1 36.611243 4 9846154 268938 5 268942 268921 155669.8 155705 19704085 139330 2 0 1 1 0 0 0 29.9%
64 8 36.559471 4 8888896 243135 11 243156 243124 155658.0 155658 19704085 139261 1 0 1 1 0 0 0 28.1%
64 16 36.510319 4 8000000 219116 15 219126 219101 155658.0 155658 19704085 139173 1 0 1 1 0 0 0 26.3%
64 32 36.439069 4 6666688 182954 9 182964 182943 155658.0 155658 19704085 139274 1 0 1 1 0 0 0 23.2%
64 64 36.334808 4 5000000 137609 11 137612 137596 155658.0 155658 19704085 139258 2 0 1 1 0 0 0 19.1%
64 256 20.624759 4 2000000 96971 88 97059 92717 138296.0 131401 7877370 98332 31282 0 1 1 0 0 0 17.2%
64 1024 6.260598 4 588288 93967 1429 94905 88051 45939.5 41327 2324346 30361 9193 0 1 1 0 0 0 17.8%
64 4096 1.881338 4 153856 81780 140 81920 81520 16109.8 16092 582714 11617 2678 0 1 1 0 0 0 18.2%
4) perf_fast_forward --run-tests=large-partition-slicing
Caching enabled, each line shows the median run from many iterations
TL;DR: We can observe reduction in IO which translates to reduction in execution time,
especially for slicing in the middle of partition.
perf_fast_forward --run-tests=large-partition-slicing --datasets=large-part-ds1 -c1 -m1G --keep-cache-across-test-cases
Config: rows: 10000000, value size: 2000
Before:
offset read time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk allocs tasks insns/f cpu
0 1 0.000491 127 1 2037 24 2109 127 4.0 4 128 2 2 0 1 1 0 0 0 157 80 3058208 15.0%
0 32 0.000561 1740 32 56995 410 60031 47208 5.0 5 160 3 2 0 1 1 0 0 0 386 111 113353 17.5%
0 256 0.002052 488 256 124736 7111 144762 89053 16.6 17 672 14 2 0 1 1 0 0 0 2113 446 52669 18.6%
0 4096 0.016437 61 4096 249199 692 252389 244995 69.4 69 8640 57 5 0 1 1 0 0 0 26638 1717 23321 22.4%
5000000 1 0.002171 221 1 461 2 466 221 25.0 25 268 3 3 0 1 1 0 0 0 638 376 14311524 10.2%
5000000 32 0.002392 404 32 13376 48 13528 13015 27.0 27 332 5 3 0 1 1 0 0 0 931 432 489691 11.9%
5000000 256 0.003659 279 256 69967 764 73130 52563 39.5 41 780 19 3 0 1 1 0 0 0 2689 825 93756 15.8%
5000000 4096 0.018592 55 4096 220313 433 234214 218803 94.2 94 9484 62 9 0 1 1 0 0 0 27349 2213 26562 21.0%
After:
offset read time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk allocs tasks insns/f cpu
0 1 0.000229 115 1 4371 85 4585 115 2.1 2 64 1 1 1 0 0 0 0 0 90 31 1314749 22.2%
0 32 0.000277 2174 32 115674 1015 128109 14144 3.0 3 96 2 1 1 0 0 0 0 0 319 62 52508 26.1%
0 256 0.001786 576 256 143298 5534 179142 113715 14.7 17 544 15 1 1 0 0 0 0 0 2110 453 45419 21.4%
0 4096 0.015498 61 4096 264289 2006 268850 259342 67.4 67 8576 59 4 1 0 0 0 0 0 26657 1738 22897 23.7%
5000000 1 0.000415 233 1 2411 15 2456 234 4.1 4 128 2 2 1 0 0 0 0 0 199 72 2644719 16.8%
5000000 32 0.000635 1413 32 50398 349 51149 46439 6.0 6 192 4 2 1 0 0 0 0 0 458 128 125893 18.6%
5000000 256 0.002028 486 256 126228 3024 146327 82559 17.8 18 1024 13 4 1 0 0 0 0 0 2123 385 51787 19.6%
5000000 4096 0.016836 61 4096 243294 814 263434 241660 73.0 73 9344 62 8 1 0 0 0 0 0 26922 1920 24389 22.4%
Future work:
- Check the impact on non-uniform workloads. Caching sstable indexes takes space away from the row cache
which may reduce the hit ratio.
- Reduce memory footprint of partition index cache. Currently, about 8x bloat over the on-disk size.
- Disable cache population for "bypass cache" reads
- Add a switch to disable sstable index caching, per-node, maybe per-table
- Better sstable index format. Current format leads to inefficiency in caching since only some elements of the cached
page can be hot. A B-tree index would be more efficient. Same applies to the partition index. Only some elements in
the partition index page can be hot.
- Add heuristic for reducing index file IO size when large partitions are anticipated. If we're bound by disk's
bandwidth it's wasteful to read the front of promoted index using 32K IO, better use 4K which should cover the
partition entry and then let binary search read the rest.
In V2:
- Fixed perf_fast_forward regression in the number of IOs used to read partition index page
The reader uses 32K reads, which were split by page cache into 4K reads
Fix by propagating IO size hints to page cache and using single IO to populate it.
New patch: "cached_file: Issue single I/O for the whole read range on miss"
- Avoid large allocations to store partition index page entries (due to managed_vector storage).
There is a unit test which detects this and fails.
Fixed by implementing chunked_managed_vector, based on chunked_vector.
- fixed bug in cached_file::evict_gently() where the wrong allocation strategy was used to free btree chunks
- Simplify region_impl::free_buf() according to Avi's suggestions
- Fit segment_kind in segment_descriptor::_free_space and lift requirement that _buf_pointers emptiness determines the kind
- Workaround sigsegv which was most likely due to coroutine miscompilation. Worked around by manipulating local object scope.
- Wire up system/drop_sstable_caches RESTful API
- Fix use-after-move on permit for the old scanning ka/la index reader
- Fixed more cases of double open_data() in tests leading to assert failure
- Adjusted cached_file class doc to account for changes in behavior.
- Rebased
Fixes#7079.
Refs #363.
"
* tag 'sstable-index-caching-v2' of github.com:tgrabiec/scylla: (39 commits)
api: Drop sstable index caches on system/drop_sstable_caches
cached_file: Issue single I/O for the whole read range on miss
row_cache: cache_tracker: Do not register metrics when constructed for tests
sstables, cached_file: Evict cache gently when sstable is destroyed
sstables: Hide partition_index_cache implementation away from sstables.hh
sstables: Drop shared_index_lists alias
sstables: Destroy partition index cache gently
sstables: Cache partition index pages in LSA and link to LRU
utils: Introduce lsa::weak_ptr<>
sstables: Rename index_list to partition_index_page and shared_index_lists to partition_index_cache
sstables, cached_file: Avoid copying buffers from cache when parsing promoted index
cached_file: Introduce get_page_units()
sstables: read: Document that primitive_consumer::read_32() is alloc-free
sstables: read: Count partition index page evictions
sstables: Drop the _use_binary_search flag from index entries
sstables: index_reader: Keep index objects under LSA
lsa: chunked_managed_vector: Adapt more to managed_vector
utils: lsa: chunked_managed_vector: Make LSA-aware
test: chunked_managed_vector_test: Make exception_safe_class standard layout
lsa: Copy chunked_vector to chunked_managed_vector
...
After 845e36e76 "cql3: Use expr for global-index partition slice",
there is actually more dead code than was initially dropped.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes#8981
Fixes#8952
In 5ebf5835b0 we added a segment
prune after flushing, to deal with deadlocks in shutdown.
This means that calls that issue sync/flush-like ops "for-all",
need to operate on a defensive copy of the list.
Closes#8980
Commit 5adb8e555c marked the ::feed_hash() and a visitor lambda of
digester::feed_hash() as noexcept. This was quite recklesl as the
appending_hash<>::operator()s called by ::feed_hash() are not all
marked noexcept. In particular, the appending_hash<row>() is not
such and seem to throw.
The original intent of the mentioned commit was to facilitate the
partition_hasher in repair/ code. The hasher itself had been removed
by the 0af7a22c21, so it no longer needs the feed_hash-s to be
noexcepts.
The fix is to inherit noexcept from the called hashers, but for the
digester::feed_hash part the noexcept is just removed until clang
compilation bug #50994 is fixed.
fixes: #8983
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210706153608.4299-1-xemul@scylladb.com>
`query_processor::execute_direct()` takes a non-const ref
to query options, meaning it's not safe to pass the same
instance to subsequent invocations of `execute_direct()`
in the tests.
Copy default query options at each invocation of `execute_cql()`
so no possible side-effects can occur.
Tests: unit(dev, debug)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210705094824.243573-2-pa.solodovnikov@scylladb.com>
In pull request #8568, the CDC API changed slightly, with preimage data
gaining extra "delete$k" values for columns whose preimage was missing.
In this new test, we verify that this change did not break Alternator.
We didn't expect it to break Alternator, because it just outputs the known
base-table columns and ignores the columns which weren't a real base-table
column - like this "delete$k".
In the test we set up a stream with preimages, ensure that a real column
(note that an LSI key is a real column instead of a map element) has a
null preimage - and see that the preimage is returned as expected,
without fake columns like "delete$k".
The test passes, showing that PR #8568 was ok.
The test also passes, as expected, on DynamoDB.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210504120121.915829-1-nyh@scylladb.com>
All tests in cql-pytest use a test keyspace created with the SimpleStrategy
replication strategy. This was never intentional. We are recommending to
users that they should use NetworkTopologyStrategy instead, and even
want to deprecate SimpleStrategy (this is #8586), so tests should stop
using SimpleStrategy and should start using the same strategy users would
use in their applications - NetworkTopologyStrategy.
Almost all tests are fixed by a single change in conftest.py which
changes how "test_keyspace" is created. But additionally, tests in
test_keyspace.py which explicitly create keyspaces (that's the point of
that test file...) also had to be modified to use NetworkTopologyStrategy.
Note that none of the tests relied on any special features or
implementation details of SimpleStrategy.
This patch is part of the bigger effort to remove reliance on
SimpleStrategy from all tests, of all types - which we will need to do if
we ever want to forbid SimpleStrategy by default. The issue of that effort:
Refs #8638
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210620102341.195533-1-nyh@scylladb.com>
logalloc has a nice leak/double-free sanitizer, with the nice
feature of capturing backtraces to make error reports easy to
track down. But capturing backtraces is itself very expensive.
This patch makes backtrace capture optional, reducing database_test
runtime from 30 minutes to 20 minutes on my machine.
Closes#8978
"
When a corrupted sstable fails to be read either on regular read or in
regular compaction, our logging is not useful as it can't pinpoint
the SSTable that was being read from, also it may not print useful
details about the corruption.
For example, when a compaction fails on data corruption, a cryptic
message as follow will be dumped:
compaction_manager - compaction failed: std::runtime_error (compressed chunk failed checksum): retrying
there are two problems with the log above:
1) it doesn't tell us which sstable is corrupted
2) it doesn't tell us detailed info about the checksum failure on compressed chunk
with those problems fixed, we'll now get a much more useful message:
compaction_manager - compaction failed: sstables::malformed_sstable_exception (Failed to read partition
from SSTable /home/.../md-74-big-Data.db due to compressed chunk of size 3735 at file offset 406491
failed checksum, expected=0, actual=1422312584): retrying
tests: mode(dev).
"
* 'log_data_corruption_v2.1' of github.com:raphaelsc/scylla:
sstables: Attach sstable name to exception triggered in sstable mutation reader
test/broken_sstable_test: Make test more robust
sstables: Make log more useful when compressed chunk fails checksum
sstables: Use correct exception when compressed chunk fails checksum
Another step towards dropping the `restrictions` class. When calculating the partition slice of a global-index table, use `expression` objects instead of a `restrictions` subclass.
Refs #7217.
Tests: unit (all dev, some debug)
Closes#8950
* github.com:scylladb/scylla:
cql3: Use expr for global-index partition slice
cql3: Fully explain statement_restrictions members
cql3: Pass schema reference not pointer
cql3: Replace count_if with find_atom
cql3: Fix _partition_range_is_simple calculation
cql3: Add test cases for indexed partition column
Listing /etc/systemd/system/*.mount as ghost file seems incorrect,
since user may want to keep using RAID volume / coredump directory after
uninstalling Scylla, or user may want to upgrade enterprise version.
Also, we mixed two types of files as ghost file, it should handle differently:
1. automatically generated by postinst scriptlet
2. generated by user invoked scylla_setup
The package should remove only 1, since 2 is generated by user decision.
However, just dropping .mount from %files section causes another
problem, rpm will remove these files during upgrade, instead of
uninstall (#8924).
To fix both problem, specify .mount files as "%ghost %config".
It will keep files both package upgrade and package remove.
See scylladb/scylla-enterprise#1780Closes#8810Closes#8924Closes#8959
An incorrect spelling of max-io-requests was used in creating the
warning message text. Due to conversion to unsigned, the code would
crash due to bad_lexical_cast exception. The spelling of this
configuration name was fixed in the past (44f3ad836b), but only in the
'if' condition.
Fix by using the correct spelling.
Closes#8963
Currently, reading a page range would issue I/O for each missing
page. This is inefficient, better to issue a single I/O for the whole
range and populate cache from that.
As an optimization, issue a single I/O if the first page is missing.
This is important for index reads which optimistically try to read
32KB of index file to read the partition index page.
Some tests will create two cache_tracker instances because of one
being embedded in the sstable test env.
This would lead to double registration of metrics, which raises run
time error. Avoid by not registering metrics in prometheus in tests at
all.