scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-03 21:47:10 +00:00

Author	SHA1	Message	Date
Vladimir Krivopalov	3a9cb54c76	Merge the pair of index_readers into just one tracking a range. Historically, we had two index_readers per sstable_mutation_reader, one for the lower bound and one for the upper bound. Most of the public members of the index_reader class were only called on either of those. With the changes introduced in #2981, two readers are even more tied together as they now have a shared-per-pair list of index pages that needs proper cleanup and was protruding woefully into the caller code. This fix re-structures index_reader so that it now keeps track of both lower and upper bounds. The shared_index_lists structure is encapsulated within index_reader and becomes an internal detail rather than a liability. Fixes #3220. Tests: unit (debug, release) + Tested using cassandra-stress commands from #3189. perf_fast_forward results indicate there is no performance degradation caused by this fix. =========================== Baseline =================================== running: large-partition-skips Testing scanning large partition with skips. 
Reads whole range interleaving reads with skips according to read-skip pattern: read skip time (s) frags frag/s aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu 1 0 0.494458 1000000 2022418 1018 126960 27 0 0 0 0 0 0 0 97.6% 1 1 1.754717 500000 284946 997 127064 6 0 0 3 3 0 0 0 99.9% 1 8 0.551664 111112 201413 997 127064 6 0 0 3 3 0 0 0 99.7% 1 16 0.383888 58824 153232 1001 127080 10 0 0 5 5 0 0 0 99.5% 1 32 0.289073 30304 104832 997 127064 28 0 0 3 3 0 0 0 99.3% 1 64 0.236963 15385 64926 997 127064 122 0 0 3 3 0 0 0 99.2% 1 256 0.172901 3892 22510 997 127064 217 0 0 3 3 0 0 0 95.5% 1 1024 0.117570 976 8301 997 127064 235 0 0 3 3 0 0 0 49.0% 1 4096 0.085811 245 2855 664 27172 375 274 0 3 3 0 0 0 21.4% 64 1 0.512781 984616 1920149 1142 127064 139 0 0 3 3 0 0 0 98.7% 64 8 0.479232 888896 1854833 1001 127080 10 0 0 5 5 0 0 0 99.6% 64 16 0.451193 800000 1773078 997 127064 6 0 0 3 3 0 0 0 99.6% 64 32 0.408684 666688 1631305 997 127064 6 0 0 3 3 0 0 0 99.5% 64 64 0.351906 500032 1420924 997 127064 14 0 0 3 3 0 0 0 99.5% 64 256 0.227008 200000 881026 997 127064 211 0 0 3 3 0 0 0 99.1% 64 1024 0.125803 58880 468032 997 127064 290 0 0 3 3 0 0 0 65.1% 64 4096 0.098155 15424 157139 703 27856 401 267 0 3 3 0 0 0 25.8% running: large-partition-slicing Testing slicing of large partition: offset read time (s) frags frag/s aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu 0 1 0.000701 1 1427 9 296 6 4 0 3 3 0 0 0 12.4% 0 32 0.000698 32 45827 9 296 6 3 0 3 3 0 0 0 13.9% 0 256 0.000808 256 316920 10 328 6 3 0 3 3 0 0 0 24.9% 0 4096 0.004368 4096 937697 25 808 14 3 0 3 3 0 0 0 45.9% 500000 1 0.001196 1 836 13 412 9 4 0 3 3 0 0 0 22.7% 500000 32 0.001200 32 26664 13 412 9 4 0 3 3 0 0 0 22.2% 500000 256 0.001503 256 170338 14 444 10 4 0 3 3 0 0 0 25.3% 500000 4096 0.004351 4096 941465 30 956 20 4 0 3 3 0 0 0 50.7% running: large-partition-slicing-clustering-keys Testing slicing of large partition using clustering keys: offset read time 
(s) frags frag/s aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu 0 1 0.000625 1 1601 7 176 6 0 0 3 3 0 0 0 23.2% 0 32 0.000604 32 53016 7 176 6 0 0 3 3 0 0 0 24.7% 0 256 0.000695 256 368498 8 180 6 0 0 3 3 0 0 0 36.4% 0 4096 0.004083 4096 1003106 20 692 12 1 0 3 3 0 0 0 47.0% 500000 1 0.001198 1 835 12 516 9 3 0 3 3 0 0 0 22.8% 500000 32 0.000981 32 32631 12 388 9 3 0 3 3 0 0 0 29.2% 500000 256 0.001320 256 194011 13 384 10 3 0 3 3 0 0 0 29.0% 500000 4096 0.003944 4096 1038567 25 840 17 2 0 3 3 0 0 0 52.2% running: large-partition-slicing-single-key-reader Testing slicing of large partition, single-partition reader: offset read time (s) frags frag/s aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu 0 1 0.000849 1 1178 9 488 6 0 0 3 3 0 0 0 16.5% 0 32 0.000661 32 48415 9 296 6 0 0 3 3 0 0 0 22.2% 0 256 0.000756 256 338648 10 328 6 0 0 3 3 0 0 0 33.3% 0 4096 0.004147 4096 987610 22 840 12 1 0 3 3 0 0 0 47.9% 500000 1 0.001041 1 960 13 476 9 3 0 3 3 0 0 0 25.9% 500000 32 0.001020 32 31375 13 412 9 3 0 3 3 0 0 0 29.1% 500000 256 0.001265 256 202373 14 444 10 3 0 3 3 0 0 0 32.0% 500000 4096 0.004121 4096 994014 30 988 18 3 0 3 3 0 0 0 52.7% running: large-partition-select-few-rows Testing selecting few rows from a large partition: stride rows time (s) frags frag/s aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu 1000000 1 0.000668 1 1498 9 296 6 4 0 3 3 0 0 0 19.8% 500000 2 0.000976 2 2048 13 412 9 4 0 3 3 0 0 0 29.0% 250000 4 0.001408 4 2842 18 572 12 6 0 3 3 0 0 0 28.8% 125000 8 0.002004 8 3993 29 912 19 10 0 3 3 0 0 0 34.0% 62500 16 0.002883 16 5551 50 1584 32 18 0 3 3 0 0 0 41.9% 2 500000 1.053215 500000 474737 1138 127080 120 0 0 5 5 0 0 0 99.7% running: large-partition-forwarding Testing forwarding with clustering restriction in a large partition: pk-scan time (s) frags frag/s aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu yes 0.002717 2 736 24 2684 8 16 0 3 3 0 0 
0 19.7% no 0.001004 2 1992 13 412 8 2 0 3 3 0 0 0 30.2% running: small-partition-skips Testing scanning small partitions with skips. Reads whole range interleaving reads with skips according to read-skip pattern: read skip time (s) frags frag/s aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu -> 1 0 1.466523 1000000 681885 1369 139732 33 1 0 0 0 0 0 0 99.7% -> 1 1 12.792183 500000 39086 6235 177736 5155 0 0 5123 7663 0 0 0 96.4% -> 1 8 3.451431 111112 32193 6235 177736 5155 0 0 5123 9673 0 0 0 84.8% -> 1 16 2.223815 58824 26452 6234 177704 5154 0 0 5122 9965 0 0 0 75.0% -> 1 32 1.512511 30304 20036 6233 177680 5155 1 0 5123 10090 0 0 0 61.8% -> 1 64 1.129465 15385 13621 6227 177464 5154 0 0 5122 10159 0 0 0 49.5% -> 1 256 0.733282 3892 5308 6211 175464 5178 24 0 5122 10220 0 0 0 33.8% -> 1 1024 0.397302 976 2457 5946 142152 5369 217 0 5120 10235 0 0 0 32.1% -> 1 4096 0.187746 245 1305 5499 81992 5296 142 0 5122 10240 0 0 0 46.8% -> 64 1 2.428488 984616 405444 7332 177736 5155 25 0 5123 5208 0 0 0 79.9% -> 64 8 2.262876 888896 392817 6235 177736 5155 0 0 5123 5654 0 0 0 78.1% -> 64 16 2.137544 800000 374261 6234 177732 5154 0 0 5122 6110 0 0 0 77.1% -> 64 32 1.862466 666688 357960 6235 177736 5155 0 0 5123 6844 0 0 0 73.7% -> 64 64 1.547757 500032 323069 6234 177728 5155 0 0 5123 7651 0 0 0 68.7% -> 64 256 0.914612 200000 218672 6233 177704 5154 0 0 5122 9202 0 0 0 55.5% -> 64 1024 0.475472 58880 123835 6229 177492 5154 5 0 5122 9930 0 0 0 45.4% -> 64 4096 0.271239 15424 56865 6158 169480 5257 114 0 5115 10142 0 0 0 44.1% running: small-partition-slicing Testing slicing small partitions: offset read time (s) frags frag/s aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu 0 1 0.003209 1 312 3 260 2 7 0 1 1 0 0 0 15.5% 0 32 0.004205 32 7610 16 1428 10 0 0 5 5 0 0 0 15.7% 0 256 0.009830 256 26042 97 8572 62 0 0 31 31 0 0 0 18.7% 0 4096 0.015471 4096 264748 100 8704 64 0 0 32 32 0 0 0 48.4% 500000 1 0.003654 1 274 34 492 
33 0 0 32 64 0 0 0 28.7% 500000 32 0.004287 32 7464 40 1260 36 0 0 32 64 0 0 0 26.0% 500000 256 0.009598 256 26673 100 8748 64 4 0 32 64 0 0 0 20.6% 500000 4096 0.014151 4096 289449 119 7892 85 0 0 53 64 0 0 0 54.1% ======================== With the patch ================================ running: large-partition-skips Testing scanning large partition with skips. Reads whole range interleaving reads with skips according to read-skip pattern: read skip time (s) frags frag/s aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu 1 0 0.468887 1000000 2132711 1018 126960 29 0 0 0 0 0 0 0 98.4% 1 1 1.735113 500000 288166 1001 127080 10 0 0 5 5 0 0 0 99.9% 1 8 0.535616 111112 207447 997 127064 6 0 0 3 3 0 0 0 99.6% 1 16 0.365487 58824 160947 1001 127080 15 0 0 5 5 0 0 0 99.5% 1 32 0.272208 30304 111326 997 127064 21 0 0 3 3 0 0 0 99.3% 1 64 0.224049 15385 68668 997 127064 208 0 0 3 3 0 0 0 99.1% 1 256 0.159247 3892 24440 997 127064 250 0 0 3 3 0 0 0 94.7% 1 1024 0.102107 976 9559 997 127064 292 0 0 3 3 0 0 0 53.6% 1 4096 0.084310 245 2906 664 27172 371 273 0 3 3 0 0 0 20.2% 64 1 0.508340 984616 1936923 1142 127064 129 0 0 3 3 0 0 0 98.1% 64 8 0.470369 888896 1889786 997 127064 6 0 0 3 3 0 0 0 99.6% 64 16 0.439917 800000 1818526 1001 127080 10 0 0 5 5 0 0 0 99.6% 64 32 0.397938 666688 1675358 997 127064 6 0 0 3 3 0 0 0 99.5% 64 64 0.344144 500032 1452972 997 127064 18 0 0 3 3 0 0 0 99.4% 64 256 0.219996 200000 909107 997 127064 251 0 0 3 3 0 0 0 99.1% 64 1024 0.124294 58880 473715 997 127064 284 1 0 3 3 0 0 0 62.2% 64 4096 0.097580 15424 158065 703 27856 400 267 0 3 3 0 0 0 25.3% running: large-partition-slicing Testing slicing of large partition: offset read time (s) frags frag/s aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu 0 1 0.000733 1 1365 9 296 6 4 0 3 3 0 0 0 19.3% 0 32 0.000705 32 45417 9 296 6 3 0 3 3 0 0 0 15.3% 0 256 0.000830 256 308364 10 328 6 3 0 3 3 0 0 0 26.7% 0 4096 0.004631 4096 884529 25 808 14 3 0 3 3 0 0 
0 48.1% 500000 1 0.001184 1 845 13 412 9 4 0 3 3 0 0 0 23.7% 500000 32 0.001199 32 26690 13 412 9 4 0 3 3 0 0 0 21.9% 500000 256 0.001530 256 167296 14 444 10 4 0 3 3 0 0 0 26.8% 500000 4096 0.004379 4096 935474 30 956 19 4 0 3 3 0 0 0 51.5% running: large-partition-slicing-clustering-keys Testing slicing of large partition using clustering keys: offset read time (s) frags frag/s aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu 0 1 0.000620 1 1614 7 176 6 0 0 3 3 0 0 0 27.4% 0 32 0.000625 32 51218 7 176 6 0 0 3 3 0 0 0 27.0% 0 256 0.000701 256 365148 8 180 6 0 0 3 3 0 0 0 35.2% 0 4096 0.004063 4096 1008130 20 692 12 1 0 3 3 0 0 0 47.6% 500000 1 0.001208 1 827 12 516 9 3 0 3 3 0 0 0 24.3% 500000 32 0.000973 32 32876 12 388 9 3 0 3 3 0 0 0 28.7% 500000 256 0.001315 256 194612 13 384 10 3 0 3 3 0 0 0 29.0% 500000 4096 0.003950 4096 1037068 25 840 17 2 0 3 3 0 0 0 52.7% running: large-partition-slicing-single-key-reader Testing slicing of large partition, single-partition reader: offset read time (s) frags frag/s aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu 0 1 0.000844 1 1185 9 488 6 0 0 3 3 0 0 0 16.5% 0 32 0.000656 32 48753 9 296 6 0 0 3 3 0 0 0 23.1% 0 256 0.000751 256 341011 10 328 6 0 0 3 3 0 0 0 34.0% 0 4096 0.004173 4096 981632 22 840 12 1 0 3 3 0 0 0 47.0% 500000 1 0.001036 1 966 13 476 9 3 0 3 3 0 0 0 25.4% 500000 32 0.001014 32 31573 13 412 9 3 0 3 3 0 0 0 27.4% 500000 256 0.001280 256 200044 14 444 10 3 0 3 3 0 0 0 31.8% 500000 4096 0.004081 4096 1003746 30 988 18 3 0 3 3 0 0 0 51.6% running: large-partition-select-few-rows Testing selecting few rows from a large partition: stride rows time (s) frags frag/s aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu 1000000 1 0.000668 1 1498 9 296 6 3 0 3 3 0 0 0 21.7% 500000 2 0.000958 2 2088 13 412 9 4 0 3 3 0 0 0 27.7% 250000 4 0.001495 4 2676 18 572 12 6 0 3 3 0 0 0 25.8% 125000 8 0.002069 8 3866 29 912 19 10 0 3 3 0 0 0 30.8% 62500 16 
0.002856 16 5603 50 1584 32 18 0 3 3 0 0 0 41.7% 2 500000 1.063129 500000 470310 1138 127080 120 0 0 5 5 0 0 0 99.7% running: large-partition-forwarding Testing forwarding with clustering restriction in a large partition: pk-scan time (s) frags frag/s aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu yes 0.002567 2 779 24 2684 8 16 0 3 3 0 0 0 21.5% no 0.001013 2 1975 13 412 8 2 0 3 3 0 0 0 28.9% running: small-partition-skips Testing scanning small partitions with skips. Reads whole range interleaving reads with skips according to read-skip pattern: read skip time (s) frags frag/s aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu -> 1 0 1.349959 1000000 740763 1369 139732 33 1 0 0 0 0 0 0 99.7% -> 1 1 12.640751 500000 39555 8144 191168 7064 0 0 7032 11481 0 0 0 96.2% -> 1 8 3.404269 111112 32639 6651 180660 5571 0 0 5539 10505 0 0 0 84.5% -> 1 16 2.175424 58824 27040 6434 179116 5354 0 0 5322 10365 0 0 0 74.3% -> 1 32 1.493365 30304 20292 6335 178404 5257 0 0 5225 10294 0 0 0 61.1% -> 1 64 1.112168 15385 13833 6256 177672 5183 0 0 5151 10217 0 0 0 48.7% -> 1 256 0.719282 3892 5411 6211 175464 5178 24 0 5122 10220 0 0 0 33.3% -> 1 1024 0.393236 976 2482 5946 142152 5369 217 0 5120 10235 0 0 0 30.7% -> 1 4096 0.185284 245 1322 5499 81992 5296 142 0 5122 10240 0 0 0 44.7% -> 64 1 2.356711 984616 417792 7361 177944 5184 21 0 5152 5266 0 0 0 79.1% -> 64 8 2.192331 888896 405457 6253 177868 5173 0 0 5141 5690 0 0 0 77.2% -> 64 16 2.029835 800000 394121 6245 177812 5165 0 0 5133 6132 0 0 0 75.7% -> 64 32 1.806448 666688 369060 6245 177808 5165 0 0 5133 6864 0 0 0 72.6% -> 64 64 1.508492 500032 331478 6242 177788 5163 0 0 5131 7667 0 0 0 67.7% -> 64 256 0.892881 200000 223994 6233 177704 5154 0 0 5122 9202 0 0 0 54.2% -> 64 1024 0.465715 58880 126429 6229 177492 5154 0 0 5122 9930 0 0 0 44.0% -> 64 4096 0.266582 15424 57858 6158 169480 5257 114 0 5115 10142 0 0 0 42.3% running: small-partition-slicing Testing slicing small 
partitions: offset read time (s) frags frag/s aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu 0 1 0.003113 1 321 3 260 2 0 0 1 1 0 0 0 13.4% 0 32 0.004166 32 7682 16 1428 10 0 0 5 5 0 0 0 14.9% 0 256 0.009813 256 26088 97 8572 62 0 0 31 31 0 0 0 18.4% 0 4096 0.014798 4096 276794 100 8704 64 0 0 32 32 0 0 0 46.3% 500000 1 0.003700 1 270 34 492 33 0 0 32 64 0 0 0 28.4% 500000 32 0.004030 32 7940 40 1260 36 0 0 32 64 0 0 0 27.8% 500000 256 0.009514 256 26908 100 8748 64 0 0 32 64 0 0 0 20.2% 500000 4096 0.013368 4096 306413 119 7892 85 0 0 53 64 0 0 0 53.6% Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com> Message-Id: <a72818f79ca4081a606424545b0053fa581d49e7.1522173144.git.vladimir@scylladb.com>	2018-03-29 15:23:31 +03:00
Vladimir Krivopalov	c996191411	Close promoted index streams when closing index_readers. Promoted index input streams must be explicitly closed when closing the index_reader in order to ensure all the pending read-aheads are completed. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-02-20 16:04:15 -08:00
Vladimir Krivopalov	71495691aa	Use separate shared_index_lists per sstable_mutation_reader instead of a single one per sstable. With the changes introduced in #2981, it is no longer safe to share index_entries among multiple sstable_mutation_readers. The original intent behind sharing index_entries among index_readers was to avoid re-reading same pages twice as we have two index readers - lower and upper bound - for every sstable_mutation_reader. In fact, the shared entries were held at the sstable object level so index_readers from different sstable_mutation_readers could have accessed them. Now, with calls to index_reader::advance_to(pos)/index_reader::advance_past(pos), index_entry can be accessed in a way that modifies its state if we need to read more promoted index blocks. It is safe to keep sharing them between two index_readers within the same sstable_mutation_reader as the invariant is maintained that readers can be only moved forward. We cannot safely assume, however, that this invariant holds for multiple sstable_mutation_readers as it may happen that one of them has read and thrown away some promoted index blocks that another one needs. So we restrict sharing to per-sstable_mutation_reader level. Fixes #3189. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com> Message-Id: <83957d007621fe4c62af49aebf1838bb2f32ee55.1518226793.git.vladimir@scylladb.com>	2018-02-10 15:08:45 +02:00
Vladimir Krivopalov	b91c3fd47e	Use advance_past for single partition upper bound. Instead of advancing to the next partition, first try to find a more precise position using promoted index blocks. advance_past() only seeks within currently available PI blocks (or reads the first batch, if never read before) and uses the position if found; otherwise it resorts to advance_to_next_partition(). Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-01-29 11:57:45 -08:00
Vladimir Krivopalov	0a7a56edd5	Simplify continuous_data_consumer::consume_input() interface. Remove redundant input parameter as continuous_data_consumer derivatives would only use themselves as a context. So take it internally and make the function regular (non-template) and having no parameters. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-01-29 11:57:26 -08:00
Vladimir Krivopalov	7e15e436de	Parse promoted index entries lazily upon request rather than immediately. Now promoted index is converted into an input_stream and skipped over instead of being consumed immediately and stored as a single buffer. The only part that is read right away is the deletion time as it is likely to be there in the already read buffer and reading it should both be cheap and prevent from reading the whole promoted index if only deletion time mark is needed. When accessed, promoted index is parsed in chunks, buffer by buffer, to limit memory consumption. Fixes #2981 Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-01-29 11:57:15 -08:00
Duarte Nunes	f217dcc0ce	sstables/sstables: Don't use incorrectly serialized promoted index Promoted indexes generated before this patch by Scylla are considered incorrect if they belong to a non-compound schema, due to #2993. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-11-23 16:45:53 +00:00
Tomasz Grabiec	2113299b61	sstables: index_reader: Reset lower bound for promoted index lookups from advance_to_next_partition() _current_pi_idx was not reset from advance_to_next_partition(), which is used when we skip to the next partition before fully consuming it. As a result, if we try to skip to a clustering position which is before the index block used by the last skip in the previous partition, we would not skip assuming that the new position is in the current block. This may result in more data being read from the sstable than necessary. Fixes #2984 Message-Id: <1510915793-20159-1-git-send-email-tgrabiec@scylladb.com>	2017-11-17 11:00:26 +00:00
Botond Dénes	046a1f9b05	sstables: Get rid of [[deprecated]] index_reader::get_index_entries() Change test code (the only consumers) to read index by partitions. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <b6111e92b5e0729bfa2e76fd848215804174067a.1507297154.git.bdenes@scylladb.com>	2017-10-08 12:18:52 +03:00
Raphael S. Carvalho	050a7019b8	sstables/index_reader: fix index reader for summary entry spanning lots of keys quantity prevents index_reader from reading all index entries of a summary entry that span more than min_index_interval entries. That can happen after introduction of size-based sampling, and consequently, sstable will not be able to return a key whose logical position in the summary entry is beyond min_index_interval. It's ok to not use quantity because index_reader will read all indexes until either the next summary entry or end of file is reached. Fixes test_sstable_conforms_to_mutation_source Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170812045821.25269-1-raphaelsc@scylladb.com>	2017-08-12 09:44:16 +03:00
Paweł Dziepak	960a140880	index_reader: advance_and_check_if_present() use index_comparator	2017-07-26 14:36:37 +01:00
Paweł Dziepak	dc7bad9a50	sstables: cache token in index entries When a sstable reader is fast forwarded some index entries may be read (and compared) multiple times. This patch makes sure that once a token is computed we keep it around and reuse if the entry is accessed again.	2017-07-26 14:36:37 +01:00
Paweł Dziepak	bfb7b56c74	sstable: keep a pre-computed token in summary_entry Each sstable index lookup involves a binary search in the summary and each time a partition key of summary entry is compared with anything its token needs to be calculated. Since we keep summary in the memory all the time it is better to also keep the tokens around.	2017-07-26 14:36:36 +01:00
Tomasz Grabiec	297b4b0cf5	sstables: index_reader: Remove redundant function	2017-05-04 14:59:08 +02:00
Tomasz Grabiec	ec45f1e51d	sstables: index_reader: Fix abort in advance_and_check_if_present() Happens when the key is missing and after all keys in the sstables. Fixes #2345.	2017-05-04 14:59:08 +02:00
Tomasz Grabiec	c5baeed6d2	sstables: Improve logging	2017-04-27 18:43:49 +02:00
Tomasz Grabiec	b523815ac1	sstables: index_reader: Fix advance_to() to include relevant range tombstones Fixes #2326.	2017-04-27 18:43:49 +02:00
Tomasz Grabiec	0b5ba13230	sstables: index_reader: Introduce advance_to_next_partition()	2017-04-20 10:54:38 +02:00
Tomasz Grabiec	4b81844d2e	sstables: index_reader: Introduce advance_and_check_if_present()	2017-04-20 10:54:38 +02:00
Tomasz Grabiec	b92f095bf0	sstables: index_reader: Introduce advance_past()	2017-04-20 10:54:38 +02:00
Tomasz Grabiec	6780756258	sstables: index_reader: Make copyable	2017-04-20 10:54:38 +02:00
Tomasz Grabiec	7db83fa3fe	sstables: index_reader: Optimize advancing to extreme positions	2017-04-20 10:54:38 +02:00
Tomasz Grabiec	f66443c01c	sstables: index_reader: Keep two last pages alive The idea behind caching is that when we have two index readers where one is catching up with the other, each page will be read only once. Currently that's not always the case. There is a case when advance_to() may need to read two pages. That's when the target position is not found in the first page as determined by the summary index. The second reader which catches up would have to read the first page as well, but it would not be in cache any more. To avoid this extra I/O let's keep a reference to the two last pages touched by the index.	2017-04-20 10:54:38 +02:00
Tomasz Grabiec	e35fe7492c	sstables: index_reader: Expose access to partition key and tombstone	2017-04-20 10:54:37 +02:00
Tomasz Grabiec	3fbc0bed6e	sstables: sstable_streamed_mutation: use index in fast_forward_to()	2017-03-28 18:34:55 +02:00
Tomasz Grabiec	a9252dfc58	sstables: Use separate index readers for lower and upper bounds So that lower bound can be advanced within the range.	2017-03-28 18:10:39 +02:00
Tomasz Grabiec	27d86dfe18	sstables: Enable skipping to cells at data_consume_context level	2017-03-28 18:10:39 +02:00
Tomasz Grabiec	aad943523a	sstables: index_reader: Add trace-level logging	2017-03-28 18:10:39 +02:00
Tomasz Grabiec	1dbd2e239e	sstables: index_reader: Share index lists among other index readers Direct motivation for this is to be able to use two index readers from a single mutation reader, one for lower bound of the range and one for the upper bound of the range, without sacrificing optimization of avoiding index reads when forwarding to partition ranges which are close by. After the change, all index readers of given sstable will share index buffers, so lower bound reader can reuse the page read by the upper bound reader. The reason for using two readers will be so that we are able to skip inside the partition range, not only outside of it. This is not possible if we use the same index reader to locate the upper bound of the range, because we may only advance the cursor.	2017-03-28 18:10:39 +02:00
Tomasz Grabiec	e36979da47	sstables: index_reader: Use sstable's schema Makes for a simpler interface.	2017-03-28 18:10:39 +02:00
Tomasz Grabiec	e3e2f037bb	sstables: index_reader: Refactor around the concept of a cursor Index reader already can be queried only with monotonic positions, so the concept of a cursor is ingrained. Making it explicit will make it easier to define behavior for forwarding within the partition. After the change: - lower_bound() is renamed to advance_to() and doesn't return the position, only advances the cursor - data file position for partition under cursor can be obtained at any time with data_file_position()	2017-03-28 18:10:39 +02:00
Tomasz Grabiec	27862fa8f6	sstables: index_reader: Narrow down summary range during lookup Positions passed to lower_bound() must be non-decreasing, so summary indexes as well.	2017-03-28 18:10:39 +02:00
Tomasz Grabiec	02ace99798	sstables: index_reader: Change lookup to work on ring_position_view In preparation for changing the interface to work not only with ranges.	2017-03-28 18:10:39 +02:00
Tomasz Grabiec	6c75614d19	sstables: Fix input_stream not being closed by index_reader Fixes #2022 Message-Id: <1484912679-5729-1-git-send-email-tgrabiec@scylladb.com>	2017-01-20 11:58:33 +00:00
Asias He	e5485f3ea6	Get rid of query::partition_range Use dht::partition_range instead	2016-12-19 08:09:25 +08:00
Gleb Natapov	ae0a2935b4	sstables: fix ad-hoc summary creation If the sstable Summary is not present, Scylla does not refuse to boot but instead creates summary information on the fly. There is a bug in this code though. The Summary file is a map between keys and offsets into the Index file, but the code creates a map between keys and Data file offsets instead. Fix it by keeping the offset of an index entry in the index_entry structure and using it during Summary file creation. Reviewed-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20161116165421.GA22296@scylladb.com>	2016-11-17 11:05:23 +02:00
Paweł Dziepak	20bfa1fa52	sstables: drop sstable::{lower, upper}_bound() Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-19 15:29:08 +01:00
Paweł Dziepak	a530762277	sstables: introduce index_reader index_reader is a helper that implements index lookups. Its goal is to avoid dropping read buffers if they still may be needed (for example to get end bound of the range or after fast forwarding the reader). Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-19 15:29:08 +01:00
Paweł Dziepak	0bc873ace5	sstables: add fast_forward_to() to continuous_data_consumer Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-19 15:29:08 +01:00
Glauber Costa	0f41ef1b84	index_reader: avoid misleading parent name Also add comments about the expected signature of IndexConsumer Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-04-12 11:15:11 -04:00
Glauber Costa	0de3a32147	index reader: make index_consumer a template parameter This is done so we can use other consumers. An example of that, is regeneration of the Summary from an existing Index. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-04-08 17:14:29 -04:00
Pekka Enberg	38a54df863	Fix pre-ScyllaDB copyright statements People keep tripping over the old copyrights and copy-pasting them to new files. Search and replace "Cloudius Systems" with "ScyllaDB". Message-Id: <1460013664-25966-1-git-send-email-penberg@scylladb.com>	2016-04-08 08:12:47 +03:00
Avi Kivity	d5cf0fb2b1	Add license notices	2015-09-20 10:43:39 +03:00
Glauber Costa	aab1ae9dc1	index_entry: don't generate a temporary bytes element The one thing that is still showing pretty high at the read_indexes flamegraph, is allocations. We can, however, do better. Since most of the index is the keys anyway - and we need all of them, the amount of memory we use by copying the buffers over is about the same as the space we would use by just keeping the buffers around. So we can change index_entry to just keep the shared_buffers, and since we always access it through views anyway, that is perfectly fine. The index_entry destructor will then release() the temporary_buffer, instead of doing this after the buffer copy. This gives us a nice additional 4 %. perf_sstable_g --smp 1 --iterations 30 --parallelism 1 --mode index_read Before: 839484.65 +- 585.52 partitions / sec (30 runs, 1 concurrent ops) After: 873323.18 +- 442.52 partitions / sec (30 runs, 1 concurrent ops) Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>	2015-08-29 14:09:53 -05:00
Glauber Costa	1fbd14354f	index_entry: provide a constructor This is a preparation to have their internal fields as private. Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>	2015-08-29 14:05:36 -05:00
Glauber Costa	13d59c9618	index_entry: do away with the disk_string<> fields Now that we are using the NSM, and not the general parser for the index, there is no reason to keep using disk_string<>s in it. Since it is staying in the way of further optimizations, let's get rid of it and use bytes directly. Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>	2015-08-29 14:05:36 -05:00
Glauber Costa	2623362d20	continuous_data_consumer: do not pass reference to child Since the child is a base class, we don't need to pass a reference: we can just cast our 'this' pointer. By doing that, the move constructor can come back. Welcome back, move constructor. Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>	2015-08-29 20:32:56 +03:00
Glauber Costa	babccb1112	read_indexes: convert to the NSM Reading each member individually is not as efficient. Better convert to the NSM. Before: 717101.20 +- 649.77 partitions / sec (30 runs, 1 concurrent ops) After: 838169.80 +- 575.04 partitions / sec (30 runs, 1 concurrent ops) Gains: 16.88 % Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>	2015-08-28 19:07:39 -05:00
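
The cursor model described in commits e3e2f037bb and 27862fa8f6 above (positions passed to the reader are non-decreasing, advance_to() only moves the cursor, and data_file_position() can be queried at any time) can be illustrated with a minimal sketch. This is not Scylla's index_reader: the names toy_index_entry and toy_index_cursor, the string keys, and the end-of-file behaviour are assumptions made for illustration, and only the method names advance_to() and data_file_position() come from the commit messages.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified index entry: a partition key and the offset of
// that partition in the data file.
struct toy_index_entry {
    std::string key;
    uint64_t data_offset;
};

// Toy forward-only cursor over a sorted in-memory index (illustration only).
class toy_index_cursor {
    std::vector<toy_index_entry> _entries; // sorted by key
    std::size_t _pos = 0;                  // entry currently under the cursor
    uint64_t _end_offset;                  // assumed: data file size, reported at EOF
public:
    toy_index_cursor(std::vector<toy_index_entry> entries, uint64_t end_offset)
        : _entries(std::move(entries)), _end_offset(end_offset) {}

    // Moves the cursor to the first entry whose key is >= `key`.
    // Callers must pass non-decreasing keys, so the search starts at the
    // current position and never revisits earlier entries.
    void advance_to(const std::string& key) {
        auto first = _entries.begin() + static_cast<std::ptrdiff_t>(_pos);
        auto it = std::lower_bound(first, _entries.end(), key,
            [](const toy_index_entry& e, const std::string& k) { return e.key < k; });
        _pos = static_cast<std::size_t>(it - _entries.begin());
    }

    bool eof() const { return _pos == _entries.size(); }

    // Data file position of the partition under the cursor.
    uint64_t data_file_position() const {
        return eof() ? _end_offset : _entries[_pos].data_offset;
    }
};
```

Because the cursor only moves forward, each lookup can restrict its search to the entries at or after the current position, which is the same idea as narrowing the summary range during lookup. It also shows why a single cursor of this kind cannot serve both bounds of a range: once it has been advanced to the upper bound, it cannot go back to skip inside the range.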

48 Commits