Commit Graph

61 Commits

Author SHA1 Message Date
Vladimir Krivopalov
d57380f44c sstables: Set/reset range tombstone start from end open marker.
When we skip through a wide partition using promoted index, we may land
at a position that lies in the middle of a range tombstone so we need to
be aware of it. For this, we check if the previous promoted block has an
end open marker and either set the range tombstone start using it or
reset if missing.

Note several things about the implementation.

Firstly, we have to peek back at the previous promoted index block for the
end open marker, and so we have to always preserve one more promoted
index block when we read the next batch so that we can still access it.

Secondly, we use the previous promoted block end position to build
position in partition for the range tombstone start.

Lastly, we don't have a notion of end open marker in older consumers
that work with SSTables of ka/la formats so we only call the
corresponding methods if the consumer supports them.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-05 09:48:17 -07:00
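
Illustration for d57380f44c: a minimal sketch of the set/reset rule described above, not the actual Scylla code. All types and names here (promoted_index_block with string positions, consumer, on_skip_to_block) are simplified, hypothetical stand-ins. After a skip lands on a block, the previous block is inspected: if it carries an end open marker, the range tombstone start is set from that block's end position and the marker's deletion time; otherwise the start is reset.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-ins; not the real Scylla types.
struct deletion_time {
    std::int64_t marked_for_delete_at;
};

struct promoted_index_block {
    std::string start;  // first clustering position covered by the block
    std::string end;    // last clustering position covered by the block
    // Present only if a range tombstone is still open at the end of the block.
    std::optional<deletion_time> end_open_marker;
};

struct open_tombstone {
    std::string start;  // taken from the previous block's end position
    deletion_time tomb;
};

// Hypothetical consumer that tracks the currently open range tombstone.
struct consumer {
    std::optional<open_tombstone> open_rt;

    void set_range_tombstone_start(std::string pos, deletion_time dt) {
        open_rt = open_tombstone{std::move(pos), dt};
    }
    void reset_range_tombstone_start() { open_rt.reset(); }
};

// Called after the reader has skipped to blocks[target].
// The previous block must still be available, which is why one extra
// block has to be preserved when the next batch is read.
void on_skip_to_block(const std::vector<promoted_index_block>& blocks,
                      std::size_t target, consumer& c) {
    if (target == 0) {
        // No previous block, so nothing can be open at this point.
        c.reset_range_tombstone_start();
        return;
    }
    const promoted_index_block& prev = blocks[target - 1];
    if (prev.end_open_marker) {
        c.set_range_tombstone_start(prev.end, *prev.end_open_marker);
    } else {
        c.reset_range_tombstone_start();
    }
}
```

In this sketch a consumer for the older ka/la formats would simply not expose the two methods, and the caller would skip the call altogether.
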
Vladimir Krivopalov
ac0c71bdc1 sstables: For end_open_marker, return both position in partition and deletion time.
Prior to this fix, the end_open_marker was only accessible as a
plain deletion_time structure. Now it also contains the start position
of a promoted index block so that it can be used for setting range
tombstone open bound.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-04 18:16:21 -07:00
Vladimir Krivopalov
4d3467d793 sstables: Add getter for end_open_marker to index_reader.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:51:13 -07:00
Vladimir Krivopalov
5561c713d9 sstables: Do not seek through the promoted index for static row positions.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:51:13 -07:00
Vladimir Krivopalov
917528c427 sstables: Read promoted index stored in SSTables 3.x ('mc') format.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:51:13 -07:00
Vladimir Krivopalov
86d14f8166 sstables: Make promoted_index_block support clustering keys for both ka/la and mc formats.
This is a pre-requisite for parsing promoted index blocks written in
SSTables 'mc' format.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:51:13 -07:00
Vladimir Krivopalov
f50ffa267f sstables: Support parsing index entries from SSTables 3.x format.
With this patch, index_reader is capable of reading index_entries from
both 'ka'/'la' and 'mc' formats.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:50:17 -07:00
Tomasz Grabiec
b17f7257a9 sstables: index_reader: Reduce size of index_entry by indirecting promoted_index
Reduces size of index_entry from 384 bytes to 64 bytes by using
indirection for the optional promoted index instead of embedding it.

Improves query time from 9ms to 4ms in a micro benchmark with a very
large index page.

Message-Id: <1531406354-10089-1-git-send-email-tgrabiec@scylladb.com>
2018-07-12 17:46:58 +03:00
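
Illustration for b17f7257a9: a minimal sketch of the indirection technique, assuming a hypothetical promoted_index payload whose size is chosen arbitrarily. The struct names and sizes are illustrative and do not reproduce the 384 and 64 byte figures from the commit. Embedding an optional payload makes every entry pay for the payload's storage even when it is absent, whereas holding it through a pointer costs one pointer per entry.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <memory>
#include <optional>

// Hypothetical stand-in for a large promoted index payload.
struct promoted_index {
    std::array<std::uint8_t, 256> state{};
};

// Embedded variant: the optional reserves room for the whole payload
// inside every entry, whether or not the payload is present.
struct embedded_entry {
    std::uint64_t position = 0;
    std::optional<promoted_index> pi;
};

// Indirect variant: the entry stores only a pointer; the payload lives
// on the heap and is allocated only for entries that have one.
struct indirect_entry {
    std::uint64_t position = 0;
    std::unique_ptr<promoted_index> pi;  // null when absent

    bool has_promoted_index() const { return pi != nullptr; }
};

static_assert(sizeof(indirect_entry) < sizeof(embedded_entry),
              "indirection should shrink the entry");
```

Smaller entries mean more of them fit in a cache line when a large index page is scanned, which is one plausible reading of the micro benchmark improvement quoted above; the trade-off is an extra allocation and pointer dereference for entries that do have a promoted index.
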
Vladimir Krivopalov
b24eb5c11d sstables: Remove "lower_" from index_reader public methods.
The index_reader class public interface has been amended to only deal
with the upper bound cursor along with advancing the lower bound.
Since the class users can only explicitly operate with the lower bound
cursor (take data file position, advance to the next partition, etc), it
no longer makes sense to specify that the method operates on the lower
bound cursor in its name.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-29 11:48:33 -07:00
Vladimir Krivopalov
30109a693b sstables: Make index_reader::advance_upper_past() method private.
No changes made to the code except that it is moved around.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-29 11:47:48 -07:00
Vladimir Krivopalov
80d1d5017f sstables: Stop using index_reader::advance_upper_past() outside the class.
The only case when it needs to be called is when an index_reader is
advanced to a specific partition as part of sstable_reader
initialisation.

Instead, we're passing an optional upper_bound parameter that is used to
call advance_upper_past() internally if partition is found.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-29 11:47:20 -07:00
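
Illustration for 80d1d5017f: a minimal sketch of the interface change, assuming integer clustering positions and an in-memory map in place of the real index. The class index_reader_sketch and its members are hypothetical. The caller passes an optional upper bound to advance_to(), and the now private advance_upper_past() runs only when the partition is found.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

class index_reader_sketch {
    // Partition key -> sorted clustering positions (stand-in for the index).
    std::map<std::string, std::vector<int>> _partitions;
    std::optional<std::string> _lower;  // partition the lower bound points at
    std::optional<int> _upper;          // first position past the upper bound

    // Private helper: position the upper cursor at the first row that is
    // strictly greater than `bound`, or clear it if there is none.
    void advance_upper_past(const std::vector<int>& rows, int bound) {
        auto it = std::upper_bound(rows.begin(), rows.end(), bound);
        if (it != rows.end()) {
            _upper = *it;
        } else {
            _upper.reset();
        }
    }

public:
    explicit index_reader_sketch(std::map<std::string, std::vector<int>> p)
        : _partitions(std::move(p)) {}

    // Advance to `key`. If the partition exists and an upper bound was
    // supplied, the upper cursor is advanced internally as well.
    bool advance_to(const std::string& key,
                    std::optional<int> upper_bound = std::nullopt) {
        _upper.reset();
        auto it = _partitions.find(key);
        if (it == _partitions.end()) {
            _lower.reset();
            return false;
        }
        _lower = key;
        if (upper_bound) {
            advance_upper_past(it->second, *upper_bound);
        }
        return true;
    }

    const std::optional<int>& upper() const { return _upper; }
};
```

With this shape the caller can no longer advance the upper cursor for a partition that was never found, because the two steps are performed together inside the class.
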
Vladimir Krivopalov
81fba73e9d sstables: Factor out promoted index into a separate class.
An index entry may or may not have a promoted index. All the optional
fields are better scoped under the same class to avoid lots of separate
optional fields and give better representation.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-28 12:28:59 -07:00
Vladimir Krivopalov
fc629b9ca6 sstables: Use std::optional instead of std::experimental optional in index_reader.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-27 16:47:53 -07:00
Vladimir Krivopalov
3a9cb54c76 Merge the pair of index_readers into just one tracking a range.
Historically, we had two index_readers per sstable_mutation_reader,
one for the lower bound and one for the upper bound. Most of the public
members of the index_reader class were only called on either of those.
With the changes introduced in #2981, two readers are even more tied
together as they now have a shared-per-pair list of index pages that
needs proper cleanup and was protruding woefully into the caller code.

This fix re-structures index_reader so that it now keeps track of both
lower and upper bounds. The shared_index_lists structure is encapsulated
within index_reader and becomes an internal detail rather than a
liability.

Fixes #3220.

Tests: unit (debug, release)
+
Tested using cassandra-stress commands from #3189.

perf_fast_forward results indicate there is no performance degradation
caused by this fix.

=========================== Baseline ===================================
running: large-partition-skips
Testing scanning large partition with skips.
Reads whole range interleaving reads with skips according to read-skip pattern:
read    skip      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
1       0         0.494458   1000000    2022418   1018     126960      27       0        0        0        0        0        0        0  97.6%
1       1         1.754717    500000     284946    997     127064       6       0        0        3        3        0        0        0  99.9%
1       8         0.551664    111112     201413    997     127064       6       0        0        3        3        0        0        0  99.7%
1       16        0.383888     58824     153232   1001     127080      10       0        0        5        5        0        0        0  99.5%
1       32        0.289073     30304     104832    997     127064      28       0        0        3        3        0        0        0  99.3%
1       64        0.236963     15385      64926    997     127064     122       0        0        3        3        0        0        0  99.2%
1       256       0.172901      3892      22510    997     127064     217       0        0        3        3        0        0        0  95.5%
1       1024      0.117570       976       8301    997     127064     235       0        0        3        3        0        0        0  49.0%
1       4096      0.085811       245       2855    664      27172     375     274        0        3        3        0        0        0  21.4%
64      1         0.512781    984616    1920149   1142     127064     139       0        0        3        3        0        0        0  98.7%
64      8         0.479232    888896    1854833   1001     127080      10       0        0        5        5        0        0        0  99.6%
64      16        0.451193    800000    1773078    997     127064       6       0        0        3        3        0        0        0  99.6%
64      32        0.408684    666688    1631305    997     127064       6       0        0        3        3        0        0        0  99.5%
64      64        0.351906    500032    1420924    997     127064      14       0        0        3        3        0        0        0  99.5%
64      256       0.227008    200000     881026    997     127064     211       0        0        3        3        0        0        0  99.1%
64      1024      0.125803     58880     468032    997     127064     290       0        0        3        3        0        0        0  65.1%
64      4096      0.098155     15424     157139    703      27856     401     267        0        3        3        0        0        0  25.8%

running: large-partition-slicing
Testing slicing of large partition:
offset  read      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
0       1         0.000701         1       1427      9        296       6       4        0        3        3        0        0        0  12.4%
0       32        0.000698        32      45827      9        296       6       3        0        3        3        0        0        0  13.9%
0       256       0.000808       256     316920     10        328       6       3        0        3        3        0        0        0  24.9%
0       4096      0.004368      4096     937697     25        808      14       3        0        3        3        0        0        0  45.9%
500000  1         0.001196         1        836     13        412       9       4        0        3        3        0        0        0  22.7%
500000  32        0.001200        32      26664     13        412       9       4        0        3        3        0        0        0  22.2%
500000  256       0.001503       256     170338     14        444      10       4        0        3        3        0        0        0  25.3%
500000  4096      0.004351      4096     941465     30        956      20       4        0        3        3        0        0        0  50.7%

running: large-partition-slicing-clustering-keys
Testing slicing of large partition using clustering keys:
offset  read      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
0       1         0.000625         1       1601      7        176       6       0        0        3        3        0        0        0  23.2%
0       32        0.000604        32      53016      7        176       6       0        0        3        3        0        0        0  24.7%
0       256       0.000695       256     368498      8        180       6       0        0        3        3        0        0        0  36.4%
0       4096      0.004083      4096    1003106     20        692      12       1        0        3        3        0        0        0  47.0%
500000  1         0.001198         1        835     12        516       9       3        0        3        3        0        0        0  22.8%
500000  32        0.000981        32      32631     12        388       9       3        0        3        3        0        0        0  29.2%
500000  256       0.001320       256     194011     13        384      10       3        0        3        3        0        0        0  29.0%
500000  4096      0.003944      4096    1038567     25        840      17       2        0        3        3        0        0        0  52.2%

running: large-partition-slicing-single-key-reader
Testing slicing of large partition, single-partition reader:
offset  read      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
0       1         0.000849         1       1178      9        488       6       0        0        3        3        0        0        0  16.5%
0       32        0.000661        32      48415      9        296       6       0        0        3        3        0        0        0  22.2%
0       256       0.000756       256     338648     10        328       6       0        0        3        3        0        0        0  33.3%
0       4096      0.004147      4096     987610     22        840      12       1        0        3        3        0        0        0  47.9%
500000  1         0.001041         1        960     13        476       9       3        0        3        3        0        0        0  25.9%
500000  32        0.001020        32      31375     13        412       9       3        0        3        3        0        0        0  29.1%
500000  256       0.001265       256     202373     14        444      10       3        0        3        3        0        0        0  32.0%
500000  4096      0.004121      4096     994014     30        988      18       3        0        3        3        0        0        0  52.7%

running: large-partition-select-few-rows
Testing selecting few rows from a large partition:
stride  rows      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
1000000 1         0.000668         1       1498      9        296       6       4        0        3        3        0        0        0  19.8%
500000  2         0.000976         2       2048     13        412       9       4        0        3        3        0        0        0  29.0%
250000  4         0.001408         4       2842     18        572      12       6        0        3        3        0        0        0  28.8%
125000  8         0.002004         8       3993     29        912      19      10        0        3        3        0        0        0  34.0%
62500   16        0.002883        16       5551     50       1584      32      18        0        3        3        0        0        0  41.9%
2       500000    1.053215    500000     474737   1138     127080     120       0        0        5        5        0        0        0  99.7%

running: large-partition-forwarding
Testing forwarding with clustering restriction in a large partition:
pk-scan   time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
yes       0.002717         2        736     24       2684       8      16        0        3        3        0        0        0  19.7%
no        0.001004         2       1992     13        412       8       2        0        3        3        0        0        0  30.2%

running: small-partition-skips
Testing scanning small partitions with skips.
Reads whole range interleaving reads with skips according to read-skip pattern:
   read    skip      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
-> 1       0         1.466523   1000000     681885   1369     139732      33       1        0        0        0        0        0        0  99.7%
-> 1       1        12.792183    500000      39086   6235     177736    5155       0        0     5123     7663        0        0        0  96.4%
-> 1       8         3.451431    111112      32193   6235     177736    5155       0        0     5123     9673        0        0        0  84.8%
-> 1       16        2.223815     58824      26452   6234     177704    5154       0        0     5122     9965        0        0        0  75.0%
-> 1       32        1.512511     30304      20036   6233     177680    5155       1        0     5123    10090        0        0        0  61.8%
-> 1       64        1.129465     15385      13621   6227     177464    5154       0        0     5122    10159        0        0        0  49.5%
-> 1       256       0.733282      3892       5308   6211     175464    5178      24        0     5122    10220        0        0        0  33.8%
-> 1       1024      0.397302       976       2457   5946     142152    5369     217        0     5120    10235        0        0        0  32.1%
-> 1       4096      0.187746       245       1305   5499      81992    5296     142        0     5122    10240        0        0        0  46.8%
-> 64      1         2.428488    984616     405444   7332     177736    5155      25        0     5123     5208        0        0        0  79.9%
-> 64      8         2.262876    888896     392817   6235     177736    5155       0        0     5123     5654        0        0        0  78.1%
-> 64      16        2.137544    800000     374261   6234     177732    5154       0        0     5122     6110        0        0        0  77.1%
-> 64      32        1.862466    666688     357960   6235     177736    5155       0        0     5123     6844        0        0        0  73.7%
-> 64      64        1.547757    500032     323069   6234     177728    5155       0        0     5123     7651        0        0        0  68.7%
-> 64      256       0.914612    200000     218672   6233     177704    5154       0        0     5122     9202        0        0        0  55.5%
-> 64      1024      0.475472     58880     123835   6229     177492    5154       5        0     5122     9930        0        0        0  45.4%
-> 64      4096      0.271239     15424      56865   6158     169480    5257     114        0     5115    10142        0        0        0  44.1%

running: small-partition-slicing
Testing slicing small partitions:
offset  read      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
0       1         0.003209         1        312      3        260       2       7        0        1        1        0        0        0  15.5%
0       32        0.004205        32       7610     16       1428      10       0        0        5        5        0        0        0  15.7%
0       256       0.009830       256      26042     97       8572      62       0        0       31       31        0        0        0  18.7%
0       4096      0.015471      4096     264748    100       8704      64       0        0       32       32        0        0        0  48.4%
500000  1         0.003654         1        274     34        492      33       0        0       32       64        0        0        0  28.7%
500000  32        0.004287        32       7464     40       1260      36       0        0       32       64        0        0        0  26.0%
500000  256       0.009598       256      26673    100       8748      64       4        0       32       64        0        0        0  20.6%
500000  4096      0.014151      4096     289449    119       7892      85       0        0       53       64        0        0        0  54.1%

========================  With the patch ================================
running: large-partition-skips
Testing scanning large partition with skips.
Reads whole range interleaving reads with skips according to read-skip pattern:
read    skip      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
1       0         0.468887   1000000    2132711   1018     126960      29       0        0        0        0        0        0        0  98.4%
1       1         1.735113    500000     288166   1001     127080      10       0        0        5        5        0        0        0  99.9%
1       8         0.535616    111112     207447    997     127064       6       0        0        3        3        0        0        0  99.6%
1       16        0.365487     58824     160947   1001     127080      15       0        0        5        5        0        0        0  99.5%
1       32        0.272208     30304     111326    997     127064      21       0        0        3        3        0        0        0  99.3%
1       64        0.224049     15385      68668    997     127064     208       0        0        3        3        0        0        0  99.1%
1       256       0.159247      3892      24440    997     127064     250       0        0        3        3        0        0        0  94.7%
1       1024      0.102107       976       9559    997     127064     292       0        0        3        3        0        0        0  53.6%
1       4096      0.084310       245       2906    664      27172     371     273        0        3        3        0        0        0  20.2%
64      1         0.508340    984616    1936923   1142     127064     129       0        0        3        3        0        0        0  98.1%
64      8         0.470369    888896    1889786    997     127064       6       0        0        3        3        0        0        0  99.6%
64      16        0.439917    800000    1818526   1001     127080      10       0        0        5        5        0        0        0  99.6%
64      32        0.397938    666688    1675358    997     127064       6       0        0        3        3        0        0        0  99.5%
64      64        0.344144    500032    1452972    997     127064      18       0        0        3        3        0        0        0  99.4%
64      256       0.219996    200000     909107    997     127064     251       0        0        3        3        0        0        0  99.1%
64      1024      0.124294     58880     473715    997     127064     284       1        0        3        3        0        0        0  62.2%
64      4096      0.097580     15424     158065    703      27856     400     267        0        3        3        0        0        0  25.3%

running: large-partition-slicing
Testing slicing of large partition:
offset  read      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
0       1         0.000733         1       1365      9        296       6       4        0        3        3        0        0        0  19.3%
0       32        0.000705        32      45417      9        296       6       3        0        3        3        0        0        0  15.3%
0       256       0.000830       256     308364     10        328       6       3        0        3        3        0        0        0  26.7%
0       4096      0.004631      4096     884529     25        808      14       3        0        3        3        0        0        0  48.1%
500000  1         0.001184         1        845     13        412       9       4        0        3        3        0        0        0  23.7%
500000  32        0.001199        32      26690     13        412       9       4        0        3        3        0        0        0  21.9%
500000  256       0.001530       256     167296     14        444      10       4        0        3        3        0        0        0  26.8%
500000  4096      0.004379      4096     935474     30        956      19       4        0        3        3        0        0        0  51.5%

running: large-partition-slicing-clustering-keys
Testing slicing of large partition using clustering keys:
offset  read      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
0       1         0.000620         1       1614      7        176       6       0        0        3        3        0        0        0  27.4%
0       32        0.000625        32      51218      7        176       6       0        0        3        3        0        0        0  27.0%
0       256       0.000701       256     365148      8        180       6       0        0        3        3        0        0        0  35.2%
0       4096      0.004063      4096    1008130     20        692      12       1        0        3        3        0        0        0  47.6%
500000  1         0.001208         1        827     12        516       9       3        0        3        3        0        0        0  24.3%
500000  32        0.000973        32      32876     12        388       9       3        0        3        3        0        0        0  28.7%
500000  256       0.001315       256     194612     13        384      10       3        0        3        3        0        0        0  29.0%
500000  4096      0.003950      4096    1037068     25        840      17       2        0        3        3        0        0        0  52.7%

running: large-partition-slicing-single-key-reader
Testing slicing of large partition, single-partition reader:
offset  read      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
0       1         0.000844         1       1185      9        488       6       0        0        3        3        0        0        0  16.5%
0       32        0.000656        32      48753      9        296       6       0        0        3        3        0        0        0  23.1%
0       256       0.000751       256     341011     10        328       6       0        0        3        3        0        0        0  34.0%
0       4096      0.004173      4096     981632     22        840      12       1        0        3        3        0        0        0  47.0%
500000  1         0.001036         1        966     13        476       9       3        0        3        3        0        0        0  25.4%
500000  32        0.001014        32      31573     13        412       9       3        0        3        3        0        0        0  27.4%
500000  256       0.001280       256     200044     14        444      10       3        0        3        3        0        0        0  31.8%
500000  4096      0.004081      4096    1003746     30        988      18       3        0        3        3        0        0        0  51.6%

running: large-partition-select-few-rows
Testing selecting few rows from a large partition:
stride  rows      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
1000000 1         0.000668         1       1498      9        296       6       3        0        3        3        0        0        0  21.7%
500000  2         0.000958         2       2088     13        412       9       4        0        3        3        0        0        0  27.7%
250000  4         0.001495         4       2676     18        572      12       6        0        3        3        0        0        0  25.8%
125000  8         0.002069         8       3866     29        912      19      10        0        3        3        0        0        0  30.8%
62500   16        0.002856        16       5603     50       1584      32      18        0        3        3        0        0        0  41.7%
2       500000    1.063129    500000     470310   1138     127080     120       0        0        5        5        0        0        0  99.7%

running: large-partition-forwarding
Testing forwarding with clustering restriction in a large partition:
pk-scan   time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
yes       0.002567         2        779     24       2684       8      16        0        3        3        0        0        0  21.5%
no        0.001013         2       1975     13        412       8       2        0        3        3        0        0        0  28.9%

running: small-partition-skips
Testing scanning small partitions with skips.
Reads whole range interleaving reads with skips according to read-skip pattern:
   read    skip      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
-> 1       0         1.349959   1000000     740763   1369     139732      33       1        0        0        0        0        0        0  99.7%
-> 1       1        12.640751    500000      39555   8144     191168    7064       0        0     7032    11481        0        0        0  96.2%
-> 1       8         3.404269    111112      32639   6651     180660    5571       0        0     5539    10505        0        0        0  84.5%
-> 1       16        2.175424     58824      27040   6434     179116    5354       0        0     5322    10365        0        0        0  74.3%
-> 1       32        1.493365     30304      20292   6335     178404    5257       0        0     5225    10294        0        0        0  61.1%
-> 1       64        1.112168     15385      13833   6256     177672    5183       0        0     5151    10217        0        0        0  48.7%
-> 1       256       0.719282      3892       5411   6211     175464    5178      24        0     5122    10220        0        0        0  33.3%
-> 1       1024      0.393236       976       2482   5946     142152    5369     217        0     5120    10235        0        0        0  30.7%
-> 1       4096      0.185284       245       1322   5499      81992    5296     142        0     5122    10240        0        0        0  44.7%
-> 64      1         2.356711    984616     417792   7361     177944    5184      21        0     5152     5266        0        0        0  79.1%
-> 64      8         2.192331    888896     405457   6253     177868    5173       0        0     5141     5690        0        0        0  77.2%
-> 64      16        2.029835    800000     394121   6245     177812    5165       0        0     5133     6132        0        0        0  75.7%
-> 64      32        1.806448    666688     369060   6245     177808    5165       0        0     5133     6864        0        0        0  72.6%
-> 64      64        1.508492    500032     331478   6242     177788    5163       0        0     5131     7667        0        0        0  67.7%
-> 64      256       0.892881    200000     223994   6233     177704    5154       0        0     5122     9202        0        0        0  54.2%
-> 64      1024      0.465715     58880     126429   6229     177492    5154       0        0     5122     9930        0        0        0  44.0%
-> 64      4096      0.266582     15424      57858   6158     169480    5257     114        0     5115    10142        0        0        0  42.3%

running: small-partition-slicing
Testing slicing small partitions:
offset  read      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
0       1         0.003113         1        321      3        260       2       0        0        1        1        0        0        0  13.4%
0       32        0.004166        32       7682     16       1428      10       0        0        5        5        0        0        0  14.9%
0       256       0.009813       256      26088     97       8572      62       0        0       31       31        0        0        0  18.4%
0       4096      0.014798      4096     276794    100       8704      64       0        0       32       32        0        0        0  46.3%
500000  1         0.003700         1        270     34        492      33       0        0       32       64        0        0        0  28.4%
500000  32        0.004030        32       7940     40       1260      36       0        0       32       64        0        0        0  27.8%
500000  256       0.009514       256      26908    100       8748      64       0        0       32       64        0        0        0  20.2%
500000  4096      0.013368      4096     306413    119       7892      85       0        0       53       64        0        0        0  53.6%

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <a72818f79ca4081a606424545b0053fa581d49e7.1522173144.git.vladimir@scylladb.com>
2018-03-29 15:23:31 +03:00
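
Illustration for 3a9cb54c76: a minimal sketch of a single reader that tracks both bounds and owns the page list they share, so that callers no longer manage it. The class range_index_reader, its method names, and the string "pages" are hypothetical stand-ins, not the real index_reader or shared_index_lists.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

class range_index_reader {
    struct cursor {
        std::size_t page = 0;
    };

    // Page list shared by both cursors. It is private, so its lifetime
    // and cleanup are an internal detail of the reader.
    std::map<std::size_t, std::string> _pages;
    std::size_t _loads = 0;  // counts simulated reads of index pages
    cursor _lower;
    cursor _upper;

    const std::string& get_page(std::size_t n) {
        auto it = _pages.find(n);
        if (it == _pages.end()) {
            ++_loads;  // simulated read of an index page
            it = _pages.emplace(n, "page-" + std::to_string(n)).first;
        }
        return it->second;
    }

public:
    const std::string& advance_lower_to(std::size_t page) {
        _lower.page = page;
        return get_page(page);
    }
    const std::string& advance_upper_to(std::size_t page) {
        _upper.page = page;
        return get_page(page);
    }
    std::size_t loads() const { return _loads; }

    // Single place where the shared pages are released.
    void close() { _pages.clear(); }
};
```

When both cursors point at the same page, the page is loaded once, which is the benefit the pair of readers previously obtained through an externally visible shared list.
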
Vladimir Krivopalov
c996191411 Close promoted index streams when closing index_readers.
Promoted index input streams must be explicitly closed when closing the
index_reader in order to ensure all the pending read-aheads are
completed.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-02-20 16:04:15 -08:00
Vladimir Krivopalov
71495691aa Use separate shared_index_lists per sstable_mutation_reader instead of a single one per sstable.
With the changes introduced in #2981, it is no longer safe to share
index_entries among multiple sstable_mutation_readers.
The original intent behind sharing index_entries among index_readers was
to avoid reading the same pages twice, as we have two index readers
(lower and upper bound) for every sstable_mutation_reader. In fact, the
shared entries were held at the sstable object level, so index_readers
from different sstable_mutation_readers could access them.

Now, with calls to index_reader::advance_to(pos)/index_reader::advance_past(pos),
index_entry can be accessed in a way that modifies its state if we need
to read more promoted index blocks. It is safe to keep sharing them
between two index_readers within the same sstable_mutation_reader as the
invariant is maintained that readers can be only moved forward.
We cannot safely assume, however, that this invariant holds for multiple
sstable_mutation_readers as it may happen that one of them has read and
thrown away some promoted index blocks that another one needs. So we
restrict sharing to per-sstable_mutation_reader level.

Fixes #3189.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <83957d007621fe4c62af49aebf1838bb2f32ee55.1518226793.git.vladimir@scylladb.com>
2018-02-10 15:08:45 +02:00
Vladimir Krivopalov
b91c3fd47e Use advance_past for single partition upper bound.
Instead of advancing to the next partition, first try to find a more
precise position using promoted index blocks.
advance_past() only seeks within the currently available PI blocks (or
reads the first batch, if none has been read yet) and uses the position
if found; otherwise it falls back to advance_to_next_partition().
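
The fallback described above can be sketched roughly as follows. This is
a simplified illustration, not the Scylla implementation: pi_block,
find_past() and upper_bound_offset() are hypothetical names, and plain
integers stand in for clustering positions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for a promoted index block: a clustering range
// [start, end] mapped to an offset in the data file.
struct pi_block {
    int start;
    int end;
    uint64_t data_offset;
};

// Returns the data offset of the first loaded block that starts after
// `pos`, or nothing if no loaded block qualifies. Blocks are assumed to
// be sorted by `start`.
std::optional<uint64_t> find_past(const std::vector<pi_block>& loaded, int pos) {
    auto it = std::upper_bound(loaded.begin(), loaded.end(), pos,
        [](int p, const pi_block& b) { return p < b.start; });
    if (it == loaded.end()) {
        return std::nullopt;
    }
    return it->data_offset;
}

// Upper bound for a read that ends at `pos`: prefer the promoted index,
// otherwise fall back to the start of the next partition.
uint64_t upper_bound_offset(const std::vector<pi_block>& loaded, int pos,
                            uint64_t next_partition_offset) {
    return find_past(loaded, pos).value_or(next_partition_offset);
}
```

Only the already loaded blocks are searched, so a miss costs no extra
I/O; the read simply extends to the next partition.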

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-01-29 11:57:45 -08:00
Vladimir Krivopalov
0a7a56edd5 Simplify continuous_data_consumer::consume_input() interface.
Remove the redundant input parameter, since continuous_data_consumer
derivatives only ever use themselves as the context. Obtain the context
internally and make the function a regular (non-template) one with no
parameters.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-01-29 11:57:26 -08:00
Vladimir Krivopalov
7e15e436de Parse promoted index entries lazily upon request rather than immediately.
Now the promoted index is converted into an input_stream and skipped over
instead of being consumed immediately and stored as a single buffer.
The only part that is read right away is the deletion time, as it is
likely to be in the already-read buffer; reading it should be cheap and
avoids reading the whole promoted index when only the deletion time is
needed.

When accessed, promoted index is parsed in chunks, buffer by buffer, to
limit memory consumption.

Fixes #2981

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-01-29 11:57:15 -08:00
Duarte Nunes
f217dcc0ce sstables/sstables: Don't use incorrectly serialized promoted index
Promoted indexes generated before this patch by Scylla are considered
incorrect if they belong to a non-compound schema, due to #2993.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:53 +00:00
Tomasz Grabiec
2113299b61 sstables: index_reader: Reset lower bound for promoted index lookups from advance_to_next_partition()
_current_pi_idx was not reset from advance_to_next_partition(), which
is used when we skip to the next partition before fully consuming
it. As a result, if we try to skip to a clustering position which is
before the index block used by the last skip in the previous
partition, we would not skip assuming that the new position is in the
current block. This may result in more data being read from the
sstable than necessary.

Fixes #2984
Message-Id: <1510915793-20159-1-git-send-email-tgrabiec@scylladb.com>
2017-11-17 11:00:26 +00:00
Botond Dénes
046a1f9b05 sstables: Get rid of [[deprecated]] index_reader::get_index_entries()
Change test code (the only consumers) to read index by partitions.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <b6111e92b5e0729bfa2e76fd848215804174067a.1507297154.git.bdenes@scylladb.com>
2017-10-08 12:18:52 +03:00
Raphael S. Carvalho
050a7019b8 sstables/index_reader: fix index reader for summary entry spanning lots of keys
Using quantity prevents index_reader from reading all index entries of a
summary entry that spans more than min_index_interval entries. That can
happen after the introduction of size-based sampling; as a result, the
sstable is not able to return a key whose logical position within the
summary entry is beyond min_index_interval. It is fine not to use
quantity because index_reader reads index entries until either the next
summary entry or the end of file is reached.

Fixes test_sstable_conforms_to_mutation_source

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170812045821.25269-1-raphaelsc@scylladb.com>
2017-08-12 09:44:16 +03:00
Paweł Dziepak
960a140880 index_reader: advance_and_check_if_present() use index_comparator 2017-07-26 14:36:37 +01:00
Paweł Dziepak
dc7bad9a50 sstables: cache token in index entries
When an sstable reader is fast forwarded, some index entries may be read
(and compared) multiple times. This patch makes sure that once a token
is computed we keep it around and reuse it if the entry is accessed again.
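
A minimal sketch of the caching idea, under stated assumptions:
cached_entry is a hypothetical class, and std::hash merely stands in for
the partitioner's token function, which works differently in Scylla.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <utility>

// Hypothetical index entry that computes its token at most once.
class cached_entry {
    std::string _key;
    mutable std::optional<uint64_t> _token;
    mutable int _computations = 0;
public:
    explicit cached_entry(std::string key) : _key(std::move(key)) {}

    // std::hash is a placeholder for the real token computation.
    uint64_t token() const {
        if (!_token) {
            _token = std::hash<std::string>{}(_key);
            ++_computations;
        }
        return *_token;
    }

    // Number of times the token was actually computed.
    int computations() const { return _computations; }
};
```

Repeated comparisons against the same entry then pay for the token
computation only on first access.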
2017-07-26 14:36:37 +01:00
Paweł Dziepak
bfb7b56c74 sstable: keep a pre-computed token in summary_entry
Each sstable index lookup involves a binary search in the summary and
each time a partition key of summary entry is compared with anything its
token needs to be calculated.
Since we keep summary in the memory all the time it is better to also
keep the tokens around.
2017-07-26 14:36:36 +01:00
Tomasz Grabiec
297b4b0cf5 sstables: index_reader: Remove redundant function 2017-05-04 14:59:08 +02:00
Tomasz Grabiec
ec45f1e51d sstables: index_reader: Fix abort in advance_and_check_if_present()
Happens when the key is missing and after all keys in the sstables.

Fixes #2345.
2017-05-04 14:59:08 +02:00
Tomasz Grabiec
c5baeed6d2 sstables: Improve logging 2017-04-27 18:43:49 +02:00
Tomasz Grabiec
b523815ac1 sstables: index_reader: Fix advance_to() to include relevant range tombstones
Fixes #2326.
2017-04-27 18:43:49 +02:00
Tomasz Grabiec
0b5ba13230 sstables: index_reader: Introduce advance_to_next_partition() 2017-04-20 10:54:38 +02:00
Tomasz Grabiec
4b81844d2e sstables: index_reader: Introduce advance_and_check_if_present() 2017-04-20 10:54:38 +02:00
Tomasz Grabiec
b92f095bf0 sstables: index_reader: Introduce advance_past() 2017-04-20 10:54:38 +02:00
Tomasz Grabiec
6780756258 sstables: index_reader: Make copyable 2017-04-20 10:54:38 +02:00
Tomasz Grabiec
7db83fa3fe sstables: index_reader: Optimize advancing to extreme positions 2017-04-20 10:54:38 +02:00
Tomasz Grabiec
f66443c01c sstables: index_reader: Keep two last pages alive
The idea behind caching is that when we have two index readers where
one is catching up with the other, each page will be read only
once. Currently that's not always the case. There is a case when
advance_to() may need to read two pages. That's when the target
position is not found in the first page as determined by the summary
index. The second reader which catches up would have to read the first
page as well, but it would not be in cache any more. To avoid this
extra I/O let's keep a reference to the two last pages touched by the
index.
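
The effect can be illustrated with a small sketch. It assumes a shared
cache that holds pages only weakly, so a page survives while some reader
pins it; page, page_cache and reader are hypothetical names, not the
Scylla types.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <utility>

struct page { int id; };

// Hypothetical shared cache: a page stays alive only while a reader
// holds a reference to it.
class page_cache {
    std::map<int, std::weak_ptr<page>> _pages;
public:
    int reads = 0;  // counts simulated disk reads

    std::shared_ptr<page> get(int id) {
        if (auto cached = _pages[id].lock()) {
            return cached;
        }
        ++reads;
        auto fresh = std::make_shared<page>(page{id});
        _pages[id] = fresh;
        return fresh;
    }
};

// Reader that pins the two most recently touched pages.
class reader {
    page_cache& _cache;
    std::shared_ptr<page> _prev;
    std::shared_ptr<page> _curr;
public:
    explicit reader(page_cache& c) : _cache(c) {}

    void touch(int id) {
        auto p = _cache.get(id);
        if (p != _curr) {
            _prev = std::move(_curr);
            _curr = std::move(p);
        }
    }
};
```

If the leading reader touches pages 1 and 2 during one advance, both
remain pinned, so the reader that catches up finds page 1 without
another read.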
2017-04-20 10:54:38 +02:00
Tomasz Grabiec
e35fe7492c sstables: index_reader: Expose access to partition key and tombstone 2017-04-20 10:54:37 +02:00
Tomasz Grabiec
3fbc0bed6e sstables: sstable_streamed_mutation: use index in fast_forward_to() 2017-03-28 18:34:55 +02:00
Tomasz Grabiec
a9252dfc58 sstables: Use separate index readers for lower and upper bounds
So that lower bound can be advanced within the range.
2017-03-28 18:10:39 +02:00
Tomasz Grabiec
27d86dfe18 sstables: Enable skipping to cells at data_consume_context level 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
aad943523a sstables: index_reader: Add trace-level logging 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
1dbd2e239e sstables: index_reader: Share index lists among other index readers
Direct motivation for this is to be able to use two index readers from
a single mutation reader, one for lower bound of the range and one for
the upper bound of the range, without sacrificing optimization of
avoiding index reads when forwarding to partition ranges which are
close by. After the change, all index readers of given sstable will
share index buffers, so lower bound reader can reuse the page read by
the upper bound reader.

The reason for using two readers will be so that we are able to skip
inside the partition range, not only outside of it. This is not
possible if we use the same index reader to locate the upper bound of
the range, because we may only advance the cursor.
2017-03-28 18:10:39 +02:00
Tomasz Grabiec
e36979da47 sstables: index_reader: Use sstable's schema
Makes for a simpler interface.
2017-03-28 18:10:39 +02:00
Tomasz Grabiec
e3e2f037bb sstables: index_reader: Refactor around the concept of a cursor
The index reader can already be queried only with monotonic positions, so
the concept of a cursor is ingrained. Making it explicit will make it easier
to define behavior for forwarding within the partition.

After the change:

 - lower_bound() is renamed to advance_to() and doesn't return
   the position, only advances the cursor

 - data file position for partition under cursor can be obtained
   at any time with data_file_position()
2017-03-28 18:10:39 +02:00
Tomasz Grabiec
27862fa8f6 sstables: index_reader: Narrow down summary range during lookup
Positions passed to lower_bound() must be non-decreasing, so the summary
indexes they resolve to are non-decreasing as well.
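
A rough sketch of the narrowing, assuming integer keys and a hypothetical
summary_cursor class: because queries never go backwards, each search can
start from the index found by the previous one.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical cursor over sorted summary keys.
class summary_cursor {
    const std::vector<int>& _keys;
    std::size_t _lo = 0;  // lower limit for the next search
public:
    explicit summary_cursor(const std::vector<int>& keys) : _keys(keys) {}

    // Returns the index of the first key >= pos, searching only
    // [_lo, end). Callers must pass non-decreasing positions.
    std::size_t lower_bound(int pos) {
        auto first = _keys.begin() + static_cast<std::ptrdiff_t>(_lo);
        auto it = std::lower_bound(first, _keys.end(), pos);
        _lo = static_cast<std::size_t>(it - _keys.begin());
        return _lo;
    }
};
```

The binary search range shrinks as the cursor moves forward, which is
only valid under the non-decreasing precondition.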
2017-03-28 18:10:39 +02:00
Tomasz Grabiec
02ace99798 sstables: index_reader: Change lookup to work on ring_position_view
In preparation for changing the interface to work not only with ranges.
2017-03-28 18:10:39 +02:00
Tomasz Grabiec
6c75614d19 sstables: Fix input_stream not being closed by index_reader
Fixes #2022
Message-Id: <1484912679-5729-1-git-send-email-tgrabiec@scylladb.com>
2017-01-20 11:58:33 +00:00
Asias He
e5485f3ea6 Get rid of query::partition_range
Use dht::partition_range instead
2016-12-19 08:09:25 +08:00
Gleb Natapov
ae0a2935b4 sstables: fix ad-hoc summary creation
If the sstable Summary is not present, Scylla does not refuse to boot but
instead creates summary information on the fly. There is a bug in this
code though. The Summary file is a map between keys and offsets into the
Index file, but the code creates a map between keys and Data file offsets
instead. Fix it by keeping the offset of an index entry in the index_entry
structure and using it during Summary file creation.
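
The distinction between the two offsets can be sketched as below. The
structs and build_summary() are hypothetical simplifications (fixed
sampling interval, string keys), not the actual Scylla code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified index entry.
struct index_entry {
    std::string key;
    uint64_t data_offset;   // where the partition starts in the Data file
    uint64_t index_offset;  // where this entry itself starts in the Index file
};

struct summary_entry {
    std::string key;
    uint64_t index_offset;
};

// Samples every `interval`-th index entry. The summary records the
// entry's offset in the Index file, not the partition's Data file offset.
std::vector<summary_entry> build_summary(const std::vector<index_entry>& index,
                                         std::size_t interval) {
    assert(interval > 0);
    std::vector<summary_entry> summary;
    for (std::size_t i = 0; i < index.size(); i += interval) {
        summary.push_back({index[i].key, index[i].index_offset});
    }
    return summary;
}
```

Storing data_offset in the summary instead would send index lookups to
unrelated positions in the Index file.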

Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20161116165421.GA22296@scylladb.com>
2016-11-17 11:05:23 +02:00
Paweł Dziepak
20bfa1fa52 sstables: drop sstable::{lower, upper}_bound()
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00