Commit Graph

15976 Commits

Piotr Sarna
27bf20aa3f cql3: enable ALLOW FILTERING
Enables 'ALLOW FILTERING' queries by transferring control
to result_set_builder::filtering_visitor.
Both regular and primary key columns are allowed,
but some things are left unimplemented:
 - multi-column restrictions
 - CONTAINS queries

Fixes #2025
2018-07-05 10:50:43 +02:00
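The mechanism the commit above describes, rechecking each candidate row against the restrictions the storage layer could not serve directly, can be sketched in a few lines. This is an illustrative Python sketch of the idea, not Scylla's C++ filtering_visitor; the names and row representation are made up:

```python
def filtering_visitor(rows, restrictions):
    """Yield only the rows satisfying every (column, predicate) pair."""
    for row in rows:
        if all(predicate(row[column]) for column, predicate in restrictions):
            yield row

rows = [
    {"pk": 1, "v": 10},
    {"pk": 2, "v": 25},
    {"pk": 3, "v": 7},
]
# The rough equivalent of: SELECT * FROM t WHERE v > 9 ALLOW FILTERING
filtered = list(filtering_visitor(rows, [("v", lambda x: x > 9)]))
```

Both regular and primary key columns can participate in the restriction list, which matches the scope the commit enables.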
Piotr Sarna
7b018f6fd6 service: add filtering_pager
For paged results of an 'ALLOW FILTERING' query, a filtering pager
is provided. It's based on a filtering_visitor for result_builder.
2018-07-05 10:50:43 +02:00
Piotr Sarna
a08fba19e3 cql3: optimize filtering partition keys and static rows
If any restriction on the partition key or the static row fails,
it fails for every row that belongs to the partition.
Hence, the full check of the remaining rows is skipped.
2018-07-05 10:50:43 +02:00
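The optimization in the commit above reads naturally as a short-circuit. A minimal sketch (illustrative Python, not the actual C++ code), assuming the per-partition checks are evaluated before any per-row check:

```python
def filter_partition(partition_checks, rows, row_check):
    # If any restriction on the partition key or static row fails, it
    # fails for every row in the partition: skip the per-row checks.
    if not all(partition_checks):
        return []
    return [row for row in rows if row_check(row)]

# A failed partition-level check skips the rows entirely
skipped = filter_partition([True, False], [1, 2, 3], lambda r: True)
# Otherwise rows are filtered individually
kept = filter_partition([True, True], [1, 2, 3], lambda r: r > 1)
```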
Piotr Sarna
2a0b720102 cql3: add filtering visitor
In order to filter the results of an 'ALLOW FILTERING' query,
a visitor that can take an optional filter for result_builder
is provided. It defaults to nop_filter, which accepts
all rows.
2018-07-05 10:50:43 +02:00
Piotr Sarna
1cf5653f89 cql3: move result_set_builder functions to header
Moving the function definitions to the header is a preparation step
before turning result_set_builder into a template.
2018-07-05 10:50:43 +02:00
Piotr Sarna
4d3d32f465 cql3: amend need_filtering()
The previous implementation of need_filtering() was too eager to assume
that an index query should be used, whereas sometimes the query should
just be filtered.
2018-07-05 10:50:39 +02:00
Piotr Sarna
f42eaff75e cql3: add single column primary key restrictions getters
Getters for single column partition/clustering key restrictions
are added to statement_restrictions.
2018-07-04 09:48:32 +02:00
Piotr Sarna
a99acbc376 cql3: expose single column primary key restrictions
Underlying single_column_restrictions are exposed
for single_column_primary_key_restrictions via a const method.
2018-07-04 09:48:32 +02:00
Piotr Sarna
f7a2f15935 cql3: add needs_filtering to primary key restrictions
Primary key restrictions sometimes require filtering. These functions
return true if ALLOW FILTERING needs to be enabled in order to satisfy
these restrictions.
2018-07-04 09:48:32 +02:00
Piotr Sarna
6aec9e711f cql3: add simpler single_column_restriction::is_satisfied_by
Currently restriction::is_satisfied_by() accepts only keys and rows
as arguments. This commit provides a version that takes only the raw
bytes of the data.
This simpler version applies to single_column_restriction only,
because it compares raw bytes underneath anyway. For other restriction
types, the simplified is_satisfied_by is not defined.
2018-07-04 09:48:32 +02:00
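The raw-bytes variant described above can be pictured with a small sketch (Python pseudocode of the idea, not Scylla's C++ API; the operator encoding is illustrative):

```python
def is_satisfied_by(operator, cell_bytes, restriction_bytes):
    """Decide a single-column restriction directly on serialized bytes."""
    if operator == "EQ":
        # Equality on a single column is just a raw byte comparison,
        # so no key/row deserialization is needed.
        return cell_bytes == restriction_bytes
    # Other restriction types still need the full key/row context.
    raise NotImplementedError(operator)
```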
Alexys Jacob
8c03c1e2ce Support Gentoo Linux on node_health_check script.
Gentoo Linux was not supported by the node_health_check script,
which resulted in the following error message being displayed:

"This is a Non-Supported OS, Please Review the Support Matrix"

This patch adds support for Gentoo Linux, along with a TODO note
to add support for authenticated clusters, which the script does
not support yet.

Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20180703124458.3788-1-ultrabug@gentoo.org>
2018-07-03 20:18:13 +03:00
Tomasz Grabiec
2ffb621271 Merge "Fix atomic_cell_or_collection::external_memory_usage()" from Paweł
After the transition to the new in-memory representation in
aab6b0ee27 'Merge "Introduce new in-memory
representation for cells" from Paweł',
atomic_cell_or_collection::external_memory_usage() stopped accounting
for the externally stored data. Since it wasn't covered by the unit
tests, the bug remained unnoticed until now.

This series fixes the memory usage calculation and adds proper unit
tests.

* https://github.com/pdziepak/scylla.git fix-external-memory-usage/v1:
  tests/mutation: properly mark atomic_cells that are collection members
  imr::utils::object: expose size overhead
  data::cell: expose size overhead of external chunks
  atomic_cell: add external chunks and overheads to
    external_memory_usage()
  tests/mutation: test external_memory_usage()
2018-07-03 14:58:10 +02:00
Botond Dénes
c236a96d7d tests/cql_query_test: add unit test for querying empty ranges
A bug was found recently (#3564) in the paging logic, where the code
assumed the queried ranges list is non-empty. This assumption is
incorrect, as there can be valid (if rare) queries that result in the
ranges list being empty. Add a unit test that executes such a query with
paging enabled to detect any future bugs related to assumptions about
the ranges list being non-empty.

Refs: #3564
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <f5ba308c4014c24bb392060a7e72e7521ff021fa.1530618836.git.bdenes@scylladb.com>
2018-07-03 13:43:17 +01:00
Botond Dénes
59a30f0684 query_pager: be prepared for _ranges being empty
do_fetch_page() checks at the beginning whether there is a saved query
state already, meaning this is not the first page. If there is not, it
checks whether the query is for singular partitions or a range scan
to decide whether to enable stateful queries. This check assumed that
there is at least one range in _ranges, which does not hold under some
circumstances. Add a check for _ranges being empty.

Fixes: #3564
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <cbe64473f8013967a93ef7b2104c7ca0507afac9.1530610709.git.bdenes@scylladb.com>
2018-07-03 11:05:01 +01:00
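The shape of the guard described above can be sketched as follows. This is an illustrative Python sketch of the #3564 fix, not the actual do_fetch_page() code; the names and range representation are made up:

```python
def classify_first_page(ranges):
    """Decide how to serve the first page of a query."""
    if not ranges:
        # The #3564 case: a valid query can yield an empty ranges list,
        # so this must be handled before inspecting ranges[0].
        return "empty-page"
    if len(ranges) == 1 and ranges[0]["singular"]:
        return "single-partition"
    return "range-scan"
```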
Avi Kivity
eafd16266d tests: reduce multishard_mutation_test runtime in debug mode
Debug mode is so slow that generating 1000 mutations is too much for it.
High memory use can also confuse the sanitizers that track each allocation.

Reduce mutation count from 1000 to 10 in debug mode.
2018-07-03 12:01:44 +03:00
Avi Kivity
a36b1f1967 Merge "more scylla_setup fixes" from Takuya
"
Added NIC / disk existence checks and a --force-raid mode to
scylla_raid_setup.
"

* 'scylla_setup_fix4' of https://github.com/syuu1228/scylla:
  dist/common/scripts/scylla_raid_setup: verify specified disks are unused
  dist/common/scripts/scylla_raid_setup: add --force-raid to construct RAID even if only one disk is specified
  dist/common/scripts/scylla_setup: don't accept disk path if it's not block device
  dist/common/scripts/scylla_raid_setup: verify specified disk paths are block device
  dist/common/scripts/scylla_sysconfig_setup: verify NIC existence
2018-07-03 11:03:08 +03:00
Takuya ASADA
d0f39ea31d dist/common/scripts/scylla_raid_setup: verify specified disks are unused
Currently only scylla_setup's interactive mode verifies that the selected
disks are unused; in non-interactive mode we get an mdadm/mkfs.xfs program
error and a Python backtrace when the disks are busy.

So we should also verify that disks are unused in scylla_raid_setup, and
print a simpler error message.
2018-07-03 14:50:34 +09:00
Takuya ASADA
3289642223 dist/common/scripts/scylla_raid_setup: add --force-raid to construct RAID even if only one disk is specified
A user may want to start a RAID volume with only one disk, so add an
option to force constructing the RAID even if only one disk is specified.
2018-07-03 14:50:34 +09:00
Takuya ASADA
e0c16c4585 dist/common/scripts/scylla_setup: don't accept disk path if it's not block device
Ignore the input when the specified path is not a block device.
2018-07-03 14:50:34 +09:00
Takuya ASADA
24ca2d85c6 dist/common/scripts/scylla_raid_setup: verify specified disk paths are block devices
Verify that the disk paths are block devices, and exit with an error if not.
2018-07-03 14:50:34 +09:00
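The check the commit above describes amounts to a few lines of Python. This is a sketch of the idea, not the exact scylla_raid_setup code; the function names are made up:

```python
import os
import stat

def is_block_device(path):
    """True if path exists and is a block device (e.g. /dev/sdb)."""
    try:
        mode = os.stat(path).st_mode
    except OSError:
        return False
    return stat.S_ISBLK(mode)

def bad_disks(paths):
    """Return the paths that are NOT block devices, for error reporting."""
    return [p for p in paths if not is_block_device(p)]
```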
Takuya ASADA
99b5cf1f92 dist/common/scripts/scylla_sysconfig_setup: verify NIC existence
Verify NIC existence before writing the sysconfig file, to prevent
errors while running scylla.

See #2442
2018-07-03 14:50:34 +09:00
Takuya ASADA
084c824d12 scripts: merge scylla_install_pkg to scylla-ami
scylla_install_pkg was initially written for the one-liner installer, but
now it is only used for creating the AMI, and it is just a few lines of
code, so it should be merged into the scylla_install_ami script.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180612150106.26573-2-syuu@scylladb.com>
2018-07-02 13:20:09 +03:00
Takuya ASADA
fafcacc31c dist/ami: drop Ubuntu AMI support
Drop the Ubuntu AMI since it has not been maintained for a long time,
and we have no plan to officially provide it.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180612150106.26573-1-syuu@scylladb.com>
2018-07-02 13:20:08 +03:00
Avi Kivity
677991f353 Update scylla-ami submodule
* dist/ami/files/scylla-ami 36e8511...0fd9d23 (2):
  > scylla_install_ami: merge scylla_install_pkg
  > scylla_install_ami: drop Ubuntu AMI
2018-07-02 13:19:34 +03:00
Avi Kivity
0b148d0070 Merge "scylla_setup fixes" from Takuya
"
I found problems in the previously submitted patchsets 'scylla_setup
fixes' and 'more fixes for scylla_setup', so I fixed them and merged them
into one patchset.

Also added a few more patches.
"

* 'scylla_setup_fix3' of https://github.com/syuu1228/scylla:
  dist/common/scripts/scylla_setup: allow inputting multiple disk paths at the RAID disk prompt
  dist/common/scripts/scylla_raid_setup: skip constructing RAID0 when only one disk is specified
  dist/common/scripts/scylla_raid_setup: fix module import
  dist/common/scripts/scylla_setup: check whether a disk is used in MDRAID
  dist/common/scripts/scylla_setup: move unmasking scylla-fstrim.timer to scylla_fstrim_setup
  dist/common/scripts/scylla_setup: use print() instead of logging.error()
  dist/common/scripts/scylla_setup: implement do_verify_package() for Gentoo Linux
  dist/common/scripts/scylla_coredump_setup: run os.remove() when the directory being deleted is a symlink
  dist/common/scripts/scylla_setup: don't include a disk in the unused list when it contains partitions
  dist/common/scripts/scylla_setup: skip the rest of the checks when a disk is detected as used
  dist/common/scripts/scylla_setup: add a disk to the selected list correctly
  dist/common/scripts/scylla_setup: fix wrong indent
  dist/common/scripts: sync the instance type list for detecting NIC type to the latest one
  dist/common/scripts: verify systemd unit existence using 'systemctl cat'
2018-07-02 10:21:49 +03:00
Avi Kivity
a45c3aa8c7 Merge "Fix handling of stale write replies in storage_proxy" from Gleb
"
If a coordinator sends write requests with ID=X and restarts, it may get a
reply to the request after it restarts and sends another request with the
same ID (but to different replicas). This condition triggers an assert in
the coordinator. Drop the assertion in favor of a warning, and initialize
the handler id in a way that makes this situation less likely.

Fixes: #3153
"

* 'gleb/write-handler-id' of github.com:scylladb/seastar-dev:
  storage_proxy: initialize write response id counter from wall clock value
  storage_proxy: drop virtual from signal(gms::inet_address)
  storage_proxy: do not assert on getting an unexpected write reply
2018-07-01 17:59:54 +03:00
Gleb Natapov
19e7493d5b storage_proxy: initialize write response id counter from wall clock value
Initializing the write response id to the same value on each reboot may
cause a stale id to be taken for an active one if the node restarts after
sending only a couple of write requests and before receiving the replies.
On the next reboot it will start assigning ids from the same value, and
receiving old replies will confuse it. Mitigate this by assigning the
initial id from the wall clock value in milliseconds. This will not solve
the problem completely, but it will mitigate it.
2018-07-01 17:24:40 +03:00
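The mitigation above can be sketched as follows (illustrative Python, not the storage_proxy C++; the class name is made up):

```python
import time

class WriteResponseIdAllocator:
    """Allocate write-handler ids starting from the wall clock in ms."""

    def __init__(self):
        # After a restart the clock has advanced past any id handed out
        # in the previous boot, so stale replies are unlikely to match a
        # live id. A mitigation, not a full fix (clocks can move back).
        self._next_id = int(time.time() * 1000)

    def allocate(self):
        self._next_id += 1
        return self._next_id

alloc = WriteResponseIdAllocator()
first = alloc.allocate()
second = alloc.allocate()
```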
Nadav Har'El
3194ce16b3 repair: fix combination of "-pr" and "-local" repair options
When nodetool repair is used with the combination of the "-pr" (primary
range) and "-local" (only repair with nodes in the same DC) options,
Scylla needs to define the "primary ranges" differently: Rather than
assign one node in the entire cluster to be the primary owner of every
token, we need one node in each data-center - so that a "-local"
repair will cover all the tokens.

Fixes #3557.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180701132445.21685-1-nyh@scylladb.com>
2018-07-01 16:39:33 +03:00
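The per-DC notion of "primary range" described above can be sketched like this (illustrative Python, not the actual repair code; the ring is simplified to a replica walk order starting at the token):

```python
def primary_replica(ring, datacenter=None):
    """First replica on the ring walk, or first replica in `datacenter`.

    With -pr alone there is one primary per token in the whole cluster;
    with -pr -local each DC needs its own primary, otherwise a local
    repair would skip tokens whose global primary lives in another DC.
    """
    for endpoint, dc in ring:
        if datacenter is None or dc == datacenter:
            return endpoint
    return None

# Ring walk order for some token, as (endpoint, datacenter) pairs
ring = [("n1", "dc1"), ("n2", "dc2"), ("n3", "dc1")]
```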
Gleb Natapov
569437aaa5 storage_proxy: drop virtual from signal(gms::inet_address)
The function is not overridden, so it should not be virtual.
2018-07-01 16:35:59 +03:00
Gleb Natapov
5ee09e5f3b storage_proxy: do not assert on getting an unexpected write reply
In theory we should not get a write reply from a node we did not send a
write to, but in practice a stale reply can be received if a node reboots
between sending the write and getting the reply. Do not assert; log a
warning instead and ignore the reply.

Fixes: #3153
2018-07-01 16:35:09 +03:00
Tomasz Grabiec
b464b66e90 row_cache: Fix memtable reads concurrent with cache update missing writes
Introduced in 5b59df3761.

It is incorrect to erase entries from the memtable being moved to
cache if the partition update can be preempted, because a later memtable
read may create a snapshot in the memtable before the memtable writes for
that partition are made visible through cache. As a result the read
may miss some of the writes which were in the memtable. The code was
checking for the presence of snapshots when entering the partition, but
this condition may change if the update is preempted. The fix is to not
allow erasing if the update is preemptible.

This also caused SIGSEGVs, because we assumed that no such snapshots
would be created and hence were not invalidating iterators on removal
of the entries, which results in undefined behavior when such snapshots
are actually created.

Fixes SIGSEGV in dtest: limits_test.py:TestLimits.max_cells_test

Fixes #3532

Message-Id: <1530129009-13716-1-git-send-email-tgrabiec@scylladb.com>
2018-07-01 15:36:05 +03:00
Avi Kivity
f3da043230 Merge "Make in-memory partition version merging preemptable" from Tomasz
"
Partition snapshots go away when the last read using the snapshot is done.
Currently we will synchronously attempt to merge partition versions on this event.
If partitions are large, that may stall the reactor for a significant amount of time,
depending on the size of newer versions. Cache update on memtable flush can
create especially large versions.

The solution implemented in this series is to allow merging to be preemptable,
and continue in the background. Background merging is done by the mutation_cleaner
associated with the container (memtable, cache). There is a single merging process
per mutation_cleaner. The merging worker runs in a separate scheduling group,
introduced here, called "mem_compaction".

When the last user of a snapshot goes away, the snapshot is slid to the
oldest unreferenced version first, so that the version is no longer
reachable from partition_entry::read(). The cleaner will then keep merging
preceding (newer) versions into it, until it merges a version which is
referenced. The merging is preemptable. If the initial merging is
preempted, the snapshot is enqueued into the cleaner, the worker is woken
up, and merging continues asynchronously.

When memtable is merged with cache, its cleaner is merged with cache cleaner,
so any outstanding background merges will be continued by the cache cleaner
without disruption.

This reduces scheduling latency spikes in tests/perf_row_cache_update
for the case of large partition with many rows. For -c1 -m1G I saw
them dropping from >23ms to 1-2ms. System-level benchmark using scylla-bench
shows a similar improvement.
"

* tag 'tgrabiec/merge-snapshots-gradually-v4' of github.com:tgrabiec/scylla:
  tests: perf_row_cache_update: Test with an active reader surviving memtable flush
  memtable, cache: Run mutation_cleaner worker in its own scheduling group
  mutation_cleaner: Make merge() redirect old instance to the new one
  mvcc: Use RAII to ensure that partition versions are merged
  mvcc: Merge partition version versions gradually in the background
  mutation_partition: Make merging preemptable
  tests: mvcc: Use the standard maybe_merge_versions() to merge snapshots
2018-07-01 15:32:51 +03:00
Botond Dénes
5fd9c3b9d4 tests/mutation_reader_test: require min shard-count for multishard tests
Tests testing different aspects of `foreign_reader` and
`multishard_combining_reader` are designed to run with a certain minimum
shard count. Running them with any shard count below this minimum makes
them useless at best and can even fail them.
Refuse to run these tests when the shard count is below the required
minimum to avoid an accidental and unnecessary investigation into a
false-positive test failure.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <d24159415b6a9d74eafb8355b6e3fba98c1ff7ff.1530274392.git.bdenes@scylladb.com>
2018-07-01 12:44:41 +03:00
Avi Kivity
f73340e6f8 Merge "Index reader and associated types clean-up." from Vladimir
"
This patchset paves the way to support for reading SSTables 3.x index files.
It aims at streamlining and tidying up the existing index_reader and
helpers and brings no functional or high-level changes.

In v3:
  - do not capture 'found' and just return 'true' in the continuation
    inside advance_and_check_if_present()
  - split code that makes the use of advance_upper_past() internal-only
    into two commits for better readability

GitHub URL: https://github.com/argenet/scylla/tree/projects/sstables-30/index_reader_cleanup/v3

Tests: unit {release}

Performance tests (perf_fast_forward) did not reveal any noticeable
changes. The complete output is below.

========================================
Original code (before the patchset)
========================================
running: large-partition-skips
Testing scanning large partition with skips.
Reads whole range interleaving reads with skips according to read-skip pattern:
read    skip      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
1       0         0.336514   1000000    2971642   1000     126956      35       0        0        0        0        0        0        0  99.5%
1       1         1.411239    500000     354299    993     127056       2       0        0        1        1        0        0        0  99.9%
1       8         0.464468    111112     239224    993     127056       2       0        0        1        1        0        0        0  99.8%
1       16        0.330490     58824     177990    993     127056      12       0        0        1        1        0        0        0  99.7%
1       32        0.257010     30304     117910    993     127056      15       0        0        1        1        0        0        0  99.7%
1       64        0.213650     15385      72010    997     127072     268       0        0        3        3        0        0        0  99.5%
1       256       0.159498      3892      24402    993     127056     245       0        0        1        1        0        0        0  95.5%
1       1024      0.088678       976      11006    993     127056     347       0        0        1        1        0        0        0  63.4%
1       4096      0.082627       245       2965    649      22452     389     252        0        1        1        0        0        0  20.0%
64      1         0.411080    984616    2395191   1059     127056      57       1        0        1        1        0        0        0  99.1%
64      8         0.390130    888896    2278461    993     127056       2       0        0        1        1        0        0        0  99.8%
64      16        0.369033    800000    2167828    993     127056       3       0        0        1        1        0        0        0  99.8%
64      32        0.338126    666688    1971714    993     127056      10       0        0        1        1        0        0        0  99.7%
64      64        0.297335    500032    1681711    997     127072      18       0        0        3        3        0        0        0  99.7%
64      256       0.199420    200000    1002910    993     127056     211       0        0        1        1        0        0        0  99.5%
64      1024      0.113953     58880     516704    993     127056     284       0        0        1        1        0        0        0  64.1%
64      4096      0.094596     15424     163051    687      23684     415     248        0        1        1        0        0        0  23.7%

running: large-partition-slicing
Testing slicing of large partition:
offset  read      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
0       1         0.000586         1       1706      3        164       2       1        0        1        1        0        0        0   9.0%
0       32        0.000587        32      54539      3        164       2       1        0        1        1        0        0        0   9.9%
0       256       0.000688       256     372343      4        196       2       1        0        1        1        0        0        0  20.7%
0       4096      0.004320      4096     948185     19        676      10       1        0        1        1        0        0        0  36.7%
500000  1         0.000882         1       1134      5        228       3       2        0        1        1        0        0        0  14.3%
500000  32        0.000881        32      36321      5        228       3       2        0        1        1        0        0        0  14.3%
500000  256       0.000961       256     266386      6        260       3       2        0        1        1        0        0        0  21.9%
500000  4096      0.003127      4096    1309805     21        740      14       2        0        1        1        0        0        0  54.0%

running: large-partition-slicing-clustering-keys
Testing slicing of large partition using clustering keys:
offset  read      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
0       1         0.000639         1       1564      3        164       2       0        0        1        1        0        0        0  13.9%
0       32        0.000626        32      51154      3        164       2       0        0        1        1        0        0        0  15.3%
0       256       0.000716       256     357560      4        168       2       0        0        1        1        0        0        0  23.1%
0       4096      0.003681      4096    1112743     16        680       8       1        0        1        1        0        0        0  38.5%
500000  1         0.000966         1       1035      4        424       3       2        0        1        1        0        0        0  12.4%
500000  32        0.000911        32      35121      5        296       3       1        0        1        1        0        0        0  13.1%
500000  256       0.000978       256     261645      5        296       3       1        0        1        1        0        0        0  19.1%
500000  4096      0.003155      4096    1298139     11        744       6       1        0        1        1        0        0        0  44.5%

running: large-partition-slicing-single-key-reader
Testing slicing of large partition, single-partition reader:
offset  read      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
0       1         0.000756         1       1323      4        484       2       0        0        1        1        0        0        0  11.3%
0       32        0.000625        32      51174      3        164       2       0        0        1        1        0        0        0  15.5%
0       256       0.000705       256     363337      4        196       2       0        0        1        1        0        0        0  24.3%
0       4096      0.003603      4096    1136829     16        900       8       1        0        1        1        0        0        0  44.4%
500000  1         0.000880         1       1136      5        228       3       3        0        1        1        0        0        0  12.6%
500000  32        0.000882        32      36268      5        228       3       1        0        1        1        0        0        0  14.0%
500000  256       0.000965       256     265178      6        260       3       1        0        1        1        0        0        0  20.8%
500000  4096      0.003098      4096    1322024     21        740      14       2        0        1        1        0        0        0  54.6%

running: large-partition-select-few-rows
Testing selecting few rows from a large partition:
stride  rows      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
1000000 1         0.000631         1       1585      3        164       2       2        0        1        1        0        0        0  15.2%
500000  2         0.000873         2       2291      5        228       3       2        0        1        1        0        0        0  13.2%
250000  4         0.001404         4       2850      9        356       5       4        0        1        1        0        0        0  11.9%
125000  8         0.002878         8       2779     21        740      13       8        0        1        1        0        0        0  15.5%
62500   16        0.005184        16       3087     41       1380      25      16        0        1        1        0        0        0  19.3%
2       500000    0.948899    500000     526926   1040     127056      39       0        0        1        1        0        0        0  99.9%

running: large-partition-forwarding
Testing forwarding with clustering restriction in a large partition:
pk-scan   time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
yes       0.001813         2       1103     11       1380       3       8        0        1        1        0        0        0  18.5%
no        0.000922         2       2170      5        228       3       1        0        1        1        0        0        0  14.1%

running: small-partition-skips
Testing scanning small partitions with skips.
Reads whole range interleaving reads with skips according to read-skip pattern:
   read    skip      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
-> 1       0         1.023396   1000000     977139   1104     139668      12       0        0        2        2        0        0        0  99.7%
-> 1       1         2.176794    500000     229696   6200     177660    5109       0        0     5108     7679        0        0        0  69.9%
-> 1       8         1.130179    111112      98314   6200     177660    5109       0        0     5108     9647        0        0        0  41.5%
-> 1       16        0.972022     58824      60517   6200     177660    5109       0        0     5108     9913        0        0        0  32.0%
-> 1       32        0.880783     30304      34406   6201     177664    5110       0        0     5108    10057        0        0        0  25.2%
-> 1       64        0.829019     15385      18558   6199     177656    5108       0        0     5107    10135        0        0        0  20.4%
-> 1       256       2.248487      3892       1731   5028     168948    3937       0        0     3936     7801        0        0        0   4.6%
-> 1       1024      0.342806       976       2847   2076     146948     985     105        0      984     1955        0        0        0   9.3%
-> 1       4096      0.088605       245       2765    739      18152     492     246        0      247      490        0        0        0  11.1%
-> 64      1         1.796715    984616     548009   6274     177660    5120       0        0     5108     5187        0        0        0  63.1%
-> 64      8         1.688994    888896     526287   6200     177660    5109       0        0     5108     5674        0        0        0  61.2%
-> 64      16        1.593196    800000     502135   6200     177660    5109       0        0     5108     6143        0        0        0  58.7%
-> 64      32        1.438651    666688     463412   6200     177660    5109       0        0     5108     6807        0        0        0  56.5%
-> 64      64        1.290205    500032     387560   6200     177660    5109       0        0     5108     7660        0        0        0  49.2%
-> 64      256       2.136466    200000      93613   5252     170616    4161       0        0     4160     6267        0        0        0  13.8%
-> 64      1024      0.388871     58880     151413   2317     148784    1226     107        0     1225     1844        0        0        0  23.4%
-> 64      4096      0.107253     15424     143809    807      19100     562     244        0      321      482        0        0        0  24.2%

running: small-partition-slicing
Testing slicing small partitions:
offset  read      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
0       1         0.002773         1        361      3         68       2       0        0        1        1        0        0        0  10.5%
0       32        0.002905        32      11015      3         68       2       0        0        1        1        0        0        0  11.6%
0       256       0.003170       256      80764      4        104       2       0        0        1        1        0        0        0  17.8%
0       4096      0.008125      4096     504095     20        616      11       1        0        1        1        0        0        0  54.1%
500000  1         0.002914         1        343      3         72       2       0        0        1        2        0        0        0  10.7%
500000  32        0.002967        32      10786      3         72       2       0        0        1        2        0        0        0  12.6%
500000  256       0.003338       256      76685      5        112       3       0        0        2        2        0        0        0  17.4%
500000  4096      0.008495      4096     482141     21        624      12       1        0        2        2        0        0        0  52.3%

========================================
With the patchset
========================================

running: large-partition-skips
Testing scanning large partition with skips.
Reads whole range interleaving reads with skips according to read-skip pattern:
read    skip      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
1       0         0.340110   1000000    2940229   1000     126956      42       0        0        0        0        0        0        0  97.5%
1       1         1.401352    500000     356798    993     127056       2       0        0        1        1        0        0        0  99.9%
1       8         0.463124    111112     239918    993     127056       2       0        0        1        1        0        0        0  99.8%
1       16        0.330050     58824     178228    993     127056      11       0        0        1        1        0        0        0  99.7%
1       32        0.255981     30304     118384    993     127056       8       0        0        1        1        0        0        0  99.7%
1       64        0.215160     15385      71505    997     127072     263       0        0        3        3        0        0        0  99.4%
1       256       0.159702      3892      24370    993     127056     239       0        0        1        1        0        0        0  95.6%
1       1024      0.094403       976      10339    993     127056     298       0        0        1        1        0        0        0  58.9%
1       4096      0.082501       245       2970    649      22452     391     252        0        1        1        0        0        0  20.1%
64      1         0.415227    984616    2371272   1059     127056      52       1        0        1        1        0        0        0  99.3%
64      8         0.391556    888896    2270166    993     127056       2       0        0        1        1        0        0        0  99.8%
64      16        0.372075    800000    2150102    993     127056       4       0        0        1        1        0        0        0  99.7%
64      32        0.337454    666688    1975641    993     127056      15       0        0        1        1        0        0        0  99.7%
64      64        0.296345    500032    1687333    997     127072      21       0        0        3        3        0        0        0  99.7%
64      256       0.199221    200000    1003911    993     127056     204       0        0        1        1        0        0        0  99.4%
64      1024      0.118224     58880     498037    993     127056     275       0        0        1        1        0        0        0  61.8%
64      4096      0.095098     15424     162191    687      23684     417     248        0        1        1        0        0        0  23.7%

running: large-partition-slicing
Testing slicing of large partition:
offset  read      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
0       1         0.000585         1       1709      3        164       2       1        0        1        1        0        0        0  10.7%
0       32        0.000589        32      54353      3        164       2       1        0        1        1        0        0        0  10.0%
0       256       0.000688       256     372293      4        196       2       1        0        1        1        0        0        0  20.7%
0       4096      0.004336      4096     944562     19        676      10       1        0        1        1        0        0        0  36.9%
500000  1         0.000877         1       1140      5        228       3       2        0        1        1        0        0        0  13.6%
500000  32        0.000883        32      36222      5        228       3       2        0        1        1        0        0        0  14.4%
500000  256       0.000963       256     265804      6        260       3       2        0        1        1        0        0        0  22.0%
500000  4096      0.003008      4096    1361779     21        740      17       2        0        1        1        0        0        0  56.7%

running: large-partition-slicing-clustering-keys
Testing slicing of large partition using clustering keys:
offset  read      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
0       1         0.000623         1       1604      3        164       2       0        0        1        1        0        0        0  13.9%
0       32        0.000624        32      51261      3        164       2       0        0        1        1        0        0        0  14.7%
0       256       0.000714       256     358484      4        168       2       0        0        1        1        0        0        0  22.6%
0       4096      0.003687      4096    1110990     16        680       8       1        0        1        1        0        0        0  38.6%
500000  1         0.000973         1       1028      4        424       3       2        0        1        1        0        0        0  12.1%
500000  32        0.000914        32      35022      5        296       3       1        0        1        1        0        0        0  12.8%
500000  256       0.000986       256     259646      5        296       3       1        0        1        1        0        0        0  19.7%
500000  4096      0.003155      4096    1298122     11        744       6       1        0        1        1        0        0        0  44.5%

running: large-partition-slicing-single-key-reader
Testing slicing of large partition, single-partition reader:
offset  read      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
0       1         0.000766         1       1305      4        484       2       0        0        1        1        0        0        0  12.2%
0       32        0.000626        32      51111      3        164       2       0        0        1        1        0        0        0  15.2%
0       256       0.000710       256     360563      4        196       2       0        0        1        1        0        0        0  25.2%
0       4096      0.003963      4096    1033440     16        900       8       1        0        1        1        0        0        0  40.2%
500000  1         0.000877         1       1141      5        228       3       1        0        1        1        0        0        0  12.7%
500000  32        0.000882        32      36272      5        228       3       1        0        1        1        0        0        0  14.2%
500000  256       0.000959       256     266937      6        260       3       1        0        1        1        0        0        0  21.1%
500000  4096      0.003103      4096    1319992     21        740      14       2        0        1        1        0        0        0  53.9%

running: large-partition-select-few-rows
Testing selecting few rows from a large partition:
stride  rows      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
1000000 1         0.000631         1       1586      3        164       2       2        0        1        1        0        0        0  13.8%
500000  2         0.000872         2       2295      5        228       3       2        0        1        1        0        0        0  13.4%
250000  4         0.001483         4       2698      9        356       5       4        0        1        1        0        0        0  11.2%
125000  8         0.002894         8       2764     21        740      13       8        0        1        1        0        0        0  15.6%
62500   16        0.005182        16       3087     41       1380      25      16        0        1        1        0        0        0  19.5%
2       500000    0.942943    500000     530255   1040     127056      38       0        0        1        1        0        0        0  99.9%

running: large-partition-forwarding
Testing forwarding with clustering restriction in a large partition:
pk-scan   time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
yes       0.001807         2       1107     11       1380       3       8        0        1        1        0        0        0  18.9%
no        0.000924         2       2165      5        228       3       1        0        1        1        0        0        0  14.1%

running: small-partition-skips
Testing scanning small partitions with skips.
Reads whole range interleaving reads with skips according to read-skip pattern:
   read    skip      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
-> 1       0         1.009953   1000000     990145   1104     139668      11       0        0        2        2        0        0        0  99.7%
-> 1       1         2.213846    500000     225851   6200     177660    5109       0        0     5108     7679        0        0        0  70.3%
-> 1       8         1.150029    111112      96617   6200     177660    5109       0        0     5108     9647        0        0        0  42.3%
-> 1       16        0.989438     58824      59452   6200     177660    5109       0        0     5108     9913        0        0        0  33.2%
-> 1       32        0.891590     30304      33989   6201     177664    5110       0        0     5108    10057        0        0        0  26.4%
-> 1       64        0.840952     15385      18295   6199     177656    5108       0        0     5107    10135        0        0        0  21.6%
-> 1       256       2.247875      3892       1731   5028     168948    3937       0        0     3936     7801        0        0        0   5.0%
-> 1       1024      0.345917       976       2821   2076     146948     985     105        0      984     1955        0        0        0  10.0%
-> 1       4096      0.088806       245       2759    739      18152     492     246        0      247      490        0        0        0  11.6%
-> 64      1         1.821995    984616     540406   6274     177660    5119       0        0     5108     5187        0        0        0  63.9%
-> 64      8         1.715052    888896     518291   6200     177660    5109       0        0     5108     5674        0        0        0  61.9%
-> 64      16        1.620385    800000     493710   6200     177660    5109       0        0     5108     6143        0        0        0  59.4%
-> 64      32        1.464497    666688     455233   6200     177660    5109       0        0     5108     6807        0        0        0  56.9%
-> 64      64        1.311386    500032     381300   6200     177660    5109       0        0     5108     7660        0        0        0  50.0%
-> 64      256       2.153954    200000      92853   5252     170616    4161       0        0     4160     6267        0        0        0  14.3%
-> 64      1024      0.350275     58880     168097   2317     148784    1226     107        0     1225     1844        0        0        0  27.5%
-> 64      4096      0.107498     15424     143482    807      19100     562     244        0      321      482        0        0        0  24.5%

running: small-partition-slicing
Testing slicing small partitions:
offset  read      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
0       1         0.002872         1        348      3         68       2       0        0        1        1        0        0        0  10.2%
0       32        0.002833        32      11297      3         68       2       0        0        1        1        0        0        0  12.1%
0       256       0.003145       256      81404      4        104       2       0        0        1        1        0        0        0  17.9%
0       4096      0.008110      4096     505079     20        616      12       1        0        1        1        0        0        0  54.4%
500000  1         0.002934         1        341      3         72       2       1        0        1        2        0        0        0  10.6%
500000  32        0.002871        32      11145      3         72       2       0        0        1        2        0        0        0  12.0%
500000  256       0.003216       256      79598      5        112       3       0        0        2        2        0        0        0  18.3%
500000  4096      0.008557      4096     478692     21        624      12       1        0        2        2        0        0        0  51.9%
"

* 'projects/sstables-30/index_reader_cleanup/v3' of https://github.com/argenet/scylla:
  sstables: Remove "lower_" from index_reader public methods.
  sstables: Make index_reader::advance_upper_past() method private.
  sstables: Stop using index_reader::advance_upper_past() outside the class.
  sstables: Move promoted_index_block from types.hh to index_entry.hh.
  sstables: Factor out promoted index into a separate class.
  sstables: Use std::optional instead of std::experimental::optional in index_reader.
2018-07-01 12:30:29 +03:00
Botond Dénes
da53ea7a13 tests.py: add --jobs command line parameter
Allows setting the number of jobs used for running the tests.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <d58d6393c6271bffc37ab3b5edc37b00ef485d9c.1529433590.git.bdenes@scylladb.com>
2018-07-01 12:26:41 +03:00
Vladimir Krivopalov
b24eb5c11d sstables: Remove "lower_" from index_reader public methods.
The index_reader class public interface has been amended so that the upper
bound cursor is only dealt with alongside advancing the lower bound.
Since the class users can only explicitly operate on the lower bound
cursor (take the data file position, advance to the next partition, etc.),
it no longer makes sense to specify that a method operates on the lower
bound cursor in its name.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-29 11:48:33 -07:00
Vladimir Krivopalov
30109a693b sstables: Make index_reader::advance_upper_past() method private.
No changes to the code; it is only moved around.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-29 11:47:48 -07:00
Vladimir Krivopalov
80d1d5017f sstables: Stop using index_reader::advance_upper_past() outside the class.
The only case when it needs to be called is when an index_reader is
advanced to a specific partition as part of sstable_reader
initialisation.

Instead, we're passing an optional upper_bound parameter that is used to
call advance_upper_past() internally if the partition is found.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-29 11:47:20 -07:00
Duarte Nunes
0db5419ec5 Merge 'Avoid copies when unfreezing frozen_mutation' from Paweł
"
When a frozen mutation gets deserialised, the current implementation copies
its value three times: from the IDL buffer to a bytes object, from the bytes
object to an atomic_cell, and then the atomic_cell is copied again. Moreover,
the value gets linearised, which may cause a large allocation.

All of that is very wasteful. This patch devirtualises and reworks the IDL
reading code so that, when used with partition_builder, the cell value is
copied only once and without linearisation: from the IDL buffer to the
final atomic_cell.

perf_simple_query -c4, medians of 30 results:
        ./perf_before  ./perf_after  diff
 read       310576.54     316273.90  1.8%
 write      359913.15     375579.44  4.4%

microbenchmark, perf_idl:

BEFORE
test                                      iterations      median         mad         min         max
frozen_mutation.freeze_one_small_row         2142435   462.431ns     0.125ns   462.306ns   467.659ns
frozen_mutation.unfreeze_one_small_row       1640949   601.422ns     0.082ns   601.340ns   605.279ns
frozen_mutation.apply_one_small_row          1538969   645.993ns     0.405ns   645.588ns   656.510ns

AFTER
test                                      iterations      median         mad         min         max
frozen_mutation.freeze_one_small_row         2139548   455.525ns     0.631ns   454.894ns   456.707ns
frozen_mutation.unfreeze_one_small_row       1760139   566.157ns     0.003ns   566.153ns   584.339ns
frozen_mutation.apply_one_small_row          1582050   610.951ns     0.060ns   610.891ns   613.044ns

Tests: unit(release)
"

* tag 'avoid-copy-unfreeze/v2' of https://github.com/pdziepak/scylla:
  mutation_partition_view: use column_mapping_entry::is_atomic()
  schema: column_mapping_entry: cache abstract_type::is_atomic()
  schema: column_mapping_entry: reduce logic duplication
  mutation_partition_view: do not linearise or copy cell value
  atomic_cell: allow passing value via ser::buffer_view
  mutation_partition_view: pass cell by value to visitor
  mutation_partition_view: devirtualise accept()
  storage_proxy: use mutation_partition_view::{first, last}_row_key()
  mutation_partition_view: add last_row_key() and first_row_key() getters
2018-06-28 22:55:20 +01:00
Paweł Dziepak
c45e291084 mutation_partition_view: use column_mapping_entry::is_atomic() 2018-06-28 22:16:42 +01:00
Paweł Dziepak
6c54a97320 schema: column_mapping_entry: cache abstract_type::is_atomic()
The IDL deserialisation code calls is_atomic() for each cell. An additional
indirection and a virtual call can be avoided by caching that value in
column_mapping_entry. A very similar optimisation is already done
for column_definitions.
2018-06-28 22:16:42 +01:00
Paweł Dziepak
2bfdc2d781 schema: column_mapping_entry: reduce logic duplication
User-defined constructors make it more likely that a careless
developer will forget to update one of them when adding a new member to
a structure. That risk can be lowered by reducing code
duplication with delegating constructors.
2018-06-28 22:16:42 +01:00
Paweł Dziepak
199f9196e9 mutation_partition_view: do not linearise or copy cell value 2018-06-28 22:11:19 +01:00
Paweł Dziepak
92700c6758 atomic_cell: allow passing value via ser::buffer_view 2018-06-28 22:11:19 +01:00
Paweł Dziepak
bf330a99f0 mutation_partition_view: pass cell by value to visitor
mutation_partition_view needs to create an atomic_cell from
IDL-serialised data. That cell is then passed to the visitor. However,
because the generic mutation_partition_visitor interface was used, the cell
was passed by const reference, which forced the visitor to needlessly
copy it.

This patch takes advantage of the fact that mutation_partition_view is
now devirtualised and adjusts the interfaces of its visitors so that the
cell can be passed without copying.
2018-06-28 22:11:19 +01:00
Paweł Dziepak
569176aad1 mutation_partition_view: devirtualise accept()
There are only two types of visitors used, and only one of them appears
in the hot path. They can be devirtualised without too much effort,
which also enables future custom interface specialisations specific to
mutation_partition_view and its users, not necessarily within the scope of
the more general mutation_partition_visitor.
2018-06-28 22:11:19 +01:00
Paweł Dziepak
6bd71015e7 storage_proxy: use mutation_partition_view::{first, last}_row_key() 2018-06-28 22:11:19 +01:00
Paweł Dziepak
2259eee97c mutation_partition_view: add last_row_key() and first_row_key() getters
Some users (e.g. the reconciliation code) only need to know the clustering
key of the first or the last row in the partition. This was done with a
full visitor visiting every single cell of the partition, which is very
wasteful. This patch adds direct getters for the needed information.
2018-06-28 22:11:19 +01:00
Vladimir Krivopalov
a497edcbda sstables: Move promoted_index_block from types.hh to index_entry.hh.
It is only used by index_reader internally and never exposed, so it
should not be listed among the commonly used types.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-28 12:28:59 -07:00
Vladimir Krivopalov
81fba73e9d sstables: Factor out promoted index into a separate class.
An index entry may or may not have a promoted index. Scoping all the
optional fields under one class avoids lots of separate optional
fields and gives a better representation.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-28 12:28:59 -07:00