Enables 'ALLOW FILTERING' queries by transfering control
to result_set_builder::filtering_visitor.
Both regular and primary key columns are allowed,
but some things are left unimplemented:
- multi-column restrictions
- CONTAINS queries
Fixes#2025
If any restriction on partition key or static row part fails,
it will be so for every row that belongs to a partition.
Hence, full check of the rest of the rows is skipped.
In order to filter results of an 'ALLOW FILTERING' query,
a visitor that can take optional filter for result_builder
is provided. It defaults to nop_filter, which accepts
all rows.
Previous implementation of need_filtering() was too eager to assume
that index query should be used, whereas sometimes a query should
just be filtered.
Primary key restrictions sometimes require filtering. These functions
return true if ALLOW FILTERING needs to be enabled in order to satisfy
these restrictions.
Currently restriction::is_satisfied_by() accepts only keys and rows
as arguments. In this commit, a version that only takes bytes of data
is provided.
This simpler version applies to single_column_restriction only,
because it compares raw bytes underneath anyway. For other restriction
types, simplified is_satisfied_by is not defined.
Gentoo Linux was not supported by the node_health_check script
which resulted in the following error message displayed:
"This s a Non-Supported OS, Please Review the Support Matrix"
This patch adds support for Gentoo Linux while adding a TODO note
to add support for authenticated clusters which the script does
not support yet.
Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20180703124458.3788-1-ultrabug@gentoo.org>
After the transition to the new in-memory representation in
aab6b0ee27 'Merge "Introduce new in-memory
representation for cells" from Paweł'
atomic_cell_or_collection::external_memory_usage() stopped accounting
for the externally stored data. Since, it wasn't covered by the unit
tests the bug remained unnotices until now.
This series fixes the memory usage calculation and adds proper unit
tests.
* https://github.com/pdziepak/scylla.git fix-external-memory-usage/v1:
tests/mutation: properly mark atomic_cells that are collection members
imr::utils::object: expose size overhead
data::cell: expose size overhead of external chunks
atomic_cell: add external chunks and overheads to
external_memory_usage()
tests/mutation: test external_memory_usage()
A bug was found recently (#3564) in the paging logic, where the code
assumed the queried ranges list is non-empty. This assumption is
incorrect as there can be valid (if rare) queries that can result in the
ranges list to be empty. Add a unit test that executes such a query with
paging enabled to detect any future bugs related to assumptions about
the ranges list being non-empty.
Refs: #3564
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <f5ba308c4014c24bb392060a7e72e7521ff021fa.1530618836.git.bdenes@scylladb.com>
do_fetch_page() checks in the beginning whether there is a saved query
state already, meaning this is not the first page. If there is not it
checks whether the query is for a singulular partitions or a range scan
to decide whether to enable the stateful queries or not. This check
assumed that there is at least one range in _ranges which will not hold
under some circumstances. Add a check for _ranges being empty.
Fixes: #3564
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <cbe64473f8013967a93ef7b2104c7ca0507afac9.1530610709.git.bdenes@scylladb.com>
Debug mode is so slow that generating 1000 mutations is too much for it.
High memory use can also confuse the santitizers that track each allocation.
Reduce mutation count from 1000 to 10 in debug mode.
"
Added NIC / Disk existance check, --force-raid mode on
scylla_raid_setup.
"
* 'scylla_setup_fix4' of https://github.com/syuu1228/scylla:
dist/common/scripts/scylla_raid_setup: verify specified disks are unused
dist/common/scripts/scylla_raid_setup: add --force-raid to construct raid even only one disk is specified
dist/common/scripts/scylla_setup: don't accept disk path if it's not block device
dist/common/scripts/scylla_raid_setup: verify specified disk paths are block device
dist/common/scripts/scylla_sysconfig_setup: verify NIC existance
Currently only scylla_setup interactive mode verifies selected disks are
unused, on non-interactive mode we get mdadm/mkfs.xfs program error and
python backtrace when disks are busy.
So we should verify disks are unused also on scylla_raid_setup, print
out simpler error message.
scylla_install_pkg is initially written for one-liner-installer, but now
it only used for creating AMI, and it just few lines of code, so it should be
merge into scylla_install_ami script.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180612150106.26573-2-syuu@scylladb.com>
"
I found problems on previously submmited patchset 'scylla_setup fixes'
and 'more fixes for scylla_setup', so fixed them and merged into one
patchset.
Also added few more patches.
"
* 'scylla_setup_fix3' of https://github.com/syuu1228/scylla:
dist/common/scripts/scylla_setup: allow input multiple disk paths on RAID disk prompt
dist/common/scripts/scylla_raid_setup: skip constructing RAID0 when only one disk specified
dist/common/scripts/scylla_raid_setup: fix module import
dist/common/scripts/scylla_setup: check disk is used in MDRAID
dist/common/scripts/scylla_setup: move unmasking scylla-fstrim.timer on scylla_fstrim_setup
dist/common/scripts/scylla_setup: use print() instead of logging.error()
dist/common/scripts/scylla_setup: implement do_verify_package() for Gentoo Linux
dist/common/scripts/scylla_coredump_setup: run os.remove() when deleting directory is symlink
dist/common/scripts/scylla_setup: don't include the disk on unused list when it contains partitions
dist/common/scripts/scylla_setup: skip running rest of the check when the disk detected as used
dist/common/scripts/scylla_setup: add a disk to selected list correctly
dist/common/scripts/scylla_setup: fix wrong indent
dist/common/scripts: sync instance type list for detect NIC type to latest one
dist/common/scripts: verify systemd unit existance using 'systemctl cat'
"
If a coordinator sends write requests with ID=X and restarts it may get a reply to
the request after it restarts and sends another request with the same ID (but to
different replicas). This condition will trigger an assert in a coordinator. Drop
the assertion in favor of a warning and initialize handler id in a way to make
this situation less likely.
Fixes: #3153
"
* 'gleb/write-handler-id' of github.com:scylladb/seastar-dev:
storage_proxy: initialize write response id counter from wall clock value
storage_proxy: drop virtual from signal(gms::inet_address)
storage_proxy: do not assert on getting an unexpected write reply
Initializing write response id to the same value on each reboot may
cause stale id to be taken for active one if node restarts after
sending only a couple of write request and before receiving replies.
On next reboot it will start assigning id's from the same value and
receiving old replies will confuse it. Mitigate this by assigning
initial id to wall clock value in milliseconds. It will not solve the
problem completely, but will mitigate it.
When nodetool repair is used with the combination of the "-pr" (primary
range) and "-local" (only repair with nodes in the same DC) options,
Scylla needs to define the "primary ranges" differently: Rather than
assign one node in the entire cluster to be the primary owner of every
token, we need one node in each data-center - so that a "-local"
repair will cover all the tokens.
Fixes#3557.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180701132445.21685-1-nyh@scylladb.com>
In theory we should not get write reply from a node we did not send
write to, but in practice stale reply can be received if node reboot
between sending write and getting a reply. Do not assert, but log the
warning instead and ignore the reply.
Fixes: #3153
Introduced in 5b59df3761.
It is incorrect to erase entries from the memtable being moved to
cache if partition update can be preempted because a later memtable
read may create a snapshot in the memtable before memtable writes for
that partition are made visible through cache. As a result the read
may miss some of the writes which were in the memtable. The code was
checking for presence of snapshots when entering the partition, but
this condition may change if update is preempted. The fix is to not
allow erasing if update is preemptible.
This also caused SIGSEGVs because we were assuming that no such
snapshots will be created and hence were not invalidating iterators on
removal of the entries, which results in undefined behavior when such
snapshots are actually created.
Fixes SIGSEGV in dtest: limits_test.py:TestLimits.max_cells_test
Fixes#3532
Message-Id: <1530129009-13716-1-git-send-email-tgrabiec@scylladb.com>
"
Partition snapshots go away when the last read using the snapshot is done.
Currently we will synchronously attempt to merge partition versions on this event.
If partitions are large, that may stall the reactor for a significant amount of time,
depending on the size of newer versions. Cache update on memtable flush can
create especially large versions.
The solution implemented in this series is to allow merging to be preemptable,
and continue in the background. Background merging is done by the mutation_cleaner
associated with the container (memtable, cache). There is a single merging process
per mutation_cleaner. The merging worker runs in a separate scheduling group,
introduced here, called "mem_compaction".
When the last user of a snapshot goes away the snapshot is slided to the
oldest unreferenced version first so that the version is no longer reachable
from partition_entry::read(). The cleaner will then keep merging preceding
(newer) versions into it, until it merges a version which is referenced. The
merging is preemtable. If the initial merging is preempted, the snapshot is
enqueued into the cleaner, the worker woken up, and merging will continue
asynchronously.
When memtable is merged with cache, its cleaner is merged with cache cleaner,
so any outstanding background merges will be continued by the cache cleaner
without disruption.
This reduces scheduling latency spikes in tests/perf_row_cache_update
for the case of large partition with many rows. For -c1 -m1G I saw
them dropping from >23ms to 1-2ms. System-level benchmark using scylla-bench
shows a similar improvement.
"
* tag 'tgrabiec/merge-snapshots-gradually-v4' of github.com:tgrabiec/scylla:
tests: perf_row_cache_update: Test with an active reader surviving memtable flush
memtable, cache: Run mutation_cleaner worker in its own scheduling group
mutation_cleaner: Make merge() redirect old instance to the new one
mvcc: Use RAII to ensure that partition versions are merged
mvcc: Merge partition version versions gradually in the background
mutation_partition: Make merging preemtable
tests: mvcc: Use the standard maybe_merge_versions() to merge snapshots
Tests testing different aspects of `foreign_reader` and
`multishard_combining_reader` are designed to run with a certain minimum
shard count. Running them with any shard count below this minimum makes
them useless at best but can even fail them.
Refuse to run these tests when the shard count is below the required
minimum to avoid an accidental and unnecessary investigation into a
false-positive test failure.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <d24159415b6a9d74eafb8355b6e3fba98c1ff7ff.1530274392.git.bdenes@scylladb.com>
The index_reader class public interface has been amended to only deal
with the upper bound cursor along with advancing the lower bound.
Since the class users can only explicitly operate with the lower bound
cursor (take data file position, advance to the next partition, etc), it
no longer makes sense to specify that the method operates on the lower
bound cursor in its name.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
The only case when it needs to be called is when an index_reader is
advanced to a specific partition as part of sstable_reader
initialisation.
Instead, we're passing an optional upper_bound parameter that is used to
call advance_upper_past() internally if partition is found.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
"
When frozen mutation gets deserialised current implementation copies
its value 3 times: from IDL buffer to bytes object, from bytes object to
atomic_cell and then atomic_cell is copied again. Moreover, the value
gets linearised which may cause a large allocation.
All of that is very wasteful. This patch devirtualises and reworks IDL
reading code so that when used with partition_builder the cell value is
copied only once and without linearisation: from the IDL buffer to the
final atomic_cell.
perf_simple_query -c4, medians of 30 results:
./perf_before ./perf_after diff
read 310576.54 316273.90 1.8%
write 359913.15 375579.44 4.4%
microbenchmark, perf_idl:
BEFORE
test iterations median mad min max
frozen_mutation.freeze_one_small_row 2142435 462.431ns 0.125ns 462.306ns 467.659ns
frozen_mutation.unfreeze_one_small_row 1640949 601.422ns 0.082ns 601.340ns 605.279ns
frozen_mutation.apply_one_small_row 1538969 645.993ns 0.405ns 645.588ns 656.510ns
AFTER
test iterations median mad min max
frozen_mutation.freeze_one_small_row 2139548 455.525ns 0.631ns 454.894ns 456.707ns
frozen_mutation.unfreeze_one_small_row 1760139 566.157ns 0.003ns 566.153ns 584.339ns
frozen_mutation.apply_one_small_row 1582050 610.951ns 0.060ns 610.891ns 613.044ns
Tests: unit(release)
"
* tag 'avoid-copy-unfreeze/v2' of https://github.com/pdziepak/scylla:
mutation_partition_view: use column_mapping_entry::is_atomic()
schema: column_mapping_entry: cache abstract_type::is_atomic()
schema: column_mapping_entry: reduce logic duplication
mutation_partition_view: do not linearise or copy cell value
atomic_cell: allow passing value via ser::buffer_view
mutation_partition_view: pass cell by value to visitor
mutation_partition_view: devirtualise accept()
storage_proxy: use mutation_partition_view::{first, last}_row_key()
mutation_partition_view: add last_row_key() and first_row_key() getters
IDL deserialisation code calls is_atomic() for each cell. An additional
indirection and a virtual call can be avoided by caching that value in
column_mapping_entry. There is already very similar optimisation done
for column_definitions.
User-defined constructors often make it more likely that a careless
developer will forget to update one of them when adding a new member to
a structure. The risk of that happening can be reduced by reducing code
duplication with delegating constructors.
mutation_partition_view needs to create an atomic_cell from
IDL-serialised data. Then that cell is passed to the visitor. However,
because generic mutation_partition_visitor interface was used, the cell
was passed by constant reference which forced the visitor to needlessly
copy it.
This patch takes advantage of the fact that mutation_partition_view is
devirtualised now and adjust the interfaces of its visitors so that the
cell can be passed without copying.
There are only two types of visitors used and only one of them appears
in the hot path. They can be devirtualised without too much effort,
which also enables future custom interface specialisations specific to
mutation_partition_views and its users, not necessairly in the scope of
more general mutation_partition_visitor.
Some users (e.g. reconciliation code) need only to know the clustering
key of the first or the last row in the partition. This was done with a
full visitor visiting every single cell of the partition, which is very
wasteful. This patch adds direct getters for the needed information.
It is only being used by index_reader internally and never exposed so
should not be listed in commonly used types.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
An index entry may or may not have a promoted index. All the optional
fields are better scoped under the same class to avoid lots of separate
optional fields and give better representation.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>